[Pdns-users] pdns-recursor 4.4: host unknown after some time with no clear reason

Jan Huijsmans bofh at koffie.nu
Wed Jun 1 09:10:57 UTC 2022


Hello,

We have a strange problem in one of our airgapped environments, while
the same setup runs in other environments without the issue. After some
time (varying from seconds to hours), the recursor refuses to give any
answer other than host unknown (SERVFAIL, if I remember correctly).

Situation:

Airgapped environment with 2 DNS servers, each with:
* recursor listening on the internal interface
* authoritative server listening on the external interface
* DNS lookups through the recursor via an external simulated root server
  to the designated authoritative servers

The problem exists within 1 environment where the links to the external
authoritative servers for root and other domains are slow (1 Mbit/s or
less) and some zones (including root) have very interesting NS records
(NS hostnames with missing A records). For the root zone this is fixed,
but some others are still messy. After a while, the recursor refuses to
give answers to any query, no matter whether the DNS server that should
answer is configured correctly or not. The only thing that helps in that
situation is a restart of the recursor.

With the log level at max (9), all we see at the moment of the issue is
that the recursor answers from the packet cache, with no attempts to
query externally. The last query in the log is not remarkable either:
it just works (valid query) or doesn't (invalid query for domains
unknown in the environment), with no indication of throttling, timeouts,
missed packets or long response times.

When the problem shows up, dig @<recursor ip> fails. However, the moment
we use the +trace option, the dig command works around the recursor
after the 1st lookup (NS of .) and gets the answer correctly.
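
The comparison above (a recursive query to the recursor versus asking
the upstream servers directly, as dig +trace does) could also be
scripted. The following is only a minimal sketch using the Python 3
standard library: the function names are made up for illustration, it
speaks plain UDP without EDNS, and it does not verify the response ID.

```python
import random
import socket
import struct

# Only the RCODEs that matter for this check; others are shown as numbers.
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qtype=1, recursion_desired=True, qid=None):
    """Build a minimal DNS query packet (one question, class IN, no EDNS)."""
    if qid is None:
        qid = random.randint(0, 0xFFFF)
    flags = 0x0100 if recursion_desired else 0x0000  # RD bit
    header = struct.pack("!HHHHHH", qid, flags, 1, 0, 0, 0)
    # Encode the name as length-prefixed labels; "." becomes a single zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
        if label
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def parse_rcode(packet):
    """Return the RCODE (low 4 bits of the flags field) of a DNS response."""
    if len(packet) < 12:
        raise ValueError("short DNS packet")
    (flags,) = struct.unpack("!H", packet[2:4])
    return flags & 0x000F


def query_rcode(server, name, qtype=1, recursion_desired=True, timeout=3.0):
    """Send one UDP query to server and return the RCODE name or 'TIMEOUT'."""
    packet = build_query(name, qtype, recursion_desired)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (server, 53))
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "TIMEOUT"
    rcode = parse_rcode(reply)
    return RCODES.get(rcode, str(rcode))
```

With placeholder addresses, query_rcode("192.0.2.10", "host.example.")
would ask the recursor, while query_rcode("192.0.2.1", ".", qtype=2,
recursion_desired=False) would ask the simulated root server for its NS
set directly. If the first returns SERVFAIL while the second returns
NOERROR, the upstream is reachable and the failure sits in the recursor.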

We can't seem to reproduce the error in the other environments, can't
get logging that points to the issue (log level 9 is max?) or even
think of a logical reason why this would happen (apart from
throttling). We've set the option dont-throttle-netmasks to 0.0.0.0/0,
which seems to help a lot, but does not solve the problem completely.

I'd try setting non-resolving-ns-max-fails to 0 if we were on 4.5, but
alas we're stuck on 4.4 at the moment (no way to upgrade the airgapped
environment).

We need either a way to keep the recursors querying the NS servers
to get an answer, or a way to prove which server/environment is the
cause of the issue.
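
One way to gather such proof might be to record exactly when the
recursor flips between answering and failing, so that the timestamps can
be matched against the recursor log and the state of the upstream
servers. The sketch below is an assumption-laden illustration in Python
3: watch() and its parameters are hypothetical names, and probe stands
for any callable that performs one test query and returns its result
(for example the RCODE as a string).

```python
import time


def watch(probe, interval=5.0, rounds=None, clock=time.time, sleep=time.sleep):
    """Call probe() repeatedly and record every change in its result.

    Returns a list of (timestamp, old_result, new_result) tuples. The
    first entry always has old_result None (the initial state). With
    rounds=None the loop runs until interrupted.
    """
    transitions = []
    previous = None
    count = 0
    while rounds is None or count < rounds:
        current = probe()
        if current != previous:
            stamp = clock()
            transitions.append((stamp, previous, current))
            # Print a human-readable line that can be compared with the logs.
            when = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(stamp))
            print(when, previous, "->", current)
            previous = current
        count += 1
        # Sleep between probes, but not after the final one.
        if rounds is None or count < rounds:
            sleep(interval)
    return transitions
```

Running one such loop per server (each recursor, the simulated root and
the other authoritative servers) would show which of them changes state
first when the problem appears.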

-- 

Jan Huijsmans              bofh at koffie.nu

... cannot activate /dev/brain, no response from main coffee server