[Pdns-users] Unable to resolve domain when using DO and not AD

Luca Lesinigo luca at lm-net.it
Wed Dec 12 16:59:20 UTC 2018

Hello list!

We recently had reports from our users about difficulties receiving mail from a specific external domain, caused by our systems' inability to resolve the sender's domain through our pdns recursors.
For now I am refraining from disclosing the domain, because I don’t know whether this behavior could reveal a software version or configuration with some kind of known vulnerability.
After some more… “literal” digging, I’ve found the following (X.X.X.X being one of the domain's public authoritative DNS servers; all of them behave the same way):

dig @X.X.X.X -t mx thedomain.tld +tries=1 +time=20 +dnssec +norecurse +noadflag
- does not work, tested from multiple locations around the world, from multiple operating systems
- this is exactly the kind of query that our pdns recursors are sending out
- I’ve increased the timeout just to be sure

dig @X.X.X.X -t mx thedomain.tld +tries=1 +time=1 +dnssec +norecurse +adflag
- always works, tested from multiple locations / operating systems
- a traffic dump consistently shows 20~30 milliseconds between the query packet and the reply packet

dig @X.X.X.X -t mx thedomain.tld +tries=2 +time=1 +dnssec +norecurse +noadflag
- always works, tested from multiple locations / operating systems
- a traffic dump shows that dig gets the answer on the second try
- note that the two queries have the same Transaction ID

dig @X.X.X.X -t mx thedomain.tld +tries=1 +time=1 +dnssec +norecurse +noadflag;
dig @X.X.X.X3 -t mx thedomain.tld +tries=1 +time=1 +dnssec +norecurse +noadflag
- does not work
- a traffic dump shows that neither query gets any answer
- the two queries obviously have different Transaction IDs
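The difference between the +tries=2 test and the two separate dig runs can be sketched as a small UDP client: on each timeout it resends the byte-identical datagram (so the transaction ID never changes) to the same server, which is what the traffic dump showed dig doing. This is a hypothetical illustration, not how pdns-recursor retries; the function name `query_same_id`, the requirement that `server` be an IPv4 address literal, and the 1-second default timeout are assumptions.

```python
import socket
from typing import Optional


def query_same_id(packet: bytes, server: str, port: int = 53,
                  tries: int = 2, timeout: float = 1.0) -> Optional[bytes]:
    """Hypothetical helper: send a raw DNS query over UDP and, on timeout,
    resend the identical datagram (same transaction ID) to the same server.
    Returns the first reply with a matching ID, or None if every try times out.
    `server` is assumed to be an IPv4 address literal."""
    txid = packet[:2]  # the first two bytes of a DNS message are the transaction ID
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)  # applies to each recvfrom() call
        for _ in range(tries):
            # Every attempt sends exactly the same bytes, unlike two separate
            # dig runs, which each pick a fresh random transaction ID.
            sock.sendto(packet, (server, port))
            try:
                while True:
                    reply, addr = sock.recvfrom(65535)
                    # Skip stray datagrams from other hosts or with another ID
                    # (note: each skipped datagram restarts the timeout).
                    if addr[0] == server and reply[:2] == txid:
                        return reply
            except socket.timeout:
                continue  # no matching answer in time: resend the same datagram
    return None
```

With `tries=1` this mirrors the failing single-shot test; with `tries=2` it mirrors the case where the second, identical packet is answered.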

Long story short:
- the remote authoritative servers correctly reply to non-DNSSEC queries and to DNSSEC queries with the AD bit set
- the remote authoritative servers do NOT reply to DNSSEC queries with the AD bit off
- …but they do reply if you resend the same query with the same transaction ID!
(this last one sounds very strange to me, but trust me, I’ve double-, triple- and multiple-checked!)
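To make the difference between the failing and the working queries concrete, here is a minimal Python sketch that builds the raw query by hand: the header per RFC 1035, the AD bit in the header flags, and the DO bit in the EDNS0 OPT pseudo-record per RFC 3225. It assumes a 4096-byte EDNS buffer size and omits any other EDNS options (such as the cookie dig may add), so it is not byte-identical to what dig or pdns-recursor actually send; `build_query` and `encode_name` are hypothetical helper names.

```python
import struct

FLAG_RD = 0x0100   # recursion desired (off for +norecurse)
FLAG_AD = 0x0020   # authentic data bit (+adflag / +noadflag)
EDNS_DO = 0x8000   # DNSSEC OK, in the OPT record's TTL field (+dnssec)
QTYPE_MX, QCLASS_IN, TYPE_OPT = 15, 1, 41


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(txid: int, name: str, ad: bool, do: bool = True,
                rd: bool = False, payload: int = 4096) -> bytes:
    """Build an MX query; payload=4096 is an assumed EDNS buffer size."""
    flags = (FLAG_RD if rd else 0) | (FLAG_AD if ad else 0)
    # ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1 (the OPT record)
    header = struct.pack("!HHHHHH", txid, flags, 1, 0, 0, 1)
    question = encode_name(name) + struct.pack("!HH", QTYPE_MX, QCLASS_IN)
    # OPT pseudo-RR: root name, TYPE=41, CLASS=UDP payload size,
    # TTL = extended RCODE (8 bits) | version (8 bits) | flags (16 bits), RDLEN=0
    opt = b"\x00" + struct.pack("!HHIH", TYPE_OPT, payload, EDNS_DO if do else 0, 0)
    return header + question + opt


if __name__ == "__main__":
    failing = build_query(0x1234, "thedomain.tld", ad=False)  # like +noadflag
    working = build_query(0x1234, "thedomain.tld", ad=True)   # like +adflag
    print([i for i in range(len(failing)) if failing[i] != working[i]])  # [3]
```

The only differing byte is offset 3, the low byte of the header flags (0x00 versus 0x20), which is the AD bit alone; the DO bit is identical in both.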

We are using PowerDNS Recursor 4.1.8 on Linux x64. We can also replicate the same behavior on other test setups running pdns-recursor with a completely default configuration, and it is perfectly reproducible with plain dig.
I’m not an expert in DNS details, but my guess is that pdns is not doing anything wrong and that its queries, reproduced by the “dig” commands above, are perfectly valid.
The same domain results in “All Queries to dns1.domain.tld for domain.tld/A timed out or failed” when tested with the Verisign Labs DNSSEC Analyzer ( https://dnssec-analyzer.verisignlabs.com/ ).
Public DNS services (I tried Cloudflare and Google) do resolve that domain correctly; my guess is that they send queries with different flags and/or have some kind of workaround for this specific defect.

I’d like to ask you guys:
- have any of you observed the same kind of problems out in the wild?
- any idea how to work around the problem in pdns-recursor (short of completely disabling DNSSEC, which of course we are not going to do)? As far as I know it is not possible to configure it to retry the same server twice; it always moves on to the next available one after network-timeout
- any idea how the big public services avoid this problem?

Luca Lesinigo
LM Networks Srl
