[Pdns-users] What does "had X timeouts" on a slave mean and how to debug?

Joshua M. Boniface joshua at boniface.me
Wed Sep 6 22:12:49 UTC 2017


Hello list!

I'm experiencing a problem with my PowerDNS slave machines not 
performing AXFRs when notified by the master of a zone reload, and I've 
been unable to find any satisfactory answers. First, some of the 
pertinent questions:

  * What PowerDNS or dnsdist version you are using
    /PowerDNS 4.0.3/
  * What backends you use (gMySQL, BIND etc)
    /gMySQL/
  * If you are using DNSSEC
    /Yes, the zones in question are signed on the master/
  * Operating system it is running on
    /Debian Stretch (9.1)/
  * Where you got the binary from, i.e. did you compile it yourself
    (which compiler, including version), from your distribution or ports
    tree etc
    /Using version from official Debian Stretch repositories/

The specific problem I'm having can be summed up by the following two 
log lines on the slave:

 > Sep 06 14:31:33 2 slave domains need checking, 0 queued for AXFR
 > Sep 06 14:31:36 Received serial number updates for 0 zones, had 2 
timeouts

I am triggering updates via the DHCP DDNS facility documented here: 
https://doc.powerdns.com/md/authoritative/dnsupdate/#dns-update-how-to-setup-dyndnsrfc2136-with-dhcpd, 
however this problem also occurs occasionally though less frequently 
when I manually update a zone in the MySQL backend and trigger a 
"pdnsutil increase-serial <domain>" followed by a "pdns_control reload 
<domain>". This is occurring with many domains as well, though due to 
the DHCP DDNS I'm focusing on only two.

I would expect zero timeouts, however at least 50% of the time both 
relevant zone transfers time out as shown in the log above. 40% of the 
time one times out, and only 10% of the time (based on about two dozen 
tests) do both zones AXFR as expected on both slave hosts. The timeout 
time is always exactly 3 seconds after the NOTIFY is received, and I've 
not found any setting in the documentation to increase this time.

I did a packet capture on both sides of the transaction (both master and 
slave), and I see the NOTIFY packet sent by the master and acknowledged 
on the slave, however when I see the timeouts I never see the slave send 
any sort of AXFR request as I would expect. The notifies are received 
without error to both hosts every time. Given the lack of relevant 
information in these I have not included them.

There is no firewalling or other network interruptions between the 
master and slave and they are on the same broadcast domain. And as 
mentioned, sometimes at least one domain succeeds and I see the AXFR 
request sent from the slave to the master. This occurs intermittently 
between both of my slave machines, so this does not appear to be network 
related. I also confirmed that there are no errors related to the MySQL 
backend on the master.

Increasing the loglevel to 9+ on both sides does not reveal any further 
information as to the timeout. So ultimately, my question is: *[tl;dr]* 
What exactly does "had X timeouts" mean on the slave (i.e. what is 
timing out), and what can I do to diagnose the cause or prevent it from 
happening?

Here is the full log from the master with loglevel=9 (with the main 
domain changed here to `domain.tld` for privacy purposes; as previously 
mentioned this is occurring for multiple domains anyways) for a sample 
run. In this particular case, my second slave server (10.9.0.66) 
properly responded with AXFR requests for both updated domains, however 
the first slave server (10.9.0.65) did not. This is fairly consistent 
behaviour for me regarding this problem - sometimes the first responds 
with one or both, other times the second responds, but almost invariably 
one zone on one host does not AXFR.

 > Sep 06 14:31:33 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:33 
2 domains for which we are master need notifications
 > Sep 06 14:31:33 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:33 
Queued notification of domain 'i.domain.tld' to 10.9.0.65:53
 > Sep 06 14:31:33 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:33 
Queued notification of domain 'i.domain.tld' to 10.9.0.66:53
 > Sep 06 14:31:33 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:33 
Queued notification of domain '0.9.10.in-addr.arpa' to 10.9.0.65:53
 > Sep 06 14:31:33 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:33 
Queued notification of domain '0.9.10.in-addr.arpa' to 10.9.0.66:53
 > Sep 06 14:31:33 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:33 
gmysql Connection successful. Connected to database 'dns' on 
'db.domain.tld'. [3 repetitions removed]
 > Sep 06 14:31:33 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:33 
Remote 10.9.0.66 wants '0.9.10.in-addr.arpa|SOA', do = 1, bufsize = 
1680: packetcache MISS
 > Sep 06 14:31:33 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:33 
Remote 10.9.0.66 wants 'i.domain.tld|SOA', do = 0, bufsize = 512: 
packetcache MISS
 > Sep 06 14:31:33 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:33 
AXFR of domain '0.9.10.in-addr.arpa' initiated by 10.9.0.66
 > Sep 06 14:31:33 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:33 
AXFR of domain '0.9.10.in-addr.arpa' allowed: client IP 10.9.0.66 is in 
allow-axfr-ips
 > Sep 06 14:31:33 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:33 
AXFR of domain 'i.domain.tld' initiated by 10.9.0.66
 > Sep 06 14:31:33 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:33 
AXFR of domain 'i.domain.tld' allowed: client IP 10.9.0.66 is in 
allow-axfr-ips
 > Sep 06 14:31:33 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:33 
gmysql Connection successful. Connected to database 'dns' on 
'db.domain.tld'. [10 repetitions removed]
 > Sep 06 14:31:33 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:33 
AXFR of domain 'i.domain.tld' to 10.9.0.66 finished
 > Sep 06 14:31:33 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:33 
gmysql Connection successful. Connected to database 'dns' on 
'db.domain.tld'.
 > Sep 06 14:31:33 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:33 
Done signing: 6425.07 sigs/s,
 > Sep 06 14:31:33 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:33 
AXFR of domain '0.9.10.in-addr.arpa' to 10.9.0.66 finished
 > Sep 06 14:31:34 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:34 
Removed from notification list: 'i.domain.tld' to 10.9.0.65:53 (was 
acknowledged)
 > Sep 06 14:31:34 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:34 
Removed from notification list: 'i.domain.tld' to 10.9.0.66:53 (was 
acknowledged)
 > Sep 06 14:31:34 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:34 
Removed from notification list: '0.9.10.in-addr.arpa' to 10.9.0.65:53 
(was acknowledged)
 > Sep 06 14:31:34 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:34 
Removed from notification list: '0.9.10.in-addr.arpa' to 10.9.0.66:53 
(was acknowledged)
 > Sep 06 14:31:36 dns.i.domain.tld pdns_server[30635]: Sep 06 14:31:36 
No master domains need notifications

I appreciate any information or advice anyone can provide, and I am able 
to reproduce this consistently should more testing be required!

Thanks,
Joshua Boniface

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.powerdns.com/pipermail/pdns-users/attachments/20170906/ea69a241/attachment.html>


More information about the Pdns-users mailing list