[Pdns-users] Avoiding NOTIFY/AXFR overload? (was: Zone transfer from supermaster fails...)

Wed May 9 14:04:55 UTC 2007

Hi all,

After building custom Debian packages with some source patches, I have
found the cause of my problems.

Remember that one of my slave nameservers reported:
> Apr 23 13:37:07 callisto pdns[2433]: Received NOTIFY for assen.nl from
> 87.251.57.140 for which we are not authoritative
> Apr 23 13:37:08 callisto pdns[2433]: Error resolving SOA or NS for
> 'assen.nl' at 87.251.57.140

I found out this is caused by the master nameserver getting flooded by
NOTIFY packets which it sent itself...
My domains have three nameservers, all of them behind IPCop firewalls
at several locations and with internal IPs (192.168.*.*).
Obviously, my NS records don't mention the private IP addresses, but the
WAN addresses of the IPCop machines which perform masquerading. Note
that pdns is unaware of this WAN address!

The master nameserver (unaware that it is behind a firewall and
actually handles the DNS traffic on the WAN IP) sends notifications
to all of the nameservers listed for a domain, including itself. While
being flooded by its own NOTIFY packets (which are dropped, of course),
the master nameserver is temporarily unavailable to the other
nameservers, which respond to the NOTIFY packets by querying the master.
This leads to log entries like the following on the slave:

> May  9 15:31:32 callisto pdns[11817]: Error resolving SOA or NS for 'gerrits.com' at 87.251.57.140: Timeout waiting for answer from 87.251.57.140
(using non-standard Debian packages to be able to see the cause of the
problem)
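
To illustrate the self-notification problem, here is a minimal Python
sketch (not pdns code) of how a master could filter its NOTIFY targets
if it were told about its WAN address. The function name, the parameter
names and the internal address 192.168.1.10 are assumptions made up for
this example; the two public addresses are the ones from the logs above.

```python
import ipaddress


def notify_targets(ns_addresses, local_addresses, extra_self_addresses=()):
    """Return the NS addresses that should receive a NOTIFY.

    Addresses that belong to this server are skipped, including any
    manually configured ones (e.g. the WAN IP of a masquerading firewall
    that the server cannot discover by itself).
    """
    own = {ipaddress.ip_address(a) for a in local_addresses}
    own |= {ipaddress.ip_address(a) for a in extra_self_addresses}
    return [a for a in ns_addresses if ipaddress.ip_address(a) not in own]


# Hypothetical setup: the master listens on 192.168.1.10 (assumed) but is
# reachable from outside as 87.251.57.140; 80.89.236.78 is the slave.
targets = notify_targets(
    ns_addresses=["87.251.57.140", "80.89.236.78"],
    local_addresses=["192.168.1.10"],
    extra_self_addresses=["87.251.57.140"],
)
print(targets)  # ['80.89.236.78']
```

Without the extra address, the master's own WAN IP stays in the list,
which is exactly the situation described above.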

After a few seconds, the master nameserver has dealt with those
irrelevant NOTIFY packets and becomes available again. In this short
timespan, the slave nameservers were unable to fetch the first n domains
of which they were notified. Note that the slave nameservers won't retry
the AXFR after such a failure! As a consequence, the slave nameservers
might be missing some domains, which is of course a /very bad/
situation: all nameservers should serve the same domains, and a slave
should retry a failed AXFR of a new domain after a NOTIFY from a
supermaster!
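
The retry behaviour asked for here could look roughly like the following
Python sketch. It is not how pdns works internally; the class name, the
method names and the delay values are assumptions for illustration only.
Failed transfers are remembered and become due again after an
exponentially growing delay.

```python
import heapq


class AxfrRetryQueue:
    """Remembers zones whose transfer failed and when to try them again."""

    def __init__(self, base_delay=30.0, max_delay=3600.0):
        self.base_delay = base_delay   # seconds before the first retry
        self.max_delay = max_delay     # upper bound for the backoff
        self._heap = []                # entries: (due_time, zone, master)
        self._attempts = {}            # zone -> failures seen so far

    def record_failure(self, zone, master, now):
        """Schedule a retry; the delay doubles with every failure."""
        attempts = self._attempts.get(zone, 0) + 1
        self._attempts[zone] = attempts
        delay = min(self.base_delay * 2 ** (attempts - 1), self.max_delay)
        heapq.heappush(self._heap, (now + delay, zone, master))
        return delay

    def record_success(self, zone):
        """Forget the failure count once a transfer has succeeded."""
        self._attempts.pop(zone, None)

    def due(self, now):
        """Pop and return all (zone, master) pairs whose retry time has come."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            _, zone, master = heapq.heappop(self._heap)
            ready.append((zone, master))
        return ready


queue = AxfrRetryQueue()
queue.record_failure("gerrits.com", "87.251.57.140", now=0.0)
print(queue.due(now=10.0))  # nothing due yet
print(queue.due(now=30.0))  # the failed zone is due for a retry
```

The time is passed in explicitly so the sketch is easy to test; a real
daemon would use its own clock and would persist the queue.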

Summarizing: in my opinion, the following issues should be fixed:
 - add a configuration directive that tells pdns more about the network
   it is in. Suggestion: introduce an option "ignore-notify-ips" which
   can be used to list a WAN IP of which pdns isn't aware, and/or
   introduce an option "real-wan-ip" to achieve the same.
 - introduce the ability to configure the maximum number of NOTIFY
   packets sent per minute (or so), to avoid flooding slave nameservers
   with NOTIFY packets and being flooded with AXFR requests from those
   slave nameservers.
 - introduce the ability to configure the maximum number of AXFR
   requests sent to a nameserver at once, to avoid flooding the master
   nameserver with AXFR requests.
 - the slave nameserver should store NOTIFY packets from supermasters,
   to be able to retry failed AXFR requests of new domains. This prevents
   slave nameservers from missing some domains.
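
As a rough idea of what the per-minute NOTIFY limit could mean in
practice, here is a small token-bucket sketch in Python. This is one
common way to rate-limit, not something pdns provides; the class name,
the parameters and the numbers in the example are assumptions.

```python
class TokenBucket:
    """Allows at most `rate_per_minute` events on average, with short bursts."""

    def __init__(self, rate_per_minute, burst):
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = float(burst)        # maximum number of stored tokens
        self.tokens = float(burst)          # start with a full bucket
        self.last = 0.0                     # time of the previous call

    def allow(self, now):
        """Return True if one NOTIFY may be sent at time `now` (seconds)."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Assumed limit: 60 NOTIFYs per minute, at most 2 in a burst.
bucket = TokenBucket(rate_per_minute=60, burst=2)
decisions = [bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0)]
print(decisions)  # [True, True, False, True]
```

A NOTIFY that is not allowed would be kept in the queue and tried again
later rather than dropped. A limit on AXFR requests sent "at once" is a
slightly different thing (a cap on concurrent transfers) and would need
a counter of transfers in progress instead.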

I hope that my suggestions can be implemented soon, because this problem
can lead to DNS outages in some rare cases (like mine)!

  -- Bas van Schaik

Bas van Schaik wrote:
> Bas van Schaik wrote:
>> Hi all,
>>
>> I'm currently setting up a third nameserver for my domains. The primary
>> and secondary are using native SQL replication (works great!) but are
>> physically in the same building, which is of course not very fault
>> tolerant. Now, I'm setting up the third nameserver (master/slave
>> replication) using the primary nameserver as a supermaster. For the
>> first few hundred domains this initial transfer works perfectly, but
>> after a while the slave server begins to throw errors like this:
>>> Apr 23 13:37:07 callisto pdns[2433]: Received NOTIFY for assen.nl from
>>> 87.251.57.140 for which we are not authoritative
>>> Apr 23 13:37:08 callisto pdns[2433]: Error resolving SOA or NS for
>>> 'assen.nl' at 87.251.57.140
>> (note that "callisto" is the hostname of the slave nameserver,
>> 87.251.57.140 is my primary nameserver)
>>
>> After a while, the slave holds about 600 domains, but the master
>> nameserver has about 1500 domains! Based on these errors, I started
>> investigating the records for "assen.nl" (and some of the other domains
>> failing to transfer), but found nothing suspicious. So I just retried
>> the NOTIFY on the master:
>>> pdns_control notify assen.nl
>> Which led to the following log entries on the slave:
>>> Apr 23 13:57:20 callisto pdns[2453]: Received NOTIFY for assen.nl from
>>> 87.251.57.140 for which we are not authoritative
>>> Apr 23 13:57:20 callisto pdns[2453]: Created new slave zone 'assen.nl'
>>> from supermaster 87.251.57.140, queued axfr
>>> Apr 23 13:57:21 callisto pdns[2429]: AXFR started for 'assen.nl',
>>> transaction started
>>> Apr 23 13:57:21 callisto pdns[2429]: AXFR done for 'assen.nl', zone
>>> committed
>> Note that I changed nothing in the "assen.nl" zone on the master at all!
>> There seems to be a problem with the initial transfer of hundreds of
>> domains from a master to a slave? I already tried to change the number
>> of running threads on both master and slave, but that didn't do the
>> trick. I also noticed that some NOTIFY-packets are never received by the
>> slave. Master's log:
>>> $ cat daemon.log | grep -i adselectshop.nl
>>> Apr 23 14:19:57 helios pdns[5365]: Queued notification of domain
>>> 'adselectshop.nl' to 80.89.236.78
>> ("callisto" is the slave nameserver with IP 80.89.236.78)
>>
>> And the slave log:
>>> $ cat daemon.log | grep -i adselectshop.nl
>>> (no output)
>> Again, executing "pdns_control notify adselectshop.nl" on the master
>> nameserver manually did the trick for this domain. Anyone out there to
>> enlighten me?
>>
> 
> Anyone? It's a really annoying problem!
> 
>   -- Bas