[Pdns-users] Possible AXFR Race Condition
p8x
l at p8x.net
Wed Feb 23 07:54:57 UTC 2011
Hi all,
After spending the day testing and mucking around I think I have
narrowed it down slightly. On the master, I have about 77,000 domains.
When I add a new zone to it I do the following:
1. A new zone is added to the supermaster server and a "pdns_control
rediscover" is then run. The zone appears on the master and and queries
to the zone work successful. At this step everything is ok.
2. The super master sends the notify to the slave, and the slave
receives the notify with the following log lines (from the slave):
Feb 23 15:00:20 dns2 pdns[17029]: Received NOTIFY for test.com from
192.168.1.10 for which we are not authoritative
Feb 23 15:00:20 dns2 pdns[17029]: [bindbackend] Writing bind config zone
statement for superslave zone 'test.com' from supermaster 203.1708
Feb 23 15:00:20 dns2 pdns[17029]: Created new slave zone 'test.com' from
supermaster 192.168.1.10, queued axfr
Feb 23 15:00:20 dns2 pdns[17029]: Initiating transfer of 'test.com' from
remote '192.168.1.10'
Feb 23 15:00:20 dns2 pdns[17029]: Can't determine backend for domain
'test.com'
On the slave, I can see that it has added that zone to its bind slave
configuration file successful. The zone transfer doesn't actually happen
and the zone file doesn't get written on the slave (I waited for a few
minutes to ensure that this is the case). Queries for the zone fail on
the slave.
At this stage I can do one of two things to get the zone to work:
a) Restart the pdns server on the slave, it realises that the zone
doesnt exist and the transfer actually completes
b) Send another notify from the master to the slave. This causes the
symptom below where the zone gets added to the bind slave config file a
second time. The zone transfer also happens this time, as can be seen
with these log lines:
Feb 23 15:22:29 dns2 pdns[17029]: Received NOTIFY for test.com from
192.168.1.10 for which we are not authoritative
Feb 23 15:22:29 dns2 pdns[17029]: [bindbackend] Writing bind config zone
statement for superslave zone 'test.com' from supermaster 203.1708
Feb 23 15:22:29 dns2 pdns[17029]: Created new slave zone 'test.com' from
supermaster 192.168.1.10, queued axfr
Feb 23 15:22:29 dns2 pdns[17029]: Initiating transfer of 'test.com' from
remote '192.168.1.10'
Feb 23 15:22:29 dns2 pdns[17029]: AXFR started for 'test.com',
transaction started
Feb 23 15:22:29 dns2 pdns[17029]: Remote 192.168.1.10 tried to sneak in
out-of-zone data '' during AXFR of zone 'test.com', ignoring
Feb 23 15:22:29 dns2 pdns[17029]: Zone 'test.com'
(/etc/powerdns/zones/test.com) reloaded
Feb 23 15:22:29 dns2 pdns[17029]: AXFR done for 'test.com', zone committed
I noticed after the second notify, the line saying "Can't determine
backend for domain 'test.com'" does not appear, this only seems to
happen after the first notify. I am assuming this is what is causing the
issue, but I can't see why this is happening as a restart of the pdns
server with no config changes seems to fix it.
Has anyone experienced this?
Thanks
On 22/02/2011 5:23 PM, p8x wrote:
> Hi all,
>
> I am seeing a strange issue with a large amount of zone transfers that
> are happening in a short amount of time.
>
> I am running 2 pdns servers, one as a super master and one as a super
> slave. Both servers are only configured to use the Bind backend. When I
> have added a large amount of zones to the master and the notify messages
> get sent out to the slave (which is expected) to notify the slave of the
> new zones. When there are a large amount of notify messages the slave
> server seems to get the same zone inserted multiple times to the slave
> configuration due to what appears to be a race as the master resends the
> notify if it doesn't get an acknowledgement back.
>
> On the master, for the notify for a domain the following messages are
> logged:
>
> Feb 22 16:08:16 dns1 pdns[17432]: Notification request to host
> 192.168.1.1 for domain 'test.com' received
> Feb 22 16:08:25 dns1 pdns[17432]: Removed from notification list:
> 'test.com' to 192.168.1.1 (was acknowledged)
> Feb 22 16:08:27 dns1 pdns[17432]: Approved AXFR of 'test.com' from
> recently notified slave 192.168.1.1
> Feb 22 16:08:27 dns1 pdns[17432]: AXFR of domain 'test.com' initiated by
> 192.168.1.1
> Feb 22 16:08:27 dns1 pdns[17432]: AXFR of domain 'test.com' to
> 192.168.1.1 finished
> Feb 22 16:08:35 dns1 pdns[17432]: Approved AXFR of 'test.com' from
> recently notified slave 192.168.1.1
> Feb 22 16:08:35 dns1 pdns[17432]: AXFR of domain 'test.com' initiated by
> 192.168.1.1
> Feb 22 16:08:35 dns1 pdns[17432]: AXFR of domain 'test.com' to
> 192.168.1.1 finished
> Feb 22 16:08:43 dns1 pdns[17432]: Approved AXFR of 'test.com' from
> recently notified slave 192.168.1.1
> Feb 22 16:08:43 dns1 pdns[17432]: AXFR of domain 'test.com' initiated by
> 192.168.1.1
> Feb 22 16:08:43 dns1 pdns[17432]: AXFR of domain 'test.com' to
> 192.168.1.1 finished
>
> On the slave server, the following messages are logged:
>
> Feb 22 16:08:16 dns2 pdns[24071]: Received NOTIFY for test.com from
> 192.168.1.10 for which we are not authoritative
> Feb 22 16:08:16 dns2 pdns[24071]: [bindbackend] Writing bind config zone
> statement for superslave zone 'test.com' from supermaster 192.168.1.10
> Feb 22 16:08:16 dns2 pdns[24071]: Created new slave zone 'test.com' from
> supermaster 192.168.1.10, queued axfr
> Feb 22 16:08:20 dns2 pdns[24071]: Received NOTIFY for test.com from
> 192.168.1.10 for which we are not authoritative
> Feb 22 16:08:20 dns2 pdns[24071]: [bindbackend] Writing bind config zone
> statement for superslave zone 'test.com' from supermaster 192.168.1.10
> Feb 22 16:08:20 dns2 pdns[24071]: Created new slave zone 'test.com' from
> supermaster 192.168.1.10, queued axfr
> Feb 22 16:08:25 dns2 pdns[24071]: Received NOTIFY for test.com from
> 192.168.1.10 for which we are not authoritative
> Feb 22 16:08:25 dns2 pdns[24071]: [bindbackend] Writing bind config zone
> statement for superslave zone 'test.com' from supermaster 192.168.1.10
> Feb 22 16:08:25 dns2 pdns[24071]: Created new slave zone 'test.com' from
> supermaster 192.168.1.10, queued axfr
> Feb 22 16:08:27 dns2 pdns[24071]: Initiating transfer of 'test.com' from
> remote '192.168.1.10'
> Feb 22 16:08:27 dns2 pdns[24071]: Can't determine backend for domain
> 'test.com'
> Feb 22 16:08:35 dns2 pdns[24071]: Initiating transfer of 'test.com' from
> remote '192.168.1.10'
> Feb 22 16:08:35 dns2 pdns[24071]: Can't determine backend for domain
> 'test.com'
> Feb 22 16:08:43 dns2 pdns[24071]: Initiating transfer of 'test.com' from
> remote '192.168.1.10'
> Feb 22 16:08:43 dns2 pdns[24071]: Can't determine backend for domain
> 'test.com'
>
> The slave configuration file ends up with these comments for the zone
> that is inserted multiple times, which seems to match up with when the
> notify messages were received:
>
> # Superslave zone test.com (added: Tue Feb 22 16:08:16 2011) (account: )
> # Superslave zone test.com (added: Tue Feb 22 16:08:20 2011) (account: )
> # Superslave zone test.com (added: Tue Feb 22 16:08:25 2011) (account: )
>
> I am not 100% sure about this, but the way I read the logs is that the
> notification request is sent from the master at 16:08:16 and the slaves
> gets it straight away and starts writing out the config. As the
> acknowledgement doesn't arrive until a few seconds later (16:08:25), in
> that 9 second gap there is a further two notifications sent as there was
> no reply in some sort of time period.
>
> I also noticed that the zone file doesn't actually get written out on
> the slave server either. Running a 'pdns_control rediscover' or
> 'pdns_control bind-reload-now test.com' doesn't seem to get it to write
> it out either (the zone is in the slave configuration file). The
> bind-reload-now command gives an error saying that the zone file doesn't
> exist, so it also shows up in the reject domains. A restart of the pdns
> server process seems to correct the issue - the zone transfer seems to
> actually write out the zone file and DNS queries for the zone then seem
> to work on the secondary. From the secondary, after a server restart I
> can see the following relevant message in the log:
>
> Feb 22 16:54:19 dns2 pdns[30258]: Received valid NOTIFY for test.com
> (id=1369) from master 192.168.1.10: 2011020102 > 0
>
> After the above message, the zone file is written out.
>
> I also see this message printed every minute or so (the SOA records give
> a slave refresh time of 24H):
>
> Feb 22 17:00:52 dns2 pdns[30540]: Domain test.com is stale, master
> serial 2011020102, our serial 0
>
> After the zone file is written out, DNS queries for that zone on the
> secondary also seem to fail (with dig shows recursion requested but
> denied) but a restart of the process again after the zone file is
> written out seems to fix this too.
>
> These issues seem to happen when I have a large number of zones (each
> zone has about 10-15 records all up and there are approx 77,000 zones)
> and at the time there are a lot of notifications flying around the
> place. I can reproduce the issue by clearing the slave zone file and
> zones (so that the secondary has no DNS zones configured) and then
> sending more than 1 notify request to the slave in quick succession
> using 'pdns_control notify-host' (although above a single notification
> is being sent but it is happening at the same time as many other
> notifications).
>
> I have tried a few versions of pdns (2.9.22-3 (Ubuntu), 2.9.22-8
> (Debian) as well as the latest snapshot for version 3 (compiled)).
>
> The configuration for pdns is as follows for the slave:
>
> allow-recursion=127.0.0.1
> bind-check-interval=60
> bind-config=/etc/powerdns/slave.conf
> bind-supermaster-config=/etc/powerdns/slave.conf
> bind-supermaster-destdir=/etc/powerdns/zones
> bind-supermasters=/etc/powerdns/masters.conf
> cache-ttl=20
> config-dir=/etc/powerdns
> daemon=yes
> disable-tcp=no
> guardian=yes
> launch=bind
> lazy-recursion=yes
> local-address=192.168.1.1
> local-port=53
> master=no
> max-tcp-connections=30
> module-dir=/usr/lib/powerdns
> query-local-address=192.168.1.1
> setgid=pdns
> setuid=pdns
> slave=yes
> socket-dir=/var/run
>
> I am happy to test anything if anyone has any suggestions to what could
> be causing the zones to be doubled up.
>
> Thanks
>
>
> _______________________________________________
> Pdns-users mailing list
> Pdns-users at mailman.powerdns.com
> http://mailman.powerdns.com/mailman/listinfo/pdns-users
More information about the Pdns-users
mailing list