[Pdns-users] Odd master/slave behavior for large domains
thomas morgan
tm at zerigo.com
Fri Sep 11 05:14:18 UTC 2009
I'm seeing some odd behavior in a traditional master and slave setup.
Both the master and slave are PowerDNS. I've tried 2.9.21.2 and 2.9.22
and seem to get the same results. Memory usage and other figures below
are from 2.9.22.
The backends are Postgres 8.3. Servers are amd64 Debian Lenny.
I created a single zone on the server and added 2 million host
records. I know that's a bunch, but it is a specific use case, not
just an attempt to break things.
Oddity #1:
The master seems to send 3-4 NOTIFYs when the zone is updated -- at
least the slave is reporting in the logs to have received multiple
NOTIFYs. They're pretty consistently spaced: in several instances, 4
NOTIFYs with intervals 3sec, 5sec, 9sec.
Oddity #2:
The slave, upon receiving multiple NOTIFYs, initiates multiple AXFRs
for the same zone. For a small zone this wouldn't be a big deal, but
for a large zone it's fatal.
Oddity #3:
The master consumes a *huge* amount of RAM to do each AXFR: 283 MB
worth per AXFR. Memory usage on the slave seems tolerable though.
Oddity #4:
If an AXFR doesn't finish properly, the memory is never released. I've
managed to reproduce this using dig to perform the AXFR as well.
Strangely dig dies after 8600 records when piped to less, but makes it
through all 2 million when piped to wc or just output to the terminal.
This might be dig's problem, but it does make it easy to cause #4. The
final line of dig's output when it fails is ";; communications error:
end of file". When the AXFR completes, the log includes "AXFR of
domain 'h0.com' to 64.<client> finished". When it doesn't (and memory
isn't freed), this message never shows up.
I cannot tell, but it's possible the slave is eventually figuring out
it has multiple transfers for the same zone going and aborts them. The
logs (included at the bottom) give mixed readings on this (between the
master and slave). It may also be the master puking and closing the
connection.
On my test servers things usually end when the oomkiller rears its
ugly head. ;)
The combination of these becomes a significant problem. #1 causes #2
when then multiplies the effect of #3. #2 also increases the
likelihood of #4.
Anything I'm missing or that I can do to help figure out what's wrong?
Some logs are included below; they are not identical in every case,
but this is representative.
--t
Master server:
Sep 11 02:07:45 pdns-master-test pdns[4156]: 1 domain for which we are
master needs notifications
Sep 11 02:07:45 pdns-master-test pdns[4156]: Queued notification of
domain 'h0.com' to 64.<master>
Sep 11 02:07:45 pdns-master-test pdns[4156]: Queued notification of
domain 'h0.com' to 64.<slave>
Sep 11 02:07:45 pdns-master-test pdns[4156]: Received NOTIFY for
h0.com from 64.<master> which is not a master
Sep 11 02:07:46 pdns-master-test pdns[4156]: Received unsuccesful
notification report for 'h0.com' from 64.<master>, rcode: 5
Sep 11 02:07:46 pdns-master-test pdns[4156]: Removed from notification
list: 'h0.com' to 64.<master>
Sep 11 02:07:46 pdns-master-test pdns[4156]: AXFR of domain 'h0.com'
initiated by 64.<slave>
Sep 11 02:07:46 pdns-master-test pdns[4156]: gpgsql Connection succesful
Sep 11 02:07:48 pdns-master-test pdns[4156]: No master domains need
notifications
Sep 11 02:08:13 pdns-master-test pdns[4156]: AXFR of domain 'h0.com'
initiated by 64.<slave>
Sep 11 02:08:13 pdns-master-test pdns[4156]: gpgsql Connection succesful
Sep 11 02:08:23 pdns-master-test pdns[4156]: AXFR of domain 'h0.com'
initiated by 64.<slave>
Sep 11 02:08:25 pdns-master-test pdns[4156]: gpgsql Connection succesful
Sep 11 02:08:33 pdns-master-test pdns[4156]: AXFR of domain 'h0.com'
initiated by 64.<slave>
Sep 11 02:08:34 pdns-master-test pdns[4156]: gpgsql Connection succesful
Sep 11 02:09:21 pdns-master-test pdns[4156]: Communicator thread died
because of error: GSQLBackend unable to retrieve list of slave
domains: PostgreSQL failed to execute command: server closed the
connection unexpectedly#012#011This probably means the server
terminated abnormally#012#011before or while processing the request.
(The last line is where oomkiller decided to kill off postgres.)
Slave server:
Sep 11 02:07:45 pdns-slave-test pdns[4379]: Received valid NOTIFY for h0.com
(id=1) from master 64.<master>: 1252634826 > 1252632174
Sep 11 02:07:46 pdns-slave-test pdns[4379]: gpgsql Connection succesful
Sep 11 02:07:48 pdns-slave-test pdns[4379]: Received valid NOTIFY for h0.com
(id=1) from master 64.<master>: 1252634826 > 1252632174
Sep 11 02:07:51 pdns-slave-test pdns[4379]: AXFR started for 'h0.com',
transaction started
Sep 11 02:07:53 pdns-slave-test pdns[4379]: Received valid NOTIFY for h0.com
(id=1) from master 64.<master>: 1252634826 > 1252632174
Sep 11 02:08:02 pdns-slave-test pdns[4379]: Received valid NOTIFY for h0.com
(id=1) from master 64.<master>: 1252634826 > 1252632174
Sep 11 02:08:13 pdns-slave-test pdns[4379]: Unable to AXFR zone
'h0.com': Remote nameserver closed TCP connection
Sep 11 02:08:13 pdns-slave-test pdns[4379]: Aborting possible open
transaction for domain 'h0.com' AXFR
Sep 11 02:08:13 pdns-slave-test pdns[4379]: gpgsql Connection succesful
Sep 11 02:08:23 pdns-slave-test pdns[4379]: Unable to AXFR zone
'h0.com': Timeout waiting for answer from 64.<master> during AXFR
Sep 11 02:08:23 pdns-slave-test pdns[4379]: gpgsql Connection succesful
Sep 11 02:08:33 pdns-slave-test pdns[4379]: Unable to AXFR zone
'h0.com': Timeout waiting for answer from 64.<master> during AXFR
Sep 11 02:08:33 pdns-slave-test pdns[4379]: gpgsql Connection succesful
Sep 11 02:08:43 pdns-slave-test pdns[4379]: Unable to AXFR zone
'h0.com': Timeout waiting for answer from 64.<master> during AXFR
(nothing more until the master's postgres daemon was restarted)
The correlation between the "AXFR initated" on the master and the
"Timeout waiting for answer" on the slaves is strange too.
--
More information about the Pdns-users
mailing list