[Pdns-users] Odd master/slave behavior for large domains

Fri Sep 11 05:14:18 UTC 2009

I'm seeing some odd behavior in a traditional master and slave setup.

Both the master and slave are PowerDNS. I've tried 2.9.21.2 and 2.9.22  
and seem to get the same results. Memory usage and other figures below  
are from 2.9.22.

The backends are Postgres 8.3. Servers are amd64 Debian Lenny.

I created a single zone on the server and added 2 million host  
records. I know that's a bunch, but it is a specific use case, not  
just an attempt to break things.

Oddity #1:
The master seems to send 3-4 NOTIFYs when the zone is updated -- at  
least the slave is reporting in the logs to have received multiple  
NOTIFYs. They're pretty consistently spaced: in several instances, 4  
NOTIFYs with intervals 3sec, 5sec, 9sec.

Oddity #2:
The slave, upon receiving multiple NOTIFYs, initiates multiple AXFRs  
for the same zone. For a small zone this wouldn't be a big deal, but  
for a large zone it's fatal.

Oddity #3:
The master consumes a *huge* amount of RAM to do each AXFR: 283 MB  
worth per AXFR. Memory usage on the slave seems tolerable though.

Oddity #4:
If an AXFR doesn't finish properly, the memory is never released. I've  
managed to reproduce this using dig to perform the AXFR as well.

Strangely dig dies after 8600 records when piped to less, but makes it  
through all 2 million when piped to wc or just output to the terminal.  
This might be dig's problem, but it does make it easy to cause #4. The  
final line of dig's output when it fails is ";; communications error:  
end of file". When the AXFR completes, the log includes "AXFR of  
domain 'h0.com' to 64.<client> finished". When it doesn't (and memory  
isn't freed), this message never shows up.

I cannot tell, but it's possible the slave is eventually figuring out  
it has multiple transfers for the same zone going and aborts them. The  
logs (included at the bottom) give mixed readings on this (between the  
master and slave). It may also be the master puking and closing the  
connection.

On my test servers things usually end when the oomkiller rears its  
ugly head. ;)

The combination of these becomes a significant problem. #1 causes #2  
when then multiplies the effect of #3. #2 also increases the  
likelihood of #4.

Anything I'm missing or that I can do to help figure out what's wrong?  
Some logs are included below; they are not identical in every case,  
but this is representative.

--t

Master server:

Sep 11 02:07:45 pdns-master-test pdns[4156]: 1 domain for which we are  
master needs notifications
Sep 11 02:07:45 pdns-master-test pdns[4156]: Queued notification of  
domain 'h0.com' to 64.<master>
Sep 11 02:07:45 pdns-master-test pdns[4156]: Queued notification of  
domain 'h0.com' to 64.<slave>
Sep 11 02:07:45 pdns-master-test pdns[4156]: Received NOTIFY for  
h0.com from 64.<master> which is not a master
Sep 11 02:07:46 pdns-master-test pdns[4156]: Received unsuccesful  
notification report for 'h0.com' from 64.<master>, rcode: 5
Sep 11 02:07:46 pdns-master-test pdns[4156]: Removed from notification  
list: 'h0.com' to 64.<master>
Sep 11 02:07:46 pdns-master-test pdns[4156]: AXFR of domain 'h0.com'  
initiated by 64.<slave>
Sep 11 02:07:46 pdns-master-test pdns[4156]: gpgsql Connection succesful
Sep 11 02:07:48 pdns-master-test pdns[4156]: No master domains need  
notifications
Sep 11 02:08:13 pdns-master-test pdns[4156]: AXFR of domain 'h0.com'  
initiated by 64.<slave>
Sep 11 02:08:13 pdns-master-test pdns[4156]: gpgsql Connection succesful
Sep 11 02:08:23 pdns-master-test pdns[4156]: AXFR of domain 'h0.com'  
initiated by 64.<slave>
Sep 11 02:08:25 pdns-master-test pdns[4156]: gpgsql Connection succesful
Sep 11 02:08:33 pdns-master-test pdns[4156]: AXFR of domain 'h0.com'  
initiated by 64.<slave>
Sep 11 02:08:34 pdns-master-test pdns[4156]: gpgsql Connection succesful
Sep 11 02:09:21 pdns-master-test pdns[4156]: Communicator thread died  
because of error: GSQLBackend unable to retrieve list of slave  
domains: PostgreSQL failed to execute command: server closed the  
connection unexpectedly#012#011This probably means the server  
terminated abnormally#012#011before or while processing the request.
(The last line is where oomkiller decided to kill off postgres.)

Slave server:

Sep 11 02:07:45 pdns-slave-test pdns[4379]: Received valid NOTIFY for h0.com 
  (id=1) from master 64.<master>: 1252634826 > 1252632174
Sep 11 02:07:46 pdns-slave-test pdns[4379]: gpgsql Connection succesful
Sep 11 02:07:48 pdns-slave-test pdns[4379]: Received valid NOTIFY for h0.com 
  (id=1) from master 64.<master>: 1252634826 > 1252632174
Sep 11 02:07:51 pdns-slave-test pdns[4379]: AXFR started for 'h0.com',  
transaction started
Sep 11 02:07:53 pdns-slave-test pdns[4379]: Received valid NOTIFY for h0.com 
  (id=1) from master 64.<master>: 1252634826 > 1252632174
Sep 11 02:08:02 pdns-slave-test pdns[4379]: Received valid NOTIFY for h0.com 
  (id=1) from master 64.<master>: 1252634826 > 1252632174
Sep 11 02:08:13 pdns-slave-test pdns[4379]: Unable to AXFR zone  
'h0.com': Remote nameserver closed TCP connection
Sep 11 02:08:13 pdns-slave-test pdns[4379]: Aborting possible open  
transaction for domain 'h0.com' AXFR
Sep 11 02:08:13 pdns-slave-test pdns[4379]: gpgsql Connection succesful
Sep 11 02:08:23 pdns-slave-test pdns[4379]: Unable to AXFR zone  
'h0.com': Timeout waiting for answer from 64.<master> during AXFR
Sep 11 02:08:23 pdns-slave-test pdns[4379]: gpgsql Connection succesful
Sep 11 02:08:33 pdns-slave-test pdns[4379]: Unable to AXFR zone  
'h0.com': Timeout waiting for answer from 64.<master> during AXFR
Sep 11 02:08:33 pdns-slave-test pdns[4379]: gpgsql Connection succesful
Sep 11 02:08:43 pdns-slave-test pdns[4379]: Unable to AXFR zone  
'h0.com': Timeout waiting for answer from 64.<master> during AXFR
(nothing more until the master's postgres daemon was restarted)

The correlation between the "AXFR initated" on the master and the  
"Timeout waiting for answer" on the slaves is strange too.

--