[Pdns-users] Odd master/slave behavior for large domains

thomas morgan tm at zerigo.com
Fri Sep 11 22:07:21 UTC 2009


On Sep 11, 2009, at 12:56 PM, bert hubert wrote:

> On Fri, Sep 11, 2009 at 7:14 AM, thomas morgan <tm at zerigo.com> wrote:
>> I created a single zone on the server and added 2 million host  
>> records. I
>> know that's a bunch, but it is a specific use case, not just an  
>> attempt to
>> break things.
>
> Thomas,
>
> Many thanks for your interesting and detailed bug report! I've done
> some initial investigation, and my guess is that this is a deficiency
> in our PostgreSQL support, or perhaps in the PostgreSQL client library
> (unlikely I think).
>
> Sadly I am very busy right now, but you could help debugging this if
> you could reproduce this issue (or perhaps, fail to) using the MySQL
> backend or the BIND backend.
>
> If this clears up the issue, I know where to look.

Bert--

I tried it with the bind backend. Notes on that and other items below.

>> Oddity #1:
>> The master seems to send 3-4 NOTIFYs when the zone is updated -- at  
>> least
>> the slave is reporting in the logs to have received multiple NOTIFYs.
>> They're pretty consistently spaced: in several instances, 4 NOTIFYs  
>> with
>> intervals 3sec, 5sec, 9sec.
>
> This is just laziness - we keep sending NOTIFY packets until they are
> acknowledged, which PowerDNS only does after the AXFR succeeded.
>
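That matches the spacing I saw. For the archives: the 3s/5s/9s gaps
are consistent with a simple growing-interval resend loop. A tiny
model of that (the later gap values and the exact PowerDNS schedule
are guesses on my part):

```python
def notify_times(axfr_seconds, gaps=(3, 5, 9, 15)):
    """Seconds (after the initial NOTIFY) at which resends go out,
    assuming the ACK only arrives once the AXFR finishes after
    `axfr_seconds`.  The 3/5/9 gaps mirror my logs; the later values
    and the real PowerDNS schedule are assumptions."""
    t, resends = 0, []
    for gap in gaps:
        t += gap
        if t >= axfr_seconds:
            break  # ACK arrives before the next resend would fire
        resends.append(t)
    return resends
```

With a transfer that takes 20-30 seconds (about what my reloads
take), that gives resends at t=3, 8, 17 -- four NOTIFYs in total,
which matches what the slave logged.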
>> Oddity #2:
>> The slave, upon receiving multiple NOTIFYs, initiates multiple  
>> AXFRs for the
>> same zone. For a small zone this wouldn't be a big deal, but for a  
>> large
>> zone it's fatal.
>
> This one is stupid on our side.

Thinking on this some more: in addition to keeping the slave from
initiating multiple identical AXFRs, what about having the master
limit the number of simultaneous AXFRs from a single remote IP?
I'm thinking something configurable. An overall cap on concurrent
AXFRs might be appropriate too -- for a server with larger zones,
this looks like a fairly easy DoS vector given the memory usage
(even if that usage is substantially reduced).
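To sketch what I mean (the setting names here are hypothetical --
PowerDNS has no such options today; this is just the bookkeeping the
master would need):

```python
import threading
from collections import defaultdict

class AxfrLimiter:
    """Sketch of a per-IP and global cap on concurrent AXFRs.

    `per_ip` and `total` stand in for hypothetical config settings;
    nothing like this exists in PowerDNS as of this writing.
    """
    def __init__(self, per_ip=2, total=10):
        self.per_ip, self.total = per_ip, total
        self.counts = defaultdict(int)  # active AXFRs per client IP
        self.active = 0                 # active AXFRs overall
        self.lock = threading.Lock()

    def try_acquire(self, ip):
        """Return True if a new AXFR for `ip` may start, else False
        (caller would answer REFUSED instead of transferring)."""
        with self.lock:
            if self.active >= self.total or self.counts[ip] >= self.per_ip:
                return False
            self.counts[ip] += 1
            self.active += 1
            return True

    def release(self, ip):
        """Call when an AXFR for `ip` completes or aborts."""
        with self.lock:
            self.counts[ip] -= 1
            self.active -= 1
```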

Maintaining a single in-memory copy of the AXFR data and sharing it
across multiple concurrent transfers, even from different IPs, could
also be useful.

Ken's suggestion of using a cursor when preparing the AXFR might be  
another way to improve memory use in these situations.
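Ken's cursor idea amounts to streaming the result set in fixed-size
batches rather than materializing all 2 million records at once. A
backend-agnostic sketch of the principle (the real backend is C++;
with PostgreSQL a named server-side cursor would keep the rows on the
server, and the column names below follow the stock gpgsql schema):

```python
def stream_axfr(cursor, zone_id, batch=1000):
    """Yield the records for an AXFR in batches of `batch` rows, so
    peak memory is O(batch) instead of O(zone size).

    `cursor` is any DB-API cursor; with PostgreSQL a *named*
    (server-side) cursor keeps the result set out of client memory.
    """
    cursor.execute(
        "SELECT name, ttl, type, content FROM records"
        " WHERE domain_id = %s",
        (zone_id,))
    while True:
        rows = cursor.fetchmany(batch)
        if not rows:
            break
        for row in rows:
            yield row
```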

>> Oddity #3:
>> The master consumes a *huge* amount of RAM to do each AXFR: 283 MB  
>> worth per
>> AXFR. Memory usage on the slave seems tolerable though.
>>
>> Oddity #4:
>> If an AXFR doesn't finish properly, the memory is never released.  
>> I've
>> managed to reproduce this using dig to perform the AXFR as well.
>
> 3 and 4 should go away if we debug the (probable) PostgreSQL support
> problem in PowerDNS.
>
>> Anything I'm missing or that I can do to help figure out what's  
>> wrong? Some
>> logs are included below; they are not identical in every case, but  
>> this is
>> representative.
>
> Reproducing with the BIND backend should help find the cause of this.
> You can just ask it to load the output of dig -t AXFR > zone.
>
> Please let me know! From there we'll look at the other issues you  
> found.


I did as suggested and used the dig AXFR output as the source zone file.

The bind backend does not significantly increase memory usage for an  
AXFR. This seems expected since the zone is already loaded into  
memory. Memory sits at 156m after start and zone load -- still quite a  
bit less than the 283m when building the zone from the PG database for  
an AXFR.
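For anyone reproducing this: I'm tracking memory by watching VmRSS in
/proc. A throwaway script along these lines works (Linux-only, my own
tooling, nothing to do with PowerDNS itself):

```python
import re
import time

def vmrss_kb(status_text):
    """Extract VmRSS (resident set size, kB) from the text of
    /proc/<pid>/status; return None if the field is absent."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None

def watch(pid, interval=1.0):
    """Print the process RSS once per `interval` seconds."""
    while True:
        with open("/proc/%d/status" % pid) as f:
            print(vmrss_kb(f.read()), "kB")
        time.sleep(interval)
```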

However, after that things get interesting. There appears to be a
memory leak that shows up when an AXFR (possibly any query,
unconfirmed) is attempted during a zone reload. This may be a
separate leak from the one seen during an aborted PG-backed AXFR.

Reloads take 20-30sec on this box for the 2 million host records, so
there's a decent-sized time window for a triggering query to occur.
The size of the leak varies: I saw PDNS up at 248m a couple of times
and 299m a couple of times, perhaps depending on how far the reload
had progressed when the triggering query came in.

A completed or failed AXFR (even several at once) didn't trigger this
on its own -- it only happened when timed to coincide with the zone
reload.


Additionally, twice, when the slave's AXFR request came in during the  
reload, PDNS aborted and had to be respawned by the guardian:  
"Communicator thread died because of error: Zone for 'h0.com' in
'/root/pdns/h0.com' temporarily not available (file missing, or master
dead)". Most of the time it just returned SERVFAIL while reloading  
(which appears to be correct from the documentation).


Somewhat separately, I think: during this round of testing I never
did get the slave (still running PG) to successfully complete an
AXFR, although I did yesterday. An AXFR via dig completed just fine.
I suspect this is related to what Ken has written about -- the slave
having trouble writing that much data to the DB efficiently.


So to summarize: the excessive memory use when preparing an AXFR (and
the leak when one is aborted) shows up when backed by PG but not when
backed by bind files. I haven't tried MySQL, but may try it this
weekend if I find some spare time.

The leak and crash during a bind-backed zone reload are probably
specific to the bind backend, although they may simply be more likely
to surface there because of the long reload time for a large zone.

While I haven't tried them yet, the hooks Ken has added for slave
AXFR commits seem like they could usefully reduce PG-backed (and
perhaps any DB-backed) AXFR resource consumption.


Let me know what other info I can provide, what else I can test, etc.

--t




