[Pdns-users] Bizarre simultaneous crash of three auth-only pdns servers

Lorens Kockum lorens-pdns-3987 at tagged.lorens.org
Mon Dec 15 16:21:36 UTC 2008


Hi,

One of my primary nameservers had a kernel fault and froze. Logs
say some kind of memory parity error, OK, we'll replace the
memory or the whole machine, it happens, things like that are
kind of hard to avoid.

To avoid problems *when* it happens, I replicate that pdns
nameserver to two secondaries using basic DNS
notifies. One of those is in another AS and another *continent*.
All three servers run Debian and postgresql, all three are
auth-only (will reply servfail if asked about a domain that is
not in the database).

My problem is that during the time the primary was down, the two
secondaries stopped replying, within just a few minutes.

I do have a script to keep synchronization in the face of missed
notifies, but all it does is execute pdns_control notify on the
master, so that script wasn't running.
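
For illustration, a minimal sketch of that kind of resync script, not the actual one. It assumes `pdns_control` is on the PATH of the master and that the zone names come from some hypothetical list; the function names are made up for the example.

```python
import subprocess

def notify_commands(domains):
    """Build one `pdns_control notify <domain>` command per zone."""
    return [["pdns_control", "notify", d] for d in domains]

def resend_notifies(domains, run=subprocess.run):
    """Run each notify command; return the domains whose command failed.

    `run` is injectable so the logic can be tried without pdns installed.
    """
    failed = []
    for cmd in notify_commands(domains):
        result = run(cmd)
        if result.returncode != 0:
            failed.append(cmd[-1])  # last element is the domain name
    return failed
```

Since a script like this only runs on the master, it does nothing for the secondaries while the master itself is down.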

On both secondaries, I had the time to stop the pdns process,
restart postgresql, start pdns, try and fail to get answers from
pdns, and to log in to the database and check that the data was
correct there. On one (running 2.9.20 from debian), I had the
time to look at the logs, and see LOADS of

    Dec 15 14:52:46 ns2 pdns[32531]: Error trying to
    retrieve/refresh '${NAMEOFDOMAIN}.com': Timeout waiting for
    answer

apparently for every domain in the database, but the machine was
still not replying to queries.
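
To check whether the timeouts really cover every domain, something like the following could count them per zone. It is only a sketch: it assumes each message sits on a single line in the real log file (it is wrapped above for mail), and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the pdns message quoted above, assuming it is on one line
# in the actual syslog file.
TIMEOUT_RE = re.compile(
    r"Error trying to retrieve/refresh '([^']+)': Timeout waiting for answer"
)

def timeouts_per_domain(lines):
    """Count 'Timeout waiting for answer' errors per domain name."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Comparing the number of keys in the result with the number of rows in the domains table would show whether all zones were affected.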

I was certain that upon restart pdns started serving domains
immediately (unlike bind which takes a long time to read all its
zone files...)

I updated the 'master' field in the domains table to '', and
restarted pdns, and some small time later I did get replies, but
by that time the master server was running again and the other
secondary (running 2.9.21 from debian) was rebooting, so I
changed it back.
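
A rough sketch of that workaround, for anyone who wants to reproduce it. It uses Python's built-in sqlite3 purely as a stand-in for PostgreSQL so that it runs anywhere; only the `domains` table and its `master` field come from the description above, while the `name` column and both function names are assumptions. A PostgreSQL driver would use its own placeholder style.

```python
import sqlite3

def blank_masters(conn):
    """Remember each zone's master, then set the field to ''.

    Returns a {name: master} mapping so the change can be undone.
    """
    saved = dict(conn.execute("SELECT name, master FROM domains"))
    conn.execute("UPDATE domains SET master = ''")
    conn.commit()
    return saved

def restore_masters(conn, saved):
    """Put back the master values recorded by blank_masters()."""
    conn.executemany(
        "UPDATE domains SET master = ? WHERE name = ?",
        [(master, name) for name, master in saved.items()],
    )
    conn.commit()
```

Saving the old values first matters because, once the field is blanked, the database no longer records which master each zone was replicated from.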

I suppose this isn't known pdns behaviour, or it would have been
fixed, but I'm not totally utterly certain it's the fault of
pdns either, I'll have to do some testing to be certain. I know
I've run pdns for some five years without the slightest problem,
but I can't say I've ever tried to restart a secondary when
the primary is not up (and that doesn't explain why they stopped
replying in the first place, of course). Has anybody had similar
problems?

-- 
Thanks,
Lorens

