[Pdns-users] update on Exactly simultaneous PowerDNS Recursor Crashes in a number of places
bert.hubert at netherlabs.nl
Mon Feb 1 20:57:20 UTC 2010
I promised several people that I would summarise what we discovered about the spike of
packets. It is not quite conclusive, but enough that we know what to do.
The short summary is that the upcoming 3.2 release will contain some slight
tweaks to improve stability, but that there is no reason to rush anything.
At the timestamps indicated, around 5 large access providers told me that
they saw a very brief but sharp spike in the number of DNS queries. The
accuracy with which people saw the phenomenon is striking - within a few
Some other providers did not see anything in their PowerDNS stats, but did
see significant increases in the number of packets sent to and from their
PowerDNS servers. This may reflect PowerDNS keeling over before being able
to report statistics.
Other providers did not see any effect at all. DNS Resolvers exclusively
serving mail servers appear not to have been affected at all.
From the limited data available, it appears that a group of authoritative
nameservers, probably anycasted, briefly & simultaneously gave up the ghost.
From other measurements, we know that this interruption caused client
computers ('end-user equipment') to repeat failing queries in a big way,
leading to massively increased query counts.
This tripped up busy PowerDNS servers in three important places, but did
not cause outages at any other place as far as we know.
From studying what we do know, it appears this specific traffic pattern
caused PowerDNS to exceed memory capacity at some busy installations. The
cause of this has been found, but we have problems reproducing the
circumstances that triggered this behaviour. What really happened there
remains an open question.
Summarising, at the timestamps mentioned, something bad definitely occurred
('a disturbance in the force', or if you will 'a glitch in the matrix'), but
only three of our users were affected.
What scared me were the simultaneous crash reports coming in - even if
initially only from two places.
I want to thank many anonymous operators who supplied graphs with
spectacular, or no, peaks in them. While I'm happy & grateful they shared,
it is quite sad that 'shareholder accountability' now means that this all
has to happen off the record!