[Pdns-users] pdns stops responding to requests

Tue Dec 19 04:03:34 UTC 2006

Here are all of the details I can come up with:

1- tcpdump clearly shows that the server is getting DNS requests

2- Output of "pdns dump"
[root@powerdns2 /]# /etc/init.d/pdns dump
corrupt-packets=426,deferred-cache-inserts=44,deferred-cache-lookup=100,latency=32709,
packetcache-hit=17770,packetcache-miss=20433,packetcache-size=5658,qsize-q=37,
query-cache-hit=89780,query-cache-miss=83406,recursing-answers=0,recursing-questions=0,
servfail-packets=0,tcp-answers=2,tcp-queries=2,timedout-packets=0,udp-answers=37835,
udp-queries=38336,udp4-answers=37835,udp4-queries=38282,udp6-answers=0,udp6-queries=0,

While it was locked up, I ran the command again several minutes later, and the
numbers were exactly the same.
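
The snapshot comparison described in item 2 could be scripted. What follows is only a minimal sketch in Python and is not part of PowerDNS. It assumes the dump output is the comma-separated name=value list shown above, and the helper names parse_dump and unchanged_counters are hypothetical.

```python
def parse_dump(text):
    """Parse 'name=value,name=value,...' (possibly wrapped over lines) into a dict of ints."""
    stats = {}
    for field in text.replace("\n", "").split(","):
        field = field.strip()
        if not field:
            continue  # the dump output ends with a trailing comma
        name, _, value = field.partition("=")
        stats[name] = int(value)
    return stats


def unchanged_counters(before, after, names):
    """Return the entries of `names` whose value is identical in both snapshots."""
    return [n for n in names if before.get(n) == after.get(n)]


# Two snapshots taken minutes apart; the values are copied from the dump above.
first = parse_dump("udp-answers=37835,\nudp-queries=38336,qsize-q=37,")
second = parse_dump("udp-answers=37835,\nudp-queries=38336,qsize-q=37,")

watched = ["udp-queries", "udp-answers"]
stalled = unchanged_counters(first, second, watched)
if stalled == watched:
    print("no UDP queries were counted or answered between the two snapshots")
```

On a server that normally sees steady traffic, identical query and answer counters across two snapshots are a reasonable hint that the process has stopped reading from its socket, although a genuinely idle server would look the same.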

3- Confirmed that the built-in PowerDNS web server was up and accessible.
Interesting numbers from that page:

Uptime: 1.85 hours
Queries/second, 1, 5, 10 minute averages: 2.61e-14, 0.0111, 0.304. Max queries/second: 11.4
Cache hitrate, 1, 5, 10 minute averages: 49%, 48%, 48%
Backend query cache hitrate, 1, 5, 10 minute averages: 48%, 50%, 51%
Backend query load, 1, 5, 10 minute averages: 4.18e-14, 0.0216, 0.614. Max queries/second: 23.8
Total queries: 38336. Question/answer latency: 32.7ms

corrupt-packets         426     Number of corrupt packets received
deferred-cache-inserts  44      Amount of cache inserts that were deferred because of maintenance
deferred-cache-lookup   100     Amount of cache lookups that were deferred because of maintenance
latency                 32709   Average number of microseconds needed to answer a question
packetcache-hit         17770
packetcache-miss        20433
packetcache-size        5658
qsize-q                 37      Number of questions waiting for database attention
query-cache-hit         89780   Number of hits on the query cache
query-cache-miss        83406   Number of misses on the query cache
recursing-answers       0       Number of recursive answers sent out
recursing-questions     0       Number of questions sent to recursor
servfail-packets        0       Number of times a server-failed packet was sent out
tcp-answers             2       Number of answers sent out over TCP
tcp-queries             2       Number of TCP queries received
timedout-packets        0       Number of packets which weren't answered within timeout set
udp-answers             37835   Number of answers sent out over UDP
udp-queries             38336   Number of UDP queries received
udp4-answers            37835   Number of IPv4 answers sent out over UDP
udp4-queries            38282   Number of IPv4 UDP queries received
udp6-answers            0       Number of IPv6 answers sent out over UDP
udp6-queries            0       Number of IPv6 UDP queries received
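
A few derived figures can be read off these counters. The sketch below is illustrative Python, not PowerDNS code. It computes lifetime ratios from the totals above, whereas the web page reports 1, 5, and 10 minute averages, so the two are not expected to match exactly. Treating udp-queries minus udp-answers as "queries that never got a UDP answer" is an assumption, since that difference may also include packets dropped for other reasons, such as corrupt ones.

```python
# Counter values copied from the statistics listed above.
stats = {
    "udp-queries": 38336,
    "udp-answers": 37835,
    "packetcache-hit": 17770,
    "packetcache-miss": 20433,
    "query-cache-hit": 89780,
    "query-cache-miss": 83406,
}


def hit_ratio(hits, misses):
    """Fraction of lookups served from cache; 0.0 when there were no lookups."""
    total = hits + misses
    return hits / total if total else 0.0


# Assumption: every UDP query that did not produce a UDP answer counts here.
unanswered_udp = stats["udp-queries"] - stats["udp-answers"]
packet_ratio = hit_ratio(stats["packetcache-hit"], stats["packetcache-miss"])
query_ratio = hit_ratio(stats["query-cache-hit"], stats["query-cache-miss"])

print(f"UDP queries without a UDP answer: {unanswered_udp}")
print(f"packet cache hit ratio: {packet_ratio:.1%}")
print(f"query cache hit ratio: {query_ratio:.1%}")
```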

4- A couple of show commands:
[root@powerdns2 /]# /etc/init.d/pdns show qsize-q
qsize-q=37

[root@powerdns2 /]# /etc/init.d/pdns show packetcache-size
packetcache-size=5658

5- While it was locked up, I ran strace on each of the pdns processes.  The
three processes that I believe are the MySQL backends showed only
"rt_sigsuspend([".  The process in which I had seen the actual queries coming
in and going out also showed only "rt_sigsuspend([".

One interesting thing from the straces was this.  This process:
pdns       821   817  0 18:40 ?        00:00:00 /usr/sbin/pdns_server-instance --daemon --guardian=yes
is doing this each second:
recvfrom(13, 0xbebff2bc, 1500, 0, 0xbebff8f8, 0xbebff91c) = -1 EAGAIN (Resource temporarily unavailable)
time(NULL)                              = 1166498719
time(NULL)                              = 1166498719
rt_sigprocmask(SIG_BLOCK, [CHLD], [RTMIN], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0

The "Resource temporarily unavailable" may seem to point to something, but it
appears there during normal operation as well.
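
That is consistent with how a non-blocking socket behaves when nothing is waiting to be read. The following is a small Python sketch, not PowerDNS code, that reproduces the same error on an idle UDP socket. Whether file descriptor 13 in the strace above is actually a non-blocking UDP socket is an assumption.

```python
import errno
import socket

# Bind a UDP socket to an ephemeral loopback port that nothing sends to.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 0))
sock.setblocking(False)

got_eagain = False
try:
    # Same buffer size as in the strace output; no datagram is queued.
    sock.recvfrom(1500)
except BlockingIOError as exc:
    # On Linux EAGAIN and EWOULDBLOCK are the same error number.
    got_eagain = exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
finally:
    sock.close()

print("recvfrom on an idle non-blocking socket returned EAGAIN:", got_eagain)
```

By itself, then, the EAGAIN result only says that no data was available at that moment, which fits its showing up during normal operation too.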

It may be worth noting that if I leave an strace running on the process that is
sending/receiving requests, the strace occasionally dies with this message:

gettimeofday(upeek: ptrace(PTRACE_PEEKUSER,12878,44,0): No such process
Process 12878 detached

I should also point out that this powerdns server is actually a virtual server
running on a Linux VServer with kernel 2.6.18.3-vs2.1.1.2.  The other powerdns
server that we are running and that is working fine is also running on a Linux
VServer with kernel 2.6.18.2-vs2.1.1

Any ideas about why this is dying would be appreciated.  This server is an
authoritative name server for about a thousand domains, so I obviously would like to
get this resolved instead of having to restart the process when it dies (which
happens 5-8 times a day).

Thanks,
Brandon Checketts
Webpipe.net System Administrator

Brandon Checketts wrote:
> Augie,
> 
> Thank you for your response.  I'm running version 2.9.20 that was installed from
> the pdns-static-2.9.20-1 RPM.  UDP is definitely failing; I'm not sure about
> TCP, as we get very few TCP requests.
> 
> I've just installed strace, and can run it.  Each pdns_server-instance process
> displays different things, but I'm not a programmer, so I don't understand much
> of the output.   I can pick out the three processes that are doing the MySQL
> queries, and the one that looks like it's receiving/sending the actual DNS
> traffic, so I will run strace on it the next time it fails to see if that
> identifies anything.  I will also try the pdns "dump" and see if that reveals
> anything.
> 
> I looked at a tcpdump that I had running at the time of a failure and didn't see
> anything unusual, just DNS requests, MySQL lookups/replies and DNS answers.
> Then, when it fails, there are only incoming DNS requests, with no other
> traffic until it is restarted.
> 
> Thanks,
> Brandon Checketts
> Webpipe.net System Administrator
> 
> 
> 
> Augie Schwer wrote:
>> On 12/17/06, Brandon Checketts <brandonc at webpipe.net> wrote:
>>> I've just replaced two BIND servers with PowerDNS Servers.   They are
>>> configured to use the gmysql backend, and MySQL is performing the
>>> replication between them.  Everything seems to be working fine, but on
>>> the "slave" server, pdns seems to just quit responding to DNS requests
>>> sometimes.   The web server portion of it continues to respond, but DNS
>>> queries just don't get answered.   Also, when it locks up, a
>>> netstat -l -n shows this (notice the Recv-Q for udp on port 53):
>> What version are you running? If there is nothing in the logs, then
>> you can try stracing one of the threads to see if that reveals any
>> more info. Is it both UDP and TCP that stop responding? Did you try
>> tcpdumping on the interface to make sure you are receiving the
>> requests? You can also look at the stats:
>>
>> /etc/init.d/pdns dump
>>
>> The latency and qsize-q stats may be revealing.
>>
>>