[Pdns-users] pdns stops responding to requests

Tue Dec 19 17:28:22 UTC 2006

Okay, I just caught something in the strace.

One thread was going along fine with stuff like this:

10853 09:41:28 recvfrom(5, "y\22\0\0\0\1\0\0\0\0\0\0\5wccon\3com\0\0\17\0\1",
512, 0, {sa_family=AF_INET, sin_port=htons(36921),
sin_addr=inet_addr("12.5.136.191")}, [16]) = 27
10853 09:41:28 gettimeofday({1166546488, 401614}, NULL) = 0
10853 09:41:28 time(NULL)               = 1166546488
10853 09:41:28 sendto(5, "y\22\200\2\0\1\0\0\0\0\0\0\5wccon\3com\0\0\17\0\1",
27, 0, {sa_family=AF_INET, sin_port=htons(36921),
sin_addr=inet_addr("12.5.136.191")}, 16) = 27
10853 09:41:28 gettimeofday({1166546488, 402038}, NULL) = 0
10853 09:41:28 recvfrom(5,
"\r\30\0\20\0\1\0\0\0\0\0\1\6ppp-58\0018\003192\00240\6"..., 512, 0,
{sa_family=AF_INET, sin_port=htons(46829),
sin_addr=inet_addr("205.244.47.142")}, [16]) = 77
10853 09:41:28 gettimeofday({1166546488, 507542}, NULL) = 0
10853 09:41:28 kill(10851, SIGRTMIN)    = 0
10853 09:41:28 recvfrom(5,
"Yv\0\20\0\1\0\0\0\0\0\1\00225\00290\00211\003208\7in-a"..., 512, 0,
{sa_family=AF_INET, sin_port=htons(53364),
sin_addr=inet_addr("208.40.195.228")}, [16]) = 54
10853 09:41:28 gettimeofday({1166546488, 518007}, NULL) = 0
10853 09:41:28 time(NULL)               = 1166546488
10853 09:41:28 sendto(5,
"Yv\204\0\0\1\0\1\0\0\0\0\00225\00290\00211\003208\7in-"..., 74, 0,
{sa_family=AF_INET, sin_port=htons(53364),
sin_addr=inet_addr("208.40.195.228")}, 16) = 74
10853 09:41:28 gettimeofday({1166546488, 518522}, NULL) = 0
10853 09:41:28 recvfrom(5,
":\317\0\20\0\1\0\0\0\0\0\1\3ns1\7webpipe\3net\0\0\34\0"..., 512, 0,
{sa_family=AF_INET, sin_port=htons(36921), sin_addr=inet_addr("12.5.136.191")},
[16]) = 44

Then, at the point where it died, the strace shows this:
10853 09:41:28 gettimeofday(^@^@^@^@^@^@^@^@^@^@^@^@^... About 110,000 more of
these ^@ symbols (as displayed in vi)... and then {1, 0})                   = 0
(and a bunch of spaces)

It then goes into a loop every second where it does this:
09:41:07 waitpid(10838, 0xbfe8ebd8, WNOHANG) = 0
09:41:07 rt_sigprocmask(SIG_BLOCK, [CHLD], [RTMIN], 8) = 0
09:41:07 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
09:41:07 rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0
09:41:07 nanosleep({1, 0}, {1, 0})                   = 0

So, it would appear that some call to gettimeofday is getting populated with
garbage?  I'm not a C programmer, so I'm trying to look through the source code
to see where this call is happening, but finding it hard to follow.    Can
anybody see why this would be happening?

Thanks,
Brandon Checketts
Webpipe.net System Administrator

Brandon Checketts wrote:
> Here are all of the details I can come up with:
> 
> 1- tcpdump clearly shows that the server is getting DNS requests
> 
> 2- Output of "pdns dump"
> [root at powerdns2 /]# /etc/init.d/pdns dump
> corrupt-packets=426,deferred-cache-inserts=44,deferred-cache-lookup=100,latency=32709,
> packetcache-hit=17770,packetcache-miss=20433,packetcache-size=5658,qsize-q=37,
> query-cache-hit=89780,query-cache-miss=83406,recursing-answers=0,recursing-questions=0,
> servfail-packets=0,tcp-answers=2,tcp-queries=2,timedout-packets=0,udp-answers=37835,
> udp-queries=38336,udp4-answers=37835,udp4-queries=38282,udp6-answers=0,udp6-queries=0,
> 
> When it was locked up, I ran the command several minutes later and the numbers
> were exactly the same
> 
> 3- Confirmed that the built-in PowerDNS web server was up and accessible.
> Interesting numbers from that page:
> 
> Uptime: 1.85 hours Queries/second, 1, 5, 10 minute averages: 2.61e-14, 0.0111,
> 0.304. Max queries/second: 11.4
> Cache hitrate, 1, 5, 10 minute averages: 49%, 48%, 48%
> Backend query cache hitrate, 1, 5, 10 minute averages: 48%, 50%, 51%
> Backend query load, 1, 5, 10 minute averages: 4.18e-14, 0.0216, 0.614. Max
> queries/second: 23.8
> Total queries: 38336. Question/answer latency: 32.7ms
> 
> corrupt-packets         426     Number of corrupt packets received
> deferred-cache-inserts  44      Amount of cache inserts that were deferred
> because of maintenance
> deferred-cache-lookup   100     Amount of cache lookups that were deferred
> because of maintenance
> latency                 32709   Average number of microseconds needed to answer
> a question
> packetcache-hit         17770
> packetcache-miss        20433
> packetcache-size        5658
> qsize-q                 37      Number of questions waiting for database attention
> query-cache-hit         89780   Number of hits on the query cache
> query-cache-miss        83406   Number of misses on the query cache
> recursing-answers       0       Number of recursive answers sent out
> recursing-questions     0       Number of questions sent to recursor
> servfail-packets        0       Number of times a server-failed packet was sent out
> tcp-answers             2       Number of answers sent out over TCP
> tcp-queries             2       Number of TCP queries received
> timedout-packets        0       Number of packets which weren't answered within
> timeout set
> udp-answers             37835   Number of answers sent out over UDP
> udp-queries             38336   Number of UDP queries received
> udp4-answers            37835   Number of IPv4 answers sent out over UDP
> udp4-queries            38282   Number of IPv4UDP queries received
> udp6-answers            0       Number of IPv6 answers sent out over UDP
> udp6-queries            0       Number of IPv6 UDP queries received
> 
> 
> 4- A couple of show commands:
> [root at powerdns2 /]#  /etc/init.d/pdns show qsize-q
> qsize-q=37
> 
> [root at powerdns2 /]#  /etc/init.d/pdns show packetcache-size
> packetcache-size=5658
> 
> 
> 5- While it was locked up, I did an strace on each of the pdns processes.   The
> 3 processes I believe were the MySQL Backends just said "rt_sigsuspend([]".  The
> process that I saw the actual queries coming in/going out of also just said
> "rt_sigsuspend([]"
> 
> 
> One interesting thing from the straces was this:
> this process:
> pdns       821   817  0 18:40 ?        00:00:00 /usr/sbin/pdns_server-instance
> --daemon --guardian=yes
> is doing this each second:
> recvfrom(13, 0xbebff2bc, 1500, 0, 0xbebff8f8, 0xbebff91c) = -1 EAGAIN (Resource
> temporarily unavailable)
> time(NULL)                              = 1166498719
> time(NULL)                              = 1166498719
> rt_sigprocmask(SIG_BLOCK, [CHLD], [RTMIN], 8) = 0
> rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
> rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0
> 
> The "Resource temporarily unavailable" may seem to point to something, but it
> appears there during normal operation as well.
> 
> 
> It may be worth noting, that if I leave an strace running on the process that is
> sending/receiving requests, it dies occasionally with this message:
> 
> gettimeofday(upeek: ptrace(PTRACE_PEEKUSER,12878,44,0): No such process
> Process 12878 detached
> 
> I should also point out that this powerdns server is actually a virtual server
> running on a Linux VServer with kernel 2.6.18.3-vs2.1.1.2.  The other powerdns
> server that we are running and that is working fine is also running on a Linux
> VServer with kernel 2.6.18.2-vs2.1.1
> 
> 
> 
> Any ideas about why this is dying would be appreciated.  This server is an
> authoritative name for about a thousand domains, so I obviously would like to
> get this resolved instead of having to restart the process when it dies (which
> happens 5-8 times a day).
> 
> 
> Thanks,
> Brandon Checketts
> Webpipe.net System Administrator
> 
> 
> 
> 
> 
> Brandon Checketts wrote:
>> Augie,
>>
>> Thank you for your response.  I'm running version 2.9.20 that was installed from
>> the pdns-static-2.9.20-1 RPM.  UDP is definitely failing, not sure about TCP as
>> we get very few TCP Requests.
>>
>> I've just installed strace, and can run it.  Each pdns_server-instance processes
>> display different things, but I'm not a programmer, so I don't understand much
>> of the output.   I can pick out the three processes that are doing the MySQL
>> queries, and the one that looks like its receiving/sending the actual DNS
>> traffic, so I will do an strace on it the next time that it fails to see if it
>> identifies anything.  I will also try the pdns "dump" and see if that reveals
>> anything.
>>
>> I looked at a tcpdump that I had running at the time of a failure and didn't see
>> anything unusual, just DNS requests, MySQL lookups/replys and DNS answers.
>> Then, when it fails, only incoming DNS requests with no other traffic until it
>> is restarted.
>>
>> Thanks,
>> Brandon Checketts
>> Webpipe.net System Administrator
>>
>>
>>
>> Augie Schwer wrote:
>>> On 12/17/06, Brandon Checketts <brandonc at webpipe.net> wrote:
>>>> I've just replaced two BIND servers with PowerDNS Servers.   They are
>>>> configured
>>>> to use the gmysql backend, and MySQL is performing the replication
>>>> between them.
>>>>   Everything seems to be working fine, but on the "slave" server, pdns
>>>> seems to
>>>> just quit responding to DNS requests sometimes.   The web server
>>>> portion of it
>>>> continues to respond, but DNS queries just don't get answered.   Also,
>>>> when it
>>>> locks up, a netstat -l -n shows this (notice the Recv-Q for udp on
>>>> port 53):
>>> What version are you running? If there is nothing in the logs, then
>>> you can try straceing one of the threads to see if that reveals any
>>> more info. Is it both UDP and TCP that stop responding? Did you try
>>> tcpdumping on the interface to make sure you are receiving the
>>> requests. You can also look at the stats:
>>>
>>> /etc/init.d/pdns dump
>>>
>>> The latency and q-size stats may be revealing.
>>>
>>>
>> _______________________________________________
>> Pdns-users mailing list
>> Pdns-users at mailman.powerdns.com
>> http://mailman.powerdns.com/mailman/listinfo/pdns-users
>>
>>
> _______________________________________________
> Pdns-users mailing list
> Pdns-users at mailman.powerdns.com
> http://mailman.powerdns.com/mailman/listinfo/pdns-users
> 
>