[Pdns-users] Spikey response times in powerdns recursor

Simon Bedford sbedford at plus.net
Wed Mar 17 10:43:19 UTC 2010


Hi guys,

Apologies if this has been discussed before, but as a new mailing list 
user I have not seen anything.

We have been running the recursor as a caching name server for a number 
of months, having moved from Unbound. Since then DNS response times have 
been good, in fact quick, but on 3.1.7.1, 3.1.7.2 and also 3.2.1 we see 
random spikes of up to 2 seconds in response times, often at the 
quietest times for the name servers.

I had put this down to the 3.1 version after reading the changelog and 
the bugs fixed in 3.2, but having upgraded we still see the same spiking. 
The spikes are now more frequent overnight but not quite as severe as 
they were.

We are using hardware load balancers with 4 servers behind each, each 
server listens on multiple ports and I now have the recursor running on 
2 threads (a new feature in 3.2).

The servers have no real load and CPU is mostly 95% idle. They have 8G 
of RAM and never go over 2G used by the whole OS (Debian Etch) and software.

Graphs show norms of between 20 and 40ms, but the spikes are 700ms and 
over. As a result our external monitoring and scoring against other 
companies suffers, and in the worst cases the name servers appear 
unavailable.

I realise that outside lookups will influence the results, but it's weird 
that the servers are more responsive at their busiest than when it's 
quiet, and that most of the unusual behaviour happens at the quiet times.

Recursor performance graphing and dnsscope stats look OK, although the 
average time to respond goes up by 100% overnight. See the sample stats 
below from overnight/this morning :-

Timespan: 0.828056 hours
Saw 4049548 correct packets, 0 runts, 0 oversize, 0 unknown encaps, 99 dns decoding errors, 0 bogus packets
3467 packets went unanswered, of which 1 were answered on exact retransmit
1047 answers could not be matched to questions
99 answers were unsatisfactory (indefinite, or SERVFAIL)
7764 answers (would be) discarded because older than 2 seconds
Rcode	Count
0	1482490
2	16215
3	166680
5	1
68.45% of questions answered within 50 usec (68.45%)
71.06% of questions answered within 100 usec (2.62%)
74.16% of questions answered within 200 usec (3.10%)
74.30% of questions answered within 250 usec (0.14%)
74.36% of questions answered within 300 usec (0.06%)
74.40% of questions answered within 350 usec (0.03%)
74.42% of questions answered within 400 usec (0.02%)
74.46% of questions answered within 800 usec (0.04%)
74.48% of questions answered within 1000 usec (0.02%)
77.90% of questions answered within 2.00 msec (3.42%)
79.80% of questions answered within 4.00 msec (1.90%)
80.72% of questions answered within 8.00 msec (0.92%)
84.03% of questions answered within 16.00 msec (3.31%)
85.88% of questions answered within 32.00 msec (1.85%)
87.11% of questions answered within 64.00 msec (1.23%)
93.06% of questions answered within 128.00 msec (5.95%)
96.86% of questions answered within 256.00 msec (3.80%)
98.37% of questions answered within 512.00 msec (1.50%)
98.79% of questions answered within 1024.00 msec (0.42%)
100.00% of questions answered within 2048.00 msec (1.21%)
Average response time: 40419.9 usec

As opposed to a run when everything is OK :-

Timespan: 0.381944 hours
Saw 3929598 correct packets, 0 runts, 0 oversize, 0 unknown encaps, 58 dns decoding errors, 0 bogus packets
2098 packets went unanswered, of which 0 were answered on exact retransmit
4813 answers could not be matched to questions
58 answers were unsatisfactory (indefinite, or SERVFAIL)
1882 answers (would be) discarded because older than 2 seconds
Rcode	Count
0	1550451
2	7742
3	125547
5	16
70.36% of questions answered within 50 usec (70.36%)
73.27% of questions answered within 100 usec (2.91%)
76.53% of questions answered within 200 usec (3.26%)
76.82% of questions answered within 250 usec (0.29%)
76.96% of questions answered within 300 usec (0.14%)
77.04% of questions answered within 350 usec (0.08%)
77.09% of questions answered within 400 usec (0.05%)
77.18% of questions answered within 800 usec (0.09%)
77.20% of questions answered within 1000 usec (0.02%)
79.46% of questions answered within 2.00 msec (2.26%)
81.67% of questions answered within 4.00 msec (2.21%)
82.83% of questions answered within 8.00 msec (1.16%)
86.36% of questions answered within 16.00 msec (3.53%)
88.47% of questions answered within 32.00 msec (2.11%)
89.89% of questions answered within 64.00 msec (1.42%)
94.79% of questions answered within 128.00 msec (4.89%)
98.18% of questions answered within 256.00 msec (3.39%)
99.32% of questions answered within 512.00 msec (1.14%)
99.59% of questions answered within 1024.00 msec (0.28%)
100.00% of questions answered within 2048.00 msec (0.41%)
Average response time: 24119.3 usec
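
For anyone who wants to compare the two runs numerically, here is a rough 
Python sketch. It is not part of dnsscope; the function names are made up 
for illustration, and only a few of the cumulative lines are pasted in. The 
counts, timespans and averages are copied from the two outputs above. Note 
that the rate is packets seen (questions and answers), not queries.

```python
import re

# A few cumulative lines copied from the two dnsscope runs above.
OVERNIGHT = """\
93.06% of questions answered within 128.00 msec (5.95%)
98.37% of questions answered within 512.00 msec (1.50%)
98.79% of questions answered within 1024.00 msec (0.42%)
"""
NORMAL = """\
94.79% of questions answered within 128.00 msec (4.89%)
99.32% of questions answered within 512.00 msec (1.14%)
99.59% of questions answered within 1024.00 msec (0.28%)
"""

LINE = re.compile(r"([\d.]+)% of questions answered within ([\d.]+) (usec|msec)")


def parse_cumulative(text):
    """Return {bucket upper bound in usec: cumulative percent answered}."""
    buckets = {}
    for pct, bound, unit in LINE.findall(text):
        usec = float(bound) * (1000 if unit == "msec" else 1)
        buckets[usec] = float(pct)
    return buckets


def slower_than(buckets, usec):
    """Percent of questions that took longer than the given bucket bound."""
    return round(100.0 - buckets[usec], 2)


def packets_per_second(packets, hours):
    """Average packet rate over the capture (questions plus answers)."""
    return packets / (hours * 3600)


def rcode_share(counts, rcode):
    """Percent of answers carrying the given rcode (2 is SERVFAIL)."""
    return 100.0 * counts[rcode] / sum(counts.values())


overnight = parse_cumulative(OVERNIGHT)
normal = parse_cumulative(NORMAL)

# Rcode tables from the two runs above.
overnight_rcodes = {0: 1482490, 2: 16215, 3: 166680, 5: 1}
normal_rcodes = {0: 1550451, 2: 7742, 3: 125547, 5: 16}

# Relative change in the reported average response time, in percent.
avg_increase = round((40419.9 - 24119.3) / 24119.3 * 100, 1)

if __name__ == "__main__":
    for ms in (128, 512, 1024):
        print(f"slower than {ms} ms:",
              slower_than(overnight, ms * 1000), "% overnight vs",
              slower_than(normal, ms * 1000), "% normal")
    print("packets/s overnight:", round(packets_per_second(4049548, 0.828056)))
    print("packets/s normal:   ", round(packets_per_second(3929598, 0.381944)))
    print("SERVFAIL share overnight:", round(rcode_share(overnight_rcodes, 2), 2), "%")
    print("SERVFAIL share normal:   ", round(rcode_share(normal_rcodes, 2), 2), "%")
    print("average response time increase:", avg_increase, "%")
```

On these two samples the overnight capture has roughly half the packet 
rate of the normal one, more than twice the share of questions taking over 
512 msec, and about double the SERVFAIL share.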

None of this behaviour was seen with either Unbound or BIND. We moved 
away from these because of other limitations and security concerns, but 
may have to look at moving back to Unbound if this persists.

Any help much appreciated.

Simon


