[Pdns-users] Spikey response times in powerdns recursor
Simon Bedford
sbedford at plus.net
Wed Mar 17 10:43:19 UTC 2010
Hi guys,
Apologies if this has been discussed before, but as a new mailing list
user I have not seen anything on it.
We have been running the recursor as a caching name server for a number
of months, having moved from Unbound. Since then DNS response times have
generally been good, in fact quick, but on 3.1.7.1, 3.1.7.2 and also
3.2.1 we see random spikes of up to 2 seconds in response times, often
at the quietest times for the name servers.
I had put this down to the 3.1 version after reading the changelog and
the bugs fixed in 3.2, but having upgraded we still see the same spiking.
The spikes are now more frequent overnight, but not quite as severe as
they were.
We are using hardware load balancers with 4 servers behind each. Each
server listens on multiple ports, and I now have the recursor running on
2 threads (a new feature in 3.2).
The servers have no real load and the CPU is mostly 95% idle. They have
8G of RAM and never go over 2G used by the whole OS (Debian Etch) and
software.
Graphs show norms of between 20 and 40ms, but the spikes are 700ms and
over. As a result our external monitoring and our scoring against other
companies suffer, and in the worst cases the name servers appear
unavailable.
I realise that outside lookups will influence the results, but it's
weird that the servers are more responsive at their busiest than when
it's quiet, and that most of the unusual behaviour happens at the quiet
times.
Recursor performance graphing and dnsscope stats look OK, although the
average time to respond goes up by 100% overnight. See the sample stats
below from overnight/this morning :-
Timespan: 0.828056 hours
Saw 4049548 correct packets, 0 runts, 0 oversize, 0 unknown encaps, 99
dns decoding errors, 0 bogus packets
3467 packets went unanswered, of which 1 were answered on exact retransmit
1047 answers could not be matched to questions
99 answers were unsatisfactory (indefinite, or SERVFAIL)
7764 answers (would be) discarded because older than 2 seconds
Rcode Count
0 1482490
2 16215
3 166680
5 1
68.45% of questions answered within 50 usec (68.45%)
71.06% of questions answered within 100 usec (2.62%)
74.16% of questions answered within 200 usec (3.10%)
74.30% of questions answered within 250 usec (0.14%)
74.36% of questions answered within 300 usec (0.06%)
74.40% of questions answered within 350 usec (0.03%)
74.42% of questions answered within 400 usec (0.02%)
74.46% of questions answered within 800 usec (0.04%)
74.48% of questions answered within 1000 usec (0.02%)
77.90% of questions answered within 2.00 msec (3.42%)
79.80% of questions answered within 4.00 msec (1.90%)
80.72% of questions answered within 8.00 msec (0.92%)
84.03% of questions answered within 16.00 msec (3.31%)
85.88% of questions answered within 32.00 msec (1.85%)
87.11% of questions answered within 64.00 msec (1.23%)
93.06% of questions answered within 128.00 msec (5.95%)
96.86% of questions answered within 256.00 msec (3.80%)
98.37% of questions answered within 512.00 msec (1.50%)
98.79% of questions answered within 1024.00 msec (0.42%)
100.00% of questions answered within 2048.00 msec (1.21%)
Average response time: 40419.9 usec
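To make the tail easier to read, here is a minimal Python sketch (not
part of dnsscope; the function names are my own) that parses the
cumulative lines and reports which bucket a given percentile falls in.
It assumes the line format shown above and only includes the last few
lines of the overnight run as sample input.

```python
import re

# A few cumulative lines copied from the overnight dnsscope output above.
SAMPLE = """\
93.06% of questions answered within 128.00 msec (5.95%)
96.86% of questions answered within 256.00 msec (3.80%)
98.37% of questions answered within 512.00 msec (1.50%)
98.79% of questions answered within 1024.00 msec (0.42%)
100.00% of questions answered within 2048.00 msec (1.21%)
"""

# Assumed line format: "<cumulative>% of questions answered within <bound> <unit>"
LINE_RE = re.compile(r"([\d.]+)% of questions answered within ([\d.]+) (usec|msec)")


def parse_buckets(text):
    """Return (upper_bound_in_msec, cumulative_percent) pairs in input order."""
    buckets = []
    for match in LINE_RE.finditer(text):
        cumulative, bound, unit = match.groups()
        bound_msec = float(bound) / 1000 if unit == "usec" else float(bound)
        buckets.append((bound_msec, float(cumulative)))
    return buckets


def percentile_bucket(buckets, percentile):
    """Smallest bucket bound whose cumulative share reaches the percentile."""
    for bound_msec, cumulative in buckets:
        if cumulative >= percentile:
            return bound_msec
    return None


buckets = parse_buckets(SAMPLE)
for p in (95, 99):
    print(f"p{p} falls in the bucket ending at {percentile_bucket(buckets, p)} msec")
```

On these lines the 95th percentile lands in the 256 msec bucket, while
the 99th percentile only lands in the 2048 msec bucket, i.e. the slowest
1% of answers take more than a second.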
As opposed to a run when everything is OK :-
Timespan: 0.381944 hours
Saw 3929598 correct packets, 0 runts, 0 oversize, 0 unknown encaps, 58
dns decoding errors, 0 bogus packets
2098 packets went unanswered, of which 0 were answered on exact retransmit
4813 answers could not be matched to questions
58 answers were unsatisfactory (indefinite, or SERVFAIL)
1882 answers (would be) discarded because older than 2 seconds
Rcode Count
0 1550451
2 7742
3 125547
5 16
70.36% of questions answered within 50 usec (70.36%)
73.27% of questions answered within 100 usec (2.91%)
76.53% of questions answered within 200 usec (3.26%)
76.82% of questions answered within 250 usec (0.29%)
76.96% of questions answered within 300 usec (0.14%)
77.04% of questions answered within 350 usec (0.08%)
77.09% of questions answered within 400 usec (0.05%)
77.18% of questions answered within 800 usec (0.09%)
77.20% of questions answered within 1000 usec (0.02%)
79.46% of questions answered within 2.00 msec (2.26%)
81.67% of questions answered within 4.00 msec (2.21%)
82.83% of questions answered within 8.00 msec (1.16%)
86.36% of questions answered within 16.00 msec (3.53%)
88.47% of questions answered within 32.00 msec (2.11%)
89.89% of questions answered within 64.00 msec (1.42%)
94.79% of questions answered within 128.00 msec (4.89%)
98.18% of questions answered within 256.00 msec (3.39%)
99.32% of questions answered within 512.00 msec (1.14%)
99.59% of questions answered within 1024.00 msec (0.28%)
100.00% of questions answered within 2048.00 msec (0.41%)
Average response time: 24119.3 usec
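As a rough cross-check of the two samples, the following Python sketch
puts the headline figures side by side. The numbers are copied from the
two dnsscope runs above; the dictionary keys and the summarize() helper
are just illustrative names, and "packets per second" is simply the
packet count divided by the timespan.

```python
# Figures copied from the two dnsscope runs above.
runs = {
    "overnight": {"hours": 0.828056, "packets": 4049548,
                  "avg_usec": 40419.9, "within_512ms_pct": 98.37},
    "normal": {"hours": 0.381944, "packets": 3929598,
               "avg_usec": 24119.3, "within_512ms_pct": 99.32},
}


def summarize(run):
    """Derive packet rate, average latency in msec and the share slower than 512 msec."""
    seconds = run["hours"] * 3600
    return {
        "packets_per_sec": run["packets"] / seconds,
        "avg_msec": run["avg_usec"] / 1000,
        "slower_than_512ms_pct": 100 - run["within_512ms_pct"],
    }


summary = {name: summarize(run) for name, run in runs.items()}

# Relative increase of the overnight average over the normal average.
avg_increase_pct = (runs["overnight"]["avg_usec"] / runs["normal"]["avg_usec"] - 1) * 100

for name, s in summary.items():
    print(f"{name}: {s['packets_per_sec']:.0f} packets/s, "
          f"avg {s['avg_msec']:.1f} ms, "
          f"{s['slower_than_512ms_pct']:.2f}% slower than 512 ms")
print(f"average response time increase: {avg_increase_pct:.0f}%")
```

For these two samples that works out to roughly 1,360 packets per second
overnight against roughly 2,860 in the normal run, with the overnight
average response time about 68% higher.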
None of this behaviour was seen with either Unbound or BIND. We moved
from these because of other limitations/security concerns, but may have
to look at moving back to Unbound if this persists.
Any help much appreciated.
Simon