[Pdns-users] Reg. PDNS recursor Ver 4.1.16

Wed Dec 9 08:48:11 UTC 2020

On 09/12/2020 07:30, Kiran Kumar via Pdns-users wrote:
> How do we minimize answers-slow, We are running on CentOS Linux 
> release 7.9.2009 (Core)
> on VM with 4VCPUs and 16GB RAM.
>
> rec_control get-all | grep answer
> *answers-slow    80903*
> answers0-1      598471
> answers1-10     1057756
> answers10-100   2342082
> answers100-1000 1341675

For explanation see: 
https://docs.powerdns.com/recursor/metrics.html#gathered-information

answers-slow counts queries answered after more than 1 second, and in 
your case represents about 1.5% of answers. However, you've not shown 
packetcache-hits, so the fraction of client queries affected is likely 
to be far lower than that.
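
As a rough check of that figure, here is a minimal Python sketch that parses the counters pasted above and computes the slow share. The function names `parse_counters` and `slow_share` are made up for illustration, and the input is assumed to be "name value" lines as printed by `rec_control get-all` (the asterisks added by the mail client are stripped).

```python
def parse_counters(text):
    """Parse 'name value' lines (as pasted from rec_control get-all) into a dict."""
    counters = {}
    for line in text.splitlines():
        # Drop the '*' emphasis markers that the mail client added.
        parts = line.replace("*", "").split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def slow_share(counters):
    """Fraction of all answers* counters that falls in answers-slow."""
    total = sum(v for k, v in counters.items() if k.startswith("answers"))
    return counters["answers-slow"] / total if total else 0.0


SAMPLE = """\
*answers-slow    80903*
answers0-1      598471
answers1-10     1057756
answers10-100   2342082
answers100-1000 1341675
"""

if __name__ == "__main__":
    c = parse_counters(SAMPLE)
    print(f"slow share: {slow_share(c):.2%}")
```

With the numbers from the original message this works out to 80903 out of 5420887 answers, a little under 1.5%.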

In resolving a given query, the recursor is going to have to contact one 
or more authoritative nameservers on the Internet. These are some 
reasons why it might take more than 1 second to get the final answer:

- the answer is not already in cache (obviously) - this happens more 
frequently if the authoritative server sets a low TTL for that 
domain; AND
- the first authoritative server tried is down (or there is a transient 
network problem reaching it), so pdns times out and tries another one; OR
- multiple authoritative servers need to be contacted, with a large 
round-trip time to each; OR
- the client is querying for a domain which is completely lame / broken, 
so pdns cannot find any answer.
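
To illustrate how a single dead server, or a chain of distant ones, pushes a lookup past the 1-second mark, here is a toy model in Python. It is not how pdns schedules its outgoing queries internally. The 1500 ms per-try timeout is an assumption (the recursor has a `network-timeout` setting; check its value in your version), and the function name and all RTT figures are invented for the example.

```python
TIMEOUT_MS = 1500  # assumed per-try timeout; compare with network-timeout in your config


def resolution_time_ms(servers, timeout_ms=TIMEOUT_MS):
    """Try servers in order; each entry is (rtt_ms, is_up).

    Returns the total elapsed milliseconds until the first reply,
    or None if every server times out (a lame / broken domain).
    """
    elapsed = 0
    for rtt_ms, is_up in servers:
        if is_up and rtt_ms < timeout_ms:
            return elapsed + rtt_ms
        elapsed += timeout_ms  # waited the full timeout before moving on
    return None


if __name__ == "__main__":
    # One healthy nearby server: fast answer.
    print(resolution_time_ms([(30, True)]))
    # First server down, second one fine: already over 1 second.
    print(resolution_time_ms([(30, False), (40, True)]))
    # Four delegation steps at 300 ms each, all healthy: also over 1 second.
    print(sum(resolution_time_ms([(300, True)]) for _ in range(4)))
    # Every server dead: no answer at all.
    print(resolution_time_ms([(30, False), (40, False)]))
```

In this model a single timeout is enough to land a query in answers-slow, even though the recursor itself is doing nothing wrong.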

This doesn't necessarily indicate a problem with your own pdns server at 
all.  It could just as well be problems with some authoritative domains 
on the Internet. Heaven knows there are plenty of broken domains out 
there :-)

It could however be made worse by packet loss or congestion on your 
network or your network's upstream link.  If your recursor is on a 
private IP address behind a NAT, it would be better to put it on a 
public IP address, so that it doesn't have to generate NAT state for 
every outbound query it makes.  If your uplink is congested, which will 
cause latency and packet loss, then there's not much you can do short of 
buying more bandwidth.

It could also be made worse by excessive load on your server causing it 
to fall behind or drop queries, or by insufficient RAM causing it to 
kick out cache entries prematurely, so you should also use a suitable 
tool to monitor your server resource utilisation (netdata 
<https://github.com/netdata/netdata> is very good for this: it monitors 
at 1-second resolution by default, so it lets you see short bursts of 
activity).  However, your server may be completely fine.

For comparison, here's the tiny cache on my home network:

root@cache1:~# rec_control get-all | egrep '^(answers|packetcache-hits|over-capacity-drops|policy-drops)'
answers-slow         348
answers0-1           6118
answers1-10          7149
answers10-100        9074
answers100-1000      4695
over-capacity-drops  0
packetcache-hits     1983665
policy-drops         0

and here's a production DNS cache in a data centre:

root@wrn-dns1:~# rec_control get-all | egrep '^(answers|packetcache-hits|over-capacity-drops|policy-drops)'
answers-slow         1710185
answers0-1           40045388
answers1-10          132638392
answers10-100        101328465
answers100-1000      11033827
over-capacity-drops  0
packetcache-hits     8907014600
policy-drops         0

The fraction of answers-slow out of all the answers* counters is not 
hugely different from what you see. Also notice that packetcache-hits is 
far higher than all of the answers* counters combined.
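
To put numbers on that comparison, here is a short Python sketch that computes the slow share for the two caches above, first among the answers* buckets only and then among all client queries. It follows the reading in this message that packet-cache hits are not counted in the answers* buckets, which may not hold for every recursor version. The function name and data layout are illustrative only.

```python
def slow_shares(answers_slow, other_answers, packetcache_hits):
    """Return (share among answers*, share among answers* plus packet-cache hits)."""
    answered = answers_slow + sum(other_answers)
    return answers_slow / answered, answers_slow / (answered + packetcache_hits)


# (answers-slow, [answers0-1, answers1-10, answers10-100, answers100-1000], packetcache-hits)
CACHES = {
    "home": (348, [6118, 7149, 9074, 4695], 1983665),
    "data centre": (1710185, [40045388, 132638392, 101328465, 11033827], 8907014600),
}

if __name__ == "__main__":
    for name, args in CACHES.items():
        of_answers, of_all = slow_shares(*args)
        print(f"{name}: {of_answers:.2%} of answers, {of_all:.3%} of all queries")
```

This gives roughly 1.3% (home) and 0.6% (data centre) of answers, and under 0.02% of all client queries in both cases once packet-cache hits are included.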

Regards,

Brian.
