[Pdns-users] pdns-recursor metrics review and tuning advice request
Otto Moerbeek
otto at drijf.net
Sat Apr 19 07:29:11 UTC 2025
Remarks inline.
On Fri, Apr 18, 2025 at 07:04:18PM -0400, Scott Crace wrote:
> Otto,
> Thanks for your assistance. Since these were set up with private IPs I wasn't
> sure how useful the config would be; however, I have included it below.
>
> # rec_control dump-throttlemap -
> ; throttle map dump follows
> ; remote IP qname qtype count ttd reason
> 10.0.196.197 0.10.in-addr.arpa A 2 2025-04-18T18:44:22 RCodeRefused
> 10.0.196.197 10.10.in-addr.arpa A 3 2025-04-18T18:44:25 RCodeRefused
> 10.0.196.197 255.10.in-addr.arpa A 1 2025-04-18T18:44:23 RCodeRefused
> 10.0.62.244 0.10.in-addr.arpa A 2 2025-04-18T18:44:22 RCodeRefused
> 10.0.62.244 10.10.in-addr.arpa A 3 2025-04-18T18:44:25 RCodeRefused
> 10.0.62.244 255.10.in-addr.arpa A 2 2025-04-18T18:44:23 RCodeRefused
> dump-throttlemap: dumped 6 records
Looking at your config below, you are forwarding to servers that do not
want to answer those queries. Make sure you either do not forward or
change the auths to respond properly. "Refused" means the auth does
not serve the particular zone. An auth that responds Refused to a lot of
queries will be throttled for those specific queries.
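
To see at a glance which forwarders are being throttled and why, the dump can be summarized per remote. The following is only a minimal sketch: the sample data is copied from the dump quoted above, the six-column layout is assumed from that dump's header line, and `summarize_throttle_dump` is a hypothetical helper, not part of rec_control.

```python
from collections import Counter

# Sample copied from the dump quoted in this thread (one record per line).
DUMP = """\
; throttle map dump follows
; remote IP qname qtype count ttd reason
10.0.196.197 0.10.in-addr.arpa A 2 2025-04-18T18:44:22 RCodeRefused
10.0.196.197 10.10.in-addr.arpa A 3 2025-04-18T18:44:25 RCodeRefused
10.0.196.197 255.10.in-addr.arpa A 1 2025-04-18T18:44:23 RCodeRefused
10.0.62.244 0.10.in-addr.arpa A 2 2025-04-18T18:44:22 RCodeRefused
10.0.62.244 10.10.in-addr.arpa A 3 2025-04-18T18:44:25 RCodeRefused
10.0.62.244 255.10.in-addr.arpa A 2 2025-04-18T18:44:23 RCodeRefused
dump-throttlemap: dumped 6 records
"""


def summarize_throttle_dump(text):
    """Sum the throttle counts per (remote IP, reason) pair."""
    totals = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Skip comments, the trailing summary line and anything that does
        # not have the assumed six columns.
        if len(fields) != 6 or fields[0].startswith(";"):
            continue
        remote, _qname, _qtype, count, _ttd, reason = fields
        totals[(remote, reason)] += int(count)
    return totals


for (remote, reason), count in sorted(summarize_throttle_dump(DUMP).items()):
    print(remote, reason, count)
```

If every throttled entry points at the same forwarders with the same reason, as it does here, the forwarding setup is the first thing to look at.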
>
> # rec_control dump-failedservers -
> I removed any count 1 or 2 for brevity since this email is already a long
> read.
> ; failed servers dump follows
> ; remote IP count timestamp
> 203.119.25.5 8 2025-04-18T18:43:44
> 203.119.26.5 8 2025-04-18T18:43:42
> 203.119.27.5 8 2025-04-18T18:43:41
> 203.119.28.5 8 2025-04-18T18:43:39
> 203.119.29.5 8 2025-04-18T18:43:45
> 200.189.41.10 7 2025-04-18T18:42:46
> 200.219.148.10 6 2025-04-18T18:39:47
> 200.219.154.10 6 2025-04-18T18:42:43
> 200.219.159.10 7 2025-04-18T18:42:45
> 200.192.233.10 7 2025-04-18T18:42:40
> 200.229.248.10 4 2025-04-18T18:42:42
> 203.119.95.53 3 2025-04-18T18:39:30
> 203.119.86.101 1229 2025-04-18T18:40:03
> 35.173.255.124 4895 2025-04-18T18:36:21
> dump-failedservers: dumped 43 records
Depending on how long your recursor has been running, some of these
counts are pretty high. This *might* indicate connectivity issues, but
no definite conclusion can be drawn yet. Some network troubleshooting
might be in order, especially as 203.119.86.101 is ns3.apnic.net, which
*should* be a server that's reachable and responding properly.
35.173.255.124 looks like a random AWS IP.
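
One way to start that troubleshooting is to send a plain UDP query to a listed server from the recursor host and see whether anything comes back within the timeout. The sketch below uses only the Python standard library; `build_query` and `probe` are hypothetical helper names, and which qname a given server is expected to answer for is an assumption you have to fill in yourself.

```python
import socket
import struct


def build_query(qname, qtype=2, query_id=0x1234):
    """Build a minimal non-recursive DNS query (qtype 2 = NS, class IN)."""
    # Header: id, flags (all zero, so RD is off), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in qname.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def probe(server, qname, timeout=1.5):
    """Return the rcode of the reply, or None on timeout or network error."""
    query = build_query(qname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(4096)
        except OSError:  # includes socket.timeout
            return None
    if len(reply) < 12 or reply[:2] != query[:2]:
        return None
    return reply[3] & 0x0F  # low four bits of the second flags byte
```

For example, `probe("203.119.86.101", "apnic.net")` would be a reasonable first test if that server indeed serves apnic.net (an assumption based on its name). A result of `None` from the recursor host while the same probe succeeds from elsewhere points at the local network or firewall rather than at the remote server.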
>
>
> Config(s)
>
> Please note that one of the zones forwarding is 'split brained' from a
> legacy setup. The zone consists of a private Active Directory environment
> and a separately maintained public zone. The configuration forwards to the
> private AD servers and I believe the lua script drops queries that have no
> match in that zone. The public zone is being slowly phased out.
>
> While reviewing the previous server configs I found a comment
> about this value but no context for the specific reasoning. This may
> explain the values you noted but I would like to understand the
> implications of removing it. It doesn't seem like something that should
> have been enabled.
> # https://github.com/PowerDNS/pdns/issues/6186
> max-negative-ttl=0
That is indeed potentially killing performance. Better to leave it at
the default, unless you have very specific reasons to change it. In
practice any DNS server spends quite a lot of its time answering
negatively. Not caching negative answers will cause quite a lot of work,
since the recursor will need to contact the auths again and again for
each client query that leads to a negative answer.
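
The effect is easy to see in a toy model. This is only an illustrative sketch and not how rec is implemented: `ToyResolver` and `count_upstream` are made-up names, the SOA minimum of 3600 seconds is an arbitrary example value, and the 3600 passed as the cap in the second call is what I believe the default of max-negative-ttl to be.

```python
class ToyResolver:
    """Toy model of a resolver with a capped negative cache."""

    def __init__(self, max_negative_ttl):
        self.max_negative_ttl = max_negative_ttl
        self.negcache = {}          # name -> expiry time in seconds
        self.upstream_queries = 0   # how often an auth had to be contacted

    def resolve_missing(self, name, now, soa_minimum=3600):
        """Resolve a name that does not exist, at time `now`."""
        expiry = self.negcache.get(name)
        if expiry is not None and now < expiry:
            return "negative answer from cache"
        self.upstream_queries += 1
        # The negative TTL is capped by the setting; a cap of 0 means the
        # answer is never stored.
        ttl = min(soa_minimum, self.max_negative_ttl)
        if ttl > 0:
            self.negcache[name] = now + ttl
        return "negative answer from auth"


def count_upstream(max_negative_ttl, client_queries=100):
    """One client query per second for the same non-existent name."""
    resolver = ToyResolver(max_negative_ttl)
    for second in range(client_queries):
        resolver.resolve_missing("nosuchhost.example.com", now=second)
    return resolver.upstream_queries


print(count_upstream(0))     # every client query goes upstream
print(count_upstream(3600))  # only the first one does
```

With the cap at 0 all 100 client queries hit the auth; with a cap of 3600 only the first one does.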
A common reason to dislike negative caching is (for a name in a locally
managed zone):
1. Query rec for a name and see that it does not exist (NODATA answer)
2. Modify the auth zone so the name exists
3. Query again and see that it still does not exist because of negative
caching in rec.
The answer to this is not to "disable negative caching". The proper
answer is: avoid the initial query, have some patience, or flush the
rec cache for that name by using rec_control or by sending rec a notify
(notifying rec is a relatively new feature and needs to be set up to
allow it, see
https://docs.powerdns.com/recursor/yamlsettings.html#incoming-allow-notify-from).
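
If the flush is part of a provisioning workflow, it can be scripted around `rec_control wipe-cache`. The wrapper below is a hypothetical sketch: the function names are made up, it assumes rec_control is on the PATH and may talk to the running recursor, and it relies on a trailing `$` on the name wiping the whole subtree.

```python
import subprocess


def wipe_cache_argv(name, subtree=False):
    """Build the rec_control command line that flushes `name` from the cache."""
    target = name + "$" if subtree else name
    return ["rec_control", "wipe-cache", target]


def wipe_cache(name, subtree=False):
    """Run the flush and return whatever rec_control prints."""
    result = subprocess.run(
        wipe_cache_argv(name, subtree),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling `wipe_cache("newhost.example.com")` right after adding the record to the auth zone removes the stale negative entry without touching the rest of the cache.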
>
> /etc/pdns-recursor/recursor.conf
>
> ---
> dnssec:
>   validation: validate
> incoming:
>   allow_from:
>     - 127.0.0.1/8
>     - 10.0.0.0/8
>     - 172.16.0.0/12
>     - 192.168.0.0/16
>     - 'fd00::/8'
>     - '2607:B600::/32'
>   listen:
>     - 0.0.0.0
>   max_tcp_clients: 128
>   max_tcp_per_client: 0
>   max_tcp_queries_per_connection: 0
>   port: 53
>   tcp_timeout: 2
> outgoing:
>   dont_query: []
>   max_qperq: 50
>   network_timeout: 1500
> packetcache:
>   max_entries: 1000000
> recordcache:
>   max_entries: 1000000
>   max_negative_ttl: 0
>   max_ttl: 86400
> recursor:
>   daemon: false
>   forward_zones:
>     - zone: momentumbusiness.com
>       recurse: false
>       forwarders:
>         - 10.255.255.76
>         - 10.1.3.228
>     - zone: 10.in-addr.arpa
>       recurse: false
>       forwarders:
>         - 10.0.196.197
>         - 10.0.62.244
>     - zone: 168.192.in-addr.arpa
>       recurse: false
>       forwarders:
>         - 10.0.196.197
>         - 10.0.62.244
>     - zone: 16.172.in-addr.arpa
>       recurse: false
>       forwarders:
>         - 10.0.196.197
>         - 10.0.62.244
>   lua_dns_script: /etc/pdns-recursor/momentumbusiness_com.lua
>   max_recursion_depth: 40
>   max_total_msec: 7000
>   minimum_ttl_override: 1
>   server_id: nsres01.momentumtelecom.com
>   setgid: pdns-recursor
>   setuid: pdns-recursor
> webservice:
>   address: 0.0.0.0
>   allow_from:
>     - 192.168.9.164
>     - 192.168.21.134
>     - 192.168.20.0/24
>   api_key: <sanitized>
>   port: 8080
>   webserver: true
> logging:
>   loglevel: 3
> ...
>
> /etc/pdns-recursor/momentumbusiness_com.lua
> pdnslog("Lua NXDomain filter for momentumbusiness.com loading...", pdns.loglevels.Notice)
> nxdomainsuffix=newDN("momentumbusiness.com")
> function nxdomain(dq)
>   if dq.qname:isPartOf(nxdomainsuffix)
>   then
>     dq.appliedPolicy.policyKind = pdns.policykinds.Drop
>     return true
>   end
>   return false
> end
I do wonder what the purpose of this special nxdomain handling is. A
drop is not nice to clients, as the query will time out from their
perspective. Maybe use pdns.policykinds.NODATA, or just leave the
special handling out?
>
> On Fri, Apr 18, 2025 at 9:39 AM Otto Moerbeek <otto at drijf.net> wrote:
>
> > On Fri, Apr 18, 2025 at 08:28:48AM -0400, Scott Crace via Pdns-users wrote:
> >
> > Hi,
> >
> > Please include your config. That said:
> >
> > You seem to have a pretty low cache hit ratio and a high number of
> > outgoing queries. How is your cache configured?
> >
> > Also some throttling is going on. I suspect rec has trouble contacting
> > one or more auths or forwarders. The throttling tables can be viewed
> > using
> >
> > rec_control dump-throttlemap -
> > rec_control dump-failedservers -
> >
> > Also, what happens *during* the trace can be very relevant. If one
> > auth (or forwarder) does not respond, rec will turn to another one,
> > but only after the timeout of 1500ms by default.
> >
> > -Otto
> >
> > > Hello all,
> > > Long time lurker on the message list and would like some performance
> > > and/or tuning advice.
> > > We've been using pdns-recursor as internal recursive nameservers for
> > > quite some time now.
> > > The original implementer of pdns departed and I was recently tasked with
> > > replacing or upgrading all of the servers with newer RHEL9 versions. I
> > > opted to build fresh and migrate the configuration to the latest 5.2
> > > release.
> > >
> > > I'm hearing occasional complaints about odd issues and/or clients cycling
> > > through their DNS servers rapidly (timeouts?). Manual DNS testing works,
> > > but I am reading through the metrics and performance documentation. I am
> > > hoping someone with a more experienced eye could take a look at a sampling
> > > of the periodic statistics report (below) and provide some insight or
> > > prioritization on any urgent issues I should focus on studying first.
> > >
> > > My observations:
> > > * I do note that the performance documentation talks about the impact of
> > > firewalld/stateful firewalls, but the legacy servers were using the same
> > > basic setup. If the firewall is the problem, is there a way to validate
> > > this (other than stopping firewalld and waiting)?
> > > * The "worker" threads seem evenly distributed to my novice eye and our
> > > qps (queries per second) rate is low, as I would expect since the name
> > > servers are internal-only resources.
> > > * I ran a few pcaps and rec_control trace-regex for specific domain items
> > > being reported as problematic. Everything seemed to be working with the
> > > trace-regex always showing "Step3 Final resolve: No Error/6 or 8".
> > >
> > > Thank you in advance for your time and consideration.
> > >
> > > Sincerely,
> > > Scotsie
> > >
> > > ```
> > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" cache-entries="23666" negcache-entries="497" questions="6831695" record-cache-acquired="286931329" record-cache-contended="64414" record-cache-contended-perc="0.02" record-cache-hitratio-perc="0.87"
> > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" packetcache-acquired="16887684" packetcache-contended="1019" packetcache-contended-perc="0.01" packetcache-entries="7112" packetcache-hitratio-perc="37.75"
> > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" edns-entries="38" failed-host-entries="50" non-resolving-nameserver-entries="0" nsspeed-entries="968" saved-parent-ns-sets-entries="65" throttle-entries="8"
> > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" concurrent-queries="1" dot-outqueries="0" idle-tcpout-connections="0" outgoing-timeouts="36594" outqueries="14668546" outqueries-per-query-perc="214.71" tcp-outqueries="3131" throttled-queries-perc="1.90"
> > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Periodic statistics report" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" taskqueue-expired="0" taskqueue-pushed="540" taskqueue-size="0"
> > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Queries handled by thread" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" count="3470098" thread="0" tname="worker"
> > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Queries handled by thread" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.170" count="3360836" thread="1" tname="worker"
> > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Queries handled by thread" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.171" count="764" thread="2" tname="tcpworker"
> > > Apr 17 16:07:28 nsrecdns01-1 pdns-recursor[1092]: msg="Periodic QPS report" subsystem="stats" level="0" prio="Info" tid="0" ts="1744920448.171" averagedOver="1800" qps="117"
> > > ```
> >