[Pdns-users] troubleshooting dnsdist -> recursor instability

Mon Oct 24 11:55:12 UTC 2022

Hi,

thanks for the pointers.

A coincidence may have brought us a bit closer to the root cause:
the server had to be rebooted for a Linux kernel update,
and after the reboot the problem disappeared.

Stacked graphs of
irate(pdns_recursor_sys_msec...
irate(pdns_recursor_user_msec...
show that the recursor's CPU usage increased steadily over the 4 weeks 
of uptime and dropped to 1/5 of that after the reboot. At that level 
dnsdist's health checks currently no longer fail.
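
For anyone reading the graphs: a minimal sketch of what irate() over such 
a counter amounts to, assuming the counter holds cumulative CPU 
milliseconds and is scraped with known timestamps. The function names and 
sample layout are made up for illustration; Prometheus does this 
computation itself.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_seconds, counter_value) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must be strictly increasing")
    if v1 < v0:
        # Counter went backwards (e.g. process restart): treat it as a reset
        # and count the increase from zero.
        v0 = 0
    return (v1 - v0) / (t1 - t0)


def cpu_cores(msec_per_second):
    """Convert CPU milliseconds consumed per second into a fraction of one core."""
    return msec_per_second / 1000.0
```

With a 15 second scrape interval, a counter that grew by 3000 msec gives 
200 msec/s, i.e. 0.2 of one core.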

My first idea was that the growing number of cache entries 
(pdns_recursor_cache_entries) might take more CPU time to search 
through, but the cache fills up in only 3 hours (not 4 weeks) and the 
graph stays flat after that.
The NSEC cache entries (pdns_recursor_aggressive_nsec_cache_entries)
also stay almost flat over this 4-week period while the recursor's 
CPU usage grows.
Memory usage (pdns_recursor_real_memory_usage) grows a lot more slowly than
the cache entries and correlates with CPU usage to some degree.

I also checked these metrics for correlations with the growing CPU usage 
but didn't find any:
negcache_entries
nsspeeds_entries
packetcache_entries
over_capacity_drops
query_pipe_full_drops
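
A correlation check like this can also be done numerically instead of by 
eyeballing graphs. A minimal sketch, assuming both series were exported at 
the same timestamps as plain lists of numbers (the function name is made 
up, and this is not how I did the check above):

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A completely flat series has no defined correlation; report 0.0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 means the metric grows along with CPU usage, a value near 0 
means no linear relationship, which is what a flat line next to a rising 
CPU graph would give.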

QPS (pdns_recursor_questions...)
slightly decreased during these 4 weeks.

Comparing two largely identical setups might give a hint:
the one with the growing CPU usage issue has these 2 lines 
that the other one does NOT have (that's the only 
difference):

loglevel=3
max-busy-dot-probes=5
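
In case someone wants to repeat this comparison on their own pair of 
servers, a minimal sketch of a settings diff. It assumes plain key=value 
lines with '#' comments and ignores things like include directories or 
'+=' lines; the function names are made up for illustration.

```python
def parse_settings(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def differing_settings(a_text, b_text):
    """Settings from the first config that are missing or different in the second."""
    a = parse_settings(a_text)
    b = parse_settings(b_text)
    return {key: value for key, value in a.items() if b.get(key) != value}
```

Feeding it the contents of both config files, with the affected server 
first, should return just the two settings shown above if they really are 
the only difference.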

The rate of pdns_recursor_dot_outqueries is not growing over time
on the one that has DoT probing enabled (and it is very low anyway, <5 qps).

If you also enabled DoT probing and are observing CPU usage growth over 
time, that would be interesting but unexpected to me.
https://blog.powerdns.com/2022/06/13/probing-dot-support-of-authoritative-servers-just-try-it/

Thomas Mieslinger via Pdns-users wrote:
> I'd use dnscap (tcpdump with a decent filter) on dnsdist and recursor
> machines. See if check query goes out from dnsdist, comes in to
> recursor, see if reply goes out from recursors, comes in to dnsdist.

Thanks, I will use dnscap with -x <custom health check QNAME> when it 
happens again.

> Review nftables config on all machines. Maybe someone of your team
> installed hashlimit magic to avoid overload.

No iptables/nftables rules are involved on the loopback interface that 
connects dnsdist with the recursor.

> Look for a metric which tells you whether you hit the "max in flight"
> limit. If you have long running queries (taking 1000ms in the recursor)
> the inflight limit can be reached quickly.

'answers-slow' is at about 5 qps.
I haven't found the "max in flight" metric yet.

The documentation on newServer() 
https://dnsdist.org/reference/config.html?highlight=newserver#newServer
does not mention the default value for:
checkInterval=NUM  -- The time in seconds between health checks

Is it one per second?

thanks!
Christoph