[Pdns-users] troubleshooting dnsdist -> recursor instability
Christoph
cm at appliedprivacy.net
Mon Oct 24 11:55:12 UTC 2022
Hi,
thanks for the pointers.
A coincidence may have brought us a bit closer to the root cause:
Due to a Linux kernel update the server had to be rebooted, and
after the reboot the problem disappeared.
Stacked graphs of
irate(pdns_recursor_sys_msec...
irate(pdns_recursor_user_msec...
show that the recursor's CPU usage increased steadily over the 4 weeks
of uptime and dropped to 1/5 after the reboot. At that level dnsdist's
health checks no longer fail.
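
For readers who want to reproduce the graphs outside Prometheus, here
is a minimal sketch of what irate() does with the two CPU counters. It
assumes both metrics are cumulative counters of CPU milliseconds; the
function names and the sample values are made up for illustration.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_seconds, counter) samples."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must be strictly increasing")
    delta = v1 - v0
    if delta < 0:
        # Counter went backwards, e.g. the process restarted:
        # treat the new value as the increase since the reset.
        delta = v1
    return delta / (t1 - t0)


def cpu_cores_used(sys_samples, user_samples):
    """Sum of sys and user CPU time, converted from ms per second to cores."""
    return (irate(sys_samples) + irate(user_samples)) / 1000.0


# Hypothetical scrapes 15 seconds apart (values are CPU milliseconds).
sys_samples = [(0, 1000), (15, 1300)]
user_samples = [(0, 5000), (15, 5900)]
print(cpu_cores_used(sys_samples, user_samples))
```

A result of 1.0 would mean one fully used core, so a value that creeps
up over weeks at roughly constant QPS is the pattern described above.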
My first idea was that the growing number of cache entries
(pdns_recursor_cache_entries) might take more CPU time to search
through, but it takes only 3 hours (not 4 weeks) to fill up the
cache, and the entry count remains a flat line after that.
Also, the number of NSEC cache entries
(pdns_recursor_aggressive_nsec_cache_entries) remains almost a flat
line over this 4-week period while the recursor's CPU usage grows.
Memory usage (pdns_recursor_real_memory_usage) grows much more slowly
than the cache entries and correlates with CPU usage to some degree.
I also checked these metrics for correlations with the growing CPU usage
but didn't find any:
negcache_entries
nsspeeds_entries
packetcache_entries
over_capacity_drops
query_pipe_full_drops
QPS (pdns_recursor_questions...)
slightly decreased during these 4 weeks.
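
One way to make such a correlation check less dependent on eyeballing
graphs is a plain Pearson coefficient over two exported time series.
This is only a sketch: the series below are invented, and it assumes
both metrics were sampled at the same timestamps.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A perfectly flat series carries no linear relationship.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily averages: CPU cores used vs. one candidate metric.
cpu = [0.10, 0.15, 0.21, 0.26, 0.33]
flat_metric = [1000, 1000, 1000, 1000, 1000]
print(pearson(cpu, flat_metric))
```

Values near 1 or -1 would point at a metric worth a closer look;
values near 0 match the "didn't find any" result above.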
Comparing two setups that are largely identical might give a hint:
the one with the growing CPU usage issue has these 2 lines that the
other one does NOT have (that's the only difference):
loglevel=3
max-busy-dot-probes=5
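
A comparison like this can be scripted. The sketch below assumes
simple key=value files with # comments (it does not handle the "+="
syntax or include directories), and the threads=4 line is a made-up
shared setting for illustration only.

```python
def parse_conf(text):
    """Parse simple key=value settings, ignoring blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def diff_conf(a, b):
    """Map each differing key to its (value_in_a, value_in_b) pair."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


affected = parse_conf("loglevel=3\nmax-busy-dot-probes=5\nthreads=4\n")
healthy = parse_conf("threads=4\n")
for key, (left, right) in diff_conf(affected, healthy).items():
    print(f"{key}: affected={left} healthy={right}")
```

None on one side means the setting is absent there and the built-in
default applies.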
The rate of pdns_recursor_dot_outqueries is not growing over time
on the one that has DoT probing enabled (and it is very low anyway,
<5 qps).
If you also enabled DoT probing and are observing CPU usage growth over
time, that would be interesting but unexpected to me.
https://blog.powerdns.com/2022/06/13/probing-dot-support-of-authoritative-servers-just-try-it/
Thomas Mieslinger via Pdns-users wrote:
> I'd use dnscap (tcpdump with a decent filter) on dnsdist and recursor
> machines. See if check query goes out from dnsdist, comes in to
> recursor, see if reply goes out from recursors, comes in to dnsdist.
Thanks, I will use dnscap with -x <custom health check QNAME> when it
happens again.
> Review nftables config on all machines. Maybe someone of your team
> installed hashlimit magic to avoid overload.
No iptables/nftables rules are involved on the loopback interface used
to connect dnsdist with the recursor.
> Look for a metric which tells you whether you hit the "max in flight"
> limit. If you have long running queries (taking 1000ms in the recursor)
> the inflight limit can be reached quickly.
'answers-slow' is at about 5 qps.
I haven't found the "max in flight" metric yet.
The documentation on newServer()
https://dnsdist.org/reference/config.html?highlight=newserver#newServer
does not mention the default value for:
checkInterval=NUM -- The time in seconds between health checks
Is it one per second?
thanks!
Christoph