[Pdns-users] CPU Usage Regression in Recursor 4.9.1?
Christoph
cm at appliedprivacy.net
Sun Sep 3 10:41:35 UTC 2023
Hello!
We are running two recursor + dnsdist servers on Debian 12.
We upgraded them recently, 5 days apart, and see a consistent pattern
across both servers after the upgrade: the recursor CPU usage and drop
rates [1] are significantly higher than usual and have kept increasing
since the upgrade.
Is anyone else seeing the same after upgrading to 4.9.1?
Timeline of the 4.9.0 to 4.9.1 upgrades:
2023-08-26 ~21:00 server A upgraded
2023-08-31 ~22:00 server B upgraded
Recursor drop rate graph as seen by the dnsdist instances [1]:
https://applied-privacy.net/files/tmp/recursor_4.9.1_drop_rate.png
The dotted light blue vertical lines show when the upgrades happened.
The blue and violet lines show drop rates on server A as seen by
different dnsdist instances after the upgrade on 2023-08-26.
The red and green lines show drop rates of server B as seen by
multiple dnsdist instances after the upgrade on 2023-08-31.
The graphs are also supported by these log entries:
Number of times grep finds
'Timeout while waiting for the health check response from backend
127.0.0.1:54'
per day.
The logfiles cover this week: 2023-08-27 00:00 - 2023-09-03 00:00
Unfortunately we do not have
https://github.com/PowerDNS/pdns/pull/13009
deployed yet, but we are really looking forward to dnsdist 1.9,
and maybe we will test your master repo just to get these metrics earlier :)
Server A (upgraded on 2023-08-26 ~21:00 - no restarts):
375 2023-08-27
826 2023-08-28
2690 2023-08-29
3041 2023-08-30
4608 2023-08-31
6595 2023-09-01
8047 2023-09-02
Server B (upgraded on 2023-08-31 ~22:00 - no restarts):
63 2023-08-27
51 2023-08-28
90 2023-08-29
110 2023-08-30
54 2023-08-31
349 2023-09-01
757 2023-09-02
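
For anyone who wants to reproduce per-day counts like the ones above,
here is a minimal Python sketch. It assumes that each matching dnsdist
log line contains an ISO date (YYYY-MM-DD); that is an assumption about
the log format and may need adjusting (e.g. for classic syslog
timestamps). The function name is hypothetical.

```python
import re
from collections import Counter

# Message text as quoted above (a single line in the log).
PATTERN = ("Timeout while waiting for the health check response "
           "from backend 127.0.0.1:54")

# Assumption: every log line carries an ISO date such as 2023-09-02.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def count_timeouts_per_day(lines):
    """Return {date: number of matching log lines}, sorted by date."""
    counts = Counter()
    for line in lines:
        if PATTERN not in line:
            continue
        match = DATE_RE.search(line)
        if match:  # lines without a recognisable date are skipped
            counts[match.group(0)] += 1
    return dict(sorted(counts.items()))
```

It can be fed an open log file directly, e.g.
count_timeouts_per_day(open(path)), where path is wherever your
dnsdist log lives.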
We also have graphs for various recursor metrics and they
show that the affected recursor servers get fewer queries over time (down
to 1/4) because dnsdist gives them fewer queries and directs the queries
to other resolvers instead. This is supported by looking at
dnsdist_server_queries graphs.
We have not tried to downgrade to 4.9.0 yet to see if that solves the
issue, but we might do so soon. We stay on 4.9.1 - at least on one
server - for now so we can help get to the root cause of this in case
you want us to perform any debugging steps.
I just restarted one of the affected recursors after ~7 days of uptime and
that also appears to help a lot. The problem appears to increase with
uptime, but cache size does not correlate with the drop rate.
The only graph that correlates with the increased drop rate is recursor
CPU usage, which is higher on 4.9.1:
irate(pdns_recursor_sys_msec[$__rate_interval])
Recursor CPU usage already increased with uptime before 4.9.1, but it
reached a plateau after about 10 days of uptime. 4.9.1 reached that CPU
usage level in just 3 days and continues to increase.
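
To make the CPU graph expression easier to interpret: irate() looks
only at the last two samples in the range and returns the per-second
increase. Since pdns_recursor_sys_msec counts milliseconds of CPU time
spent in system mode, the result is "CPU milliseconds per second".
Here is a rough Python sketch of that arithmetic; the function names
and sample values are made up for illustration, and unlike Prometheus
it does not handle counter resets.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_s, value) samples."""
    (t0, v0), (t1, v1) = samples[-2:]
    return (v1 - v0) / (t1 - t0)


def core_percent(msec_per_second):
    """Convert CPU milliseconds per second into percent of one core."""
    return msec_per_second / 1000.0 * 100.0


# Hypothetical scrapes of the sys_msec counter, 15 seconds apart.
samples = [(0, 1000), (15, 1300), (30, 1900)]
rate = irate(samples)        # uses only the last two samples
usage = core_percent(rate)   # share of one core spent in system mode
```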
If it is relevant, we can also paste our recursor.conf file.
best regards,
Christoph
[1]
(rate(dnsdist_server_drops[$__rate_interval]) /
rate(dnsdist_server_queries[$__rate_interval])) * 100
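
For readers not used to PromQL, the drop rate in [1] boils down to the
following arithmetic on two counter samples. This is a simplified
Python sketch with a hypothetical function name; Prometheus' rate()
additionally handles counter resets and extrapolation, which are
ignored here.

```python
def drop_rate_percent(drops_start, drops_end,
                      queries_start, queries_end, interval_s):
    """Percentage of queries dropped by a backend over one interval."""
    drop_rate = (drops_end - drops_start) / interval_s
    query_rate = (queries_end - queries_start) / interval_s
    if query_rate == 0:
        return 0.0  # no queries sent in this interval, nothing to drop
    return drop_rate / query_rate * 100.0
```

For example, 60 new drops against 3000 new queries within the same
interval give a drop rate of 2%.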