[Pdns-users] CPU Usage Regression in Recursor 4.9.1?

Sun Sep 3 10:41:35 UTC 2023

Hello!

We are running two recursor + dnsdist servers on Debian 12.
We upgraded them recently, 5 days apart, and see a consistent pattern
across both servers after the upgrade: the recursor CPU usage and drop
rates [1] are significantly higher than usual and have kept increasing
since the upgrade.

Is anyone else seeing the same after upgrading to 4.9.1?

Timeline of the 4.9.0 to 4.9.1 upgrades:

2023-08-26 ~21:00 server A upgraded
2023-08-31 ~22:00 server B upgraded

Recursor drop rate graph as seen by the dnsdist instances [1]:
https://applied-privacy.net/files/tmp/recursor_4.9.1_drop_rate.png

The dotted light blue vertical lines show when the upgrades happened.
The blue and violet lines show drop rates on server A as seen by
different dnsdist instances after the upgrade on 2023-08-26.
The red and green lines show drop rates of server B as seen by
multiple dnsdist instances after the upgrade on 2023-08-31.

The graphs are also supported by log entries. The tables below show the
number of lines per day on which grep finds
'Timeout while waiting for the health check response from backend 127.0.0.1:54'.
The log files cover this week: 2023-08-27 00:00 - 2023-09-03 00:00

Unfortunately we do not have
https://github.com/PowerDNS/pdns/pull/13009
deployed yet, but we are really looking forward to dnsdist 1.9,
and maybe we will test your master repo just to get these metrics earlier :)

Server A (upgraded on 2023-08-26 ~21:00 - no restarts):

      375 2023-08-27
      826 2023-08-28
     2690 2023-08-29
     3041 2023-08-30
     4608 2023-08-31
     6595 2023-09-01
     8047 2023-09-02

Server B (upgraded on 2023-08-31 ~22:00 - no restarts):

       63 2023-08-27
       51 2023-08-28
       90 2023-08-29
      110 2023-08-30
       54 2023-08-31
      349 2023-09-01
      757 2023-09-02
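
For anyone who wants to produce comparable per-day counts from their own
dnsdist logs, here is a minimal sketch in Python. It assumes that every log
line carries an ISO date (YYYY-MM-DD) before the message, as for example
`journalctl -o short-iso` prints; that log format and the function names are
assumptions for illustration, not a description of our exact setup.

```python
import re
from collections import Counter

# Message text as quoted in this mail.
NEEDLE = ("Timeout while waiting for the health check response "
          "from backend 127.0.0.1:54")

# Assumption: the first YYYY-MM-DD on a line is the date of the log entry.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def count_timeouts_per_day(lines):
    """Return a Counter mapping 'YYYY-MM-DD' to the number of matching lines."""
    counts = Counter()
    for line in lines:
        if NEEDLE not in line:
            continue
        match = DATE_RE.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts


def format_counts(counts):
    """Format as 'count date' rows, ordered by date."""
    return [f"{n:9d} {day}" for day, n in sorted(counts.items())]
```

Feeding it an open log file (`count_timeouts_per_day(open(path))`, with
`path` being whatever file holds the dnsdist log) and printing the rows from
`format_counts` gives a table in the same shape as the ones above.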

We also have graphs for various recursor metrics, and they show that
the affected recursor servers receive fewer queries over time (down to
1/4) because dnsdist sends them fewer queries and directs the queries to
other resolvers instead. This is supported by the
dnsdist_server_queries graphs.

We have not tried to downgrade to 4.9.0 yet to see if that solves the
issue, but we might do so soon. For now we will stay on 4.9.1, at least
on one server, so we can help get to the root cause of this in case
you want us to perform any debugging steps.

I just restarted one of the affected recursors after ~7 days of uptime,
and that appears to help a lot. The problem appears to grow with
uptime, but cache size does not correlate with the drop rate.

The only graph that correlates with the increased drop rate is recursor 
CPU usage, which is higher on 4.9.1:
irate(pdns_recursor_sys_msec[$__rate_interval])
Recursor CPU usage already increased with uptime before 4.9.1, but it
reached a plateau after about 10 days of uptime. 4.9.1 reached that CPU
usage level in just 3 days and continues to increase.
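
To make clear what the CPU graph shows: pdns_recursor_sys_msec is a counter
of CPU milliseconds, and irate() turns the last two samples into a
per-second rate. The sketch below mirrors that calculation in Python under
simplifying assumptions (no counter resets, hypothetical function names); it
is not how Prometheus implements it internally.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_seconds, value) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


def cpu_percent_of_one_core(samples):
    """Convert a CPU-milliseconds counter into percent of one core.

    1000 ms of CPU time per wall-clock second equals one fully used core,
    so milliseconds per second divided by 10 gives percent of one core.
    """
    return irate(samples) / 10.0
```

With samples taken 15 seconds apart, a counter that grows by 3000 ms between
the last two scrapes corresponds to 200 ms of CPU time per second.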

If it is relevant, we can also paste our recursor.conf file.

best regards,
Christoph

[1]
(rate(dnsdist_server_drops[$__rate_interval]) / 
rate(dnsdist_server_queries[$__rate_interval])) * 100
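
As a rough illustration of what expression [1] computes, here is a small
Python sketch that derives the drop percentage from two samples of each
counter. It is a simplification: Prometheus rate() also handles counter
resets and extrapolates over the window, which this sketch does not, and the
function names are made up for the example.

```python
def counter_rate(first, last):
    """Per-second rate between two (timestamp_seconds, counter_value) samples."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


def drop_percent(drops_first, drops_last, queries_first, queries_last):
    """Drops as a percentage of queries sent to the backend in the window."""
    query_rate = counter_rate(queries_first, queries_last)
    if query_rate == 0:
        # No queries in the window: report 0 instead of dividing by zero.
        return 0.0
    return counter_rate(drops_first, drops_last) / query_rate * 100
```

For example, 30 additional drops and 480 additional queries over the same
60 seconds give 0.5 drops/s against 8 queries/s.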