[Pdns-users] CPU Usage Regression in Recursor 4.9.1?

Otto Moerbeek otto at drijf.net
Sun Sep 3 13:31:25 UTC 2023


Hello,

I have no clue yet what could cause this. Looking through the
changelog between 4.9.0 and 4.9.1, I do not see any changes that
would be expected to affect CPU usage.

I would suggest you do the planned revert to 4.9.0 for at least one
recursor to see if the problem disappears.

Please post your recursor.conf and lua config file if used.

Thanks,

	-Otto

On Sun, Sep 03, 2023 at 12:41:35PM +0200, Christoph via Pdns-users wrote:

> Hello!
> 
> We are running two recursor + dnsdist servers on Debian 12.
> We upgraded them recently, 5 days apart, and see a consistent pattern
> across both servers after the upgrade: The recursor CPU usage and drop rates
> [1] are significantly higher than usual and are increasing since the
> upgrade.
> 
> Is anyone else seeing the same after upgrading to 4.9.1?
> 
> Timeline of the 4.9.0 to 4.9.1 upgrades:
> 
> 2023-08-26 ~21:00 server A upgraded
> 2023-08-31 ~22:00 server B upgraded
> 
> recursor drop rate graph as seen by dnsdists [1]
> https://applied-privacy.net/files/tmp/recursor_4.9.1_drop_rate.png
> 
> The dotted light blue vertical lines show when the upgrades happened.
> The blue and violet lines show drop rates on server A as seen by
> different dnsdist instances after the upgrade on 2023-08-26.
> The red and green lines show drop rates of server B as seen by
> multiple dnsdist instances after the upgrade on 2023-08-31.
> 
> The graphs are also supported by these log entries:
> 
> Number of times grep finds
> 'Timeout while waiting for the health check response from backend
> 127.0.0.1:54'
> per day.
> The logfiles cover this week: 2023-08-27 00:00 - 2023-09-03 00:00
> 
> Unfortunately, we do not have
> https://github.com/PowerDNS/pdns/pull/13009
> deployed yet, but we are really looking forward to dnsdist 1.9
> and maybe we will test your master repo just to get these metrics earlier :)
> 
> Server A (upgraded on 2023-08-26 ~21:00 - no restarts):
> 
>      375 2023-08-27
>      826 2023-08-28
>     2690 2023-08-29
>     3041 2023-08-30
>     4608 2023-08-31
>     6595 2023-09-01
>     8047 2023-09-02
> 
> Server B (upgraded on 2023-08-31 ~22:00 - no restarts):
> 
>       63 2023-08-27
>       51 2023-08-28
>       90 2023-08-29
>      110 2023-08-30
>       54 2023-08-31
>      349 2023-09-01
>      757 2023-09-02
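
The per-day counts above can be reproduced with a short script. This is only a minimal sketch: it assumes that every log line starts with an ISO date (YYYY-MM-DD), which may not match your log format, and the function name count_timeouts_per_day is made up for illustration.

```python
import re
from collections import Counter

# Message text as quoted in the report above.
NEEDLE = ("Timeout while waiting for the health check response "
          "from backend 127.0.0.1:54")

# Assumption: each log line begins with an ISO date such as 2023-08-27.
DATE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})")


def count_timeouts_per_day(lines):
    """Return a Counter mapping 'YYYY-MM-DD' to the number of matching lines."""
    counts = Counter()
    for line in lines:
        if NEEDLE not in line:
            continue
        match = DATE_RE.match(line)
        if match:  # lines without a leading date are skipped
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file (or any iterable of lines) to count_timeouts_per_day and printing the sorted items gives a table like the ones above. Like a plain grep, the substring match would also count backends whose port merely starts with 54.
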
> 
> We also have graphs for various recursor metrics, and they
> show that the affected recursor servers get fewer queries over time (down to
> 1/4) because dnsdist sends them fewer queries and directs the rest to
> other resolvers instead. This is supported by the
> dnsdist_server_queries graphs.
> 
> We have not tried to downgrade to 4.9.0 yet to see if that solves the issue,
> but we might do so soon. We stay on 4.9.1 - at least on one server - for now
> so we can help get to the root cause of this in case you want us to perform
> any debugging steps.
> 
> I just restarted one of the affected recursors after ~7 days of uptime, and
> that also appears to help a lot. The problem appears to increase with uptime,
> but cache size does not correlate with the drop rate.
> 
> The only graph that correlates with the increased drop rate is recursor CPU
> usage, which is higher on 4.9.1:
> irate(pdns_recursor_sys_msec[$__rate_interval])
> Recursor CPU usage already increased with uptime before 4.9.1, but it reached
> a plateau after about 10 days of uptime. 4.9.1 reached that CPU usage level in
> just 3 days and continues to increase.
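
For readers unfamiliar with irate(): it derives a per-second rate from the last two samples of a counter. The sketch below mimics that in simplified form. It assumes the counter accumulates CPU milliseconds, as the metric name suggests; the sample layout and both function names are illustrative and not part of any PowerDNS or Prometheus API.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_seconds, value) samples.

    Returns None if there are fewer than two samples or the timestamps
    do not advance.
    """
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        return None
    delta = v_last - v_prev
    if delta < 0:
        # Counter went backwards, e.g. after a restart: count from zero.
        delta = v_last
    return delta / (t_last - t_prev)


def cpu_percent_of_one_core(msec_per_second):
    """1000 CPU msec per wall-clock second equals 100% of one core."""
    return msec_per_second / 10.0
```

For example, samples of 1300 and 1750 msec taken 15 seconds apart give 30 msec/s, which is about 3% of one core.
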
> 
> If it is relevant, we can also paste our recursor.conf file.
> 
> best regards,
> Christoph
> 
> [1]
> (rate(dnsdist_server_drops[$__rate_interval]) /
> rate(dnsdist_server_queries[$__rate_interval])) * 100
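
The expression in [1] boils down to drops divided by queries over the same window, times 100. A minimal sketch of that arithmetic on raw counter readings follows; it ignores counter resets and the extrapolation that rate() performs, and the function name and arguments are hypothetical.

```python
def drop_rate_percent(drops_start, drops_end, queries_start, queries_end):
    """Percentage of queries dropped between two readings of both counters.

    Both counters must be read at the same two points in time, so the
    ratio of the deltas equals the ratio of the per-second rates.
    Returns None if no queries were sent in the window.
    """
    d_drops = drops_end - drops_start
    d_queries = queries_end - queries_start
    if d_queries <= 0:
        return None
    return 100.0 * d_drops / d_queries
```

For example, 30 drops out of 3000 queries in a window is a 1% drop rate.
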