[dnsdist] dnsdist 1.7.4 Debian Bullseye vs 1.8.4 Bullseye

Aleš Rygl ales at rygl.net
Thu Oct 5 08:41:55 UTC 2023


Hi Remi,

On 02. 10. 23 13:53, Remi Gacogne via dnsdist wrote:
> Hi Ales,
>
> On 25/09/2023 16:09, Aleš Rygl via dnsdist wrote:
>>>     I would like to kindly ask for help or advice. I have just 
>>> upgraded one of our dnsdist instances from 1.7.4 to 1.8.4 together 
>>> with an OS upgrade (Debian 11.7 to 12.1). Everything works fine, no 
>>> issues observed apart from some deprecated config references. What 
>>> is a big surprise to me is CPU usage. The newer version has nearly 
>>> two times higher CPU consumption in userspace. I am nearly at 80% 
>>> CPU with 16 physical cores (was about 40%). We have a lot of TLS 
>>> (DoT) sessions (30k) and 60kqps in total (30k via DoT) here. The 
>>> latency measured by dnsdist went up as well. We are collecting all 
>>> the metrics dnsdist produces via graphite, so I can check the 
>>> counters to see what could be wrong.
>
> Wow, that's awful. It's the first time I hear about such a regression, 
> and I really would like to understand what is going on.
> 1/ Are you using our packages, compiling yourself, or perhaps using 
> the Debian ones?
> 2/ Do you think it would be possible for you to try downgrading the 
> instance to 1.7.4 on Debian 12.1? It might help us pinpoint whether 
> the issue is related to a system change (I have seen people complain 
> about the performance of OpenSSL 3.0.x compared to 1.1.1x, for example).
> 3/ Would you mind sharing your configuration?
> 4/ And finally, do you think it would be possible for you to collect a 
> perf trace on this instance? It would require installing linux-perf, 
> if possible the debug symbols for dnsdist (dnsdist-dbgsym) then 
> running 'perf record --call-graph dwarf -p <pid of running dnsdist 
> process> -o </path/to/output/file>' for a few dozen seconds to 
> collect a trace, stopping it with Ctrl+C and finally getting a report 
> with "perf report -i </path/to/previous/file> --stdio". It should tell 
> us where the CPU usage is going.
>
> Best regards,
>
Thanks for your response. After some deep reading of the documentation 
and config tweaking, CPU load is nearly back to the previous values. 
Latency is still higher (1.3 ms -> 2.3 ms), but I suspect it is now 
computed differently (I noticed a new set of latency counters for TLS, 
TCP, etc.). The key configuration parameter is 
setMaxTCPClientThreads(). Changing anything else (cache shards, number 
of listeners, etc.) has nearly no impact. We had 256 with 1.7.4; now it 
is 16. Going higher means a rapid increase in CPU load, while going 
below 16 means dropped TCP connections, visible in showTCPStats() as 
Queued hitting Max Queued. Insane values like 1024 kill the CPU. We 
have a physical server with 16 physical cores; the OS sees 32 cores.
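
For reference, a minimal sketch of the relevant line in the dnsdist 
configuration (which is Lua). Only the value 16 comes from the testing 
described above; treat it as a starting point for a host with 16 
physical cores, not as a general recommendation:

```lua
-- dnsdist.conf fragment (sketch, value tuned for this one host only).
-- Cap the number of TCP/DoT client worker threads.
-- 256 was the value used with 1.7.4; 16 matches the physical core count.
setMaxTCPClientThreads(16)
```

After a restart, showTCPStats() on the console shows whether Queued 
still reaches Max Queued, which is the sign that the value is too low.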

Back to your questions:

1/ from your repos
2/ yes, I could try it; the thing is that 1.7.4 for Bullseye crashes on 
Bookworm with TLS enabled and there are no packages of 1.7.4 for 
Bookworm in your repo
3/ sure, I will do so
4/ no problem

Best regards

Ales






