[Pdns-users] public DoH/DoT dnsdist 1.9.8 exited on signal 11

Sat Feb 1 10:23:27 UTC 2025

Hello Christoph,

My two cents: because crashes with dnsdist are really very unusual, I wouldn't rule out hardware errors, especially since we've had that happen ourselves. In our case, bad memory turned out to be the cause of rare crashes. So I would recommend rebooting the server a few times, checking the management adapter for errors, and running a memory check. That isn't much effort, and if it does turn something up, it saves everyone involved a tremendous amount of time.

Winfried 

On 1 February 2025 10:06:35 CET, Christoph via Pdns-users <pdns-users at mailman.powerdns.com> wrote:
>Hi Remi,
>
>Remi Gacogne wrote:
>> I'm not aware of any bug in 1.9.8 that could cause a crash, no. It's hard to narrow it down with the information you have, unfortunately. Did you upgrade recently from a previous version of dnsdist (in which case I can look at the diff since the previous version)? Or perhaps did you change something in the configuration? I don't see anything out of the ordinary in your configuration.
>
>It turns out that we had more crashes before, but we did not notice them at the time because we only recently introduced monitoring for dnsdist restarts.
>Since the initial email we have also seen one more crash.
>
>preliminary hypothesis
>----------------------
>Starting with dnsdist 1.9.8, this issue seems more likely to be triggered
>when dnsdist distributes queries to multiple servers,
>but not when dnsdist distributes queries to a single resolver only.
>
>We expect to see more crashes, and maybe we will manage to get hold of a core dump file.
>
>Unfortunately we cannot really do A/B tests to back up this hypothesis because we only have a single endpoint with actual user queries, but maybe you have an environment in which you can confirm it.
>
>
>We used the information below to come to that hypothesis.
>
>main server (bender)
>--------------------
>This is the server that gets all user queries under normal conditions.
>When it goes down for a reboot (kernel updates, ...) or when we do software updates, we fail over to the second server (titanius) using CARP.
>
>Timeline of events:
>
>Jan  4 12:28:17 kernel: carp: MASTER -> BACKUP (user requested via ifconfig)
>Jan  4 12:30:57 pkg[39283]: dnsdist upgraded: 1.9.7 -> 1.9.8
>Jan  4 12:58:44 reboot
>Jan  4 13:09:15 kernel: carp: BACKUP -> MASTER
>*Jan 25 22:45 config change
>Jan 29 22:48:09 kernel: pid 75804 (dnsdist), jid 0, uid 208: exited on signal 11 (no core dump - bad address)
>Jan 30 14:47:25 kernel: pid 69716 (dnsdist), jid 0, uid 208: exited on signal 11 (no core dump - bad address)
>
>
>*)
>From 2025-01-04 to 2025-01-25 dnsdist sent queries to a single resolver only, because one of the two resolvers was down.
>config change:
>At that timestamp we changed the configuration (added more resolvers), so queries are distributed to more than one server again. The first crash happened 4 days later.
>
>
>failover server (titanius)
>----------------------------
>
>Jan  2 00:27:44 pkg[80907]: dnsdist upgraded: 1.9.7 -> 1.9.8
>**Jan 12 19:30:37 kernel: carp: BACKUP -> MASTER
>Jan 13 06:21:56 kernel: pid 52144 (dnsdist), jid 0, uid 208: exited on signal 11 (no core dump - bad address)
>Jan 13 22:40:27 kernel: pid 48868 (dnsdist), jid 0, uid 208: exited on signal 11 (no core dump - bad address)
>Jan 14 05:45:15 kernel: pid 36766 (dnsdist), jid 0, uid 208: exited on signal 11 (no core dump - bad address)
>
>**)
>Even though the dnsdist update already happened on 2025-01-02, this server only got actual user queries starting on 2025-01-12, because it is a failover server only.
>So on this server the first crash happened within 12 hours of it receiving actual user queries on version 1.9.8.
>This server always had two recursors in the dnsdist config to send queries to and both were up.
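To put numbers on the two timelines quoted above, here is a rough Python sketch that pulls the "exited on signal 11" lines out of a log excerpt and reports how many hours passed between a chosen starting point and each crash. The function names (parse_crashes, hours_since) are made up for this illustration, and the year is passed in by hand because the syslog lines do not carry one. Treat it as a quick helper under those assumptions, not as a finished tool.

```python
import re
from datetime import datetime

# Matches kernel messages of the form quoted in this thread, e.g.
#   Jan 13 06:21:56 kernel: pid 52144 (dnsdist), jid 0, uid 208: exited on signal 11 (...)
CRASH_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) kernel: "
    r"pid (?P<pid>\d+) \(dnsdist\).*exited on signal (?P<sig>\d+)"
)


def parse_crashes(lines, year=2025):
    """Return the timestamps of all dnsdist crash lines found in `lines`.

    The year is an assumption supplied by the caller, since syslog
    timestamps omit it.
    """
    crashes = []
    for line in lines:
        # Drop mail quoting ('>') and the '*' footnote markers, if present.
        m = CRASH_RE.match(line.lstrip("> *"))
        if m:
            ts = " ".join(m.group("ts").split())  # "Jan  4" -> "Jan 4"
            crashes.append(datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S"))
    return crashes


def hours_since(start, crashes):
    """Hours elapsed from `start` to each crash, rounded to one decimal."""
    return [round((c - start).total_seconds() / 3600, 1) for c in crashes]


if __name__ == "__main__":
    titanius_log = [
        ">**Jan 12 19:30:37 kernel: carp: BACKUP -> MASTER",
        ">Jan 13 06:21:56 kernel: pid 52144 (dnsdist), jid 0, uid 208: exited on signal 11 (no core dump - bad address)",
        ">Jan 13 22:40:27 kernel: pid 48868 (dnsdist), jid 0, uid 208: exited on signal 11 (no core dump - bad address)",
        ">Jan 14 05:45:15 kernel: pid 36766 (dnsdist), jid 0, uid 208: exited on signal 11 (no core dump - bad address)",
    ]
    # Starting point: the moment titanius became CARP MASTER.
    became_master = datetime(2025, 1, 12, 19, 30, 37)
    print(hours_since(became_master, parse_crashes(titanius_log)))
```

For titanius the first value comes out just under 11 hours, which matches the "within 12 hours" statement. For bender the same functions can be used with the config change (Jan 25 22:45) as the starting point instead of the CARP transition.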
>
>
>best regards,
>Christoph