[Pdns-users] public DoH/DoT dnsdist 1.9.8 exited on signal 11
Christoph
cm at appliedprivacy.net
Sat Feb 1 09:06:35 UTC 2025
Hi Remi,
Remi Gacogne wrote:
> I'm not aware of any bug in 1.9.8 that could cause a crash, no. It's
> hard to narrow it down with the information you have, unfortunately. Did
> you upgrade recently from a previous version of dnsdist (in which case I
> can look at the diff since the previous version)? Or perhaps did you
> change something in the configuration? I don't see anything out of the
> ordinary in your configuration.
It turns out that we had more crashes before, but we did not notice them
at the time because we only recently introduced monitoring for dnsdist
restarts.
Since the initial email we have seen one more crash.
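For anyone who wants to check their own logs for the same symptom, here
is a minimal sketch of such a scan. It is not necessarily how our
monitoring works; the function name find_crashes and the regular
expression are illustrative assumptions, modelled on the kernel log
lines quoted later in this mail.

```python
import re

# Pattern modelled on the FreeBSD kernel messages quoted in this mail, e.g.
# "kernel: pid 75804 (dnsdist), jid 0, uid 208: exited on signal 11 (...)".
# \s+ tolerates a line wrap between "exited on" and "signal".
CRASH_RE = re.compile(
    r"pid (?P<pid>\d+) \((?P<proc>[^)]+)\), jid \d+, uid \d+: "
    r"exited on\s+signal (?P<signal>\d+)"
)


def find_crashes(log_text, process="dnsdist"):
    """Return (pid, signal) pairs for every signal exit of `process`."""
    crashes = []
    for match in CRASH_RE.finditer(log_text):
        if match.group("proc") == process:
            crashes.append((int(match.group("pid")), int(match.group("signal"))))
    return crashes


# Sample input: the two bender crash lines from this mail, unwrapped.
sample = (
    "Jan 29 22:48:09 kernel: pid 75804 (dnsdist), jid 0, uid 208: "
    "exited on signal 11 (no core dump - bad address)\n"
    "Jan 30 14:47:25 kernel: pid 69716 (dnsdist), jid 0, uid 208: "
    "exited on signal 11 (no core dump - bad address)\n"
)

if __name__ == "__main__":
    print(find_crashes(sample))
```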
preliminary hypothesis
----------------------
This issue appears to be triggered starting with dnsdist 1.9.8
when dnsdist distributes queries to multiple resolvers,
but not when it sends queries to a single resolver only.
We expect to see more crashes, and maybe we will manage to get hold of a
core dump file.
Unfortunately we cannot really run A/B tests to back up this hypothesis
because we only have a single endpoint with actual user queries, but
maybe you have an environment in which to confirm it.
We used the information below to come to that hypothesis.
main server (bender)
--------------------
This is the server that gets all user queries under normal conditions.
When it goes down for reboot (kernel updates, ...) or when we do
software updates we do a failover to the second server (titanius) using
CARP.
Timeline of events:
Jan 4 12:28:17 kernel: carp: MASTER -> BACKUP (user requested via ifconfig)
Jan 4 12:30:57 pkg[39283]: dnsdist upgraded: 1.9.7 -> 1.9.8
Jan 4 12:58:44 reboot
Jan 4 13:09:15 kernel: carp: BACKUP -> MASTER
*Jan 25 22:45 config change
Jan 29 22:48:09 kernel: pid 75804 (dnsdist), jid 0, uid 208: exited on
signal 11 (no core dump - bad address)
Jan 30 14:47:25 kernel: pid 69716 (dnsdist), jid 0, uid 208: exited on
signal 11 (no core dump - bad address)
*)
Between 2025-01-04 and 2025-01-25 dnsdist sent queries to a single
resolver only, because one of the two resolvers was down.
config change:
At that time we changed the configuration (added more resolvers) so that
we distribute queries to more than one server again. The first crash on
this server happened 4 days later.
failover server (titanius)
--------------------------
Jan 2 00:27:44 pkg[80907]: dnsdist upgraded: 1.9.7 -> 1.9.8
**Jan 12 19:30:37 kernel: carp: BACKUP -> MASTER
Jan 13 06:21:56 kernel: pid 52144 (dnsdist), jid 0, uid 208: exited on
signal 11 (no core dump - bad address)
Jan 13 22:40:27 kernel: pid 48868 (dnsdist), jid 0, uid 208: exited on
signal 11 (no core dump - bad address)
Jan 14 05:45:15 kernel: pid 36766 (dnsdist), jid 0, uid 208: exited on
signal 11 (no core dump - bad address)
**)
Even though the dnsdist update had already happened on 2025-01-02, this
server only started receiving actual user queries on 2025-01-12 because
it is only a failover server.
So on this server the first crash happened within 12 hours of it
receiving actual user queries on version 1.9.8.
This server always had two recursors configured in dnsdist, and both
were up.
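As a cross-check of the two intervals mentioned above, here is a small
sketch that recomputes them from the log timestamps. The helper name
elapsed is hypothetical, the year 2025 is taken from the dates in this
mail, and the config change time is assumed to be 22:45:00 because we
only noted it to the minute.

```python
from datetime import datetime


def elapsed(start, end, year=2025):
    """Time between two syslog-style timestamps ("Jan 12 19:30:37")."""
    fmt = "%Y %b %d %H:%M:%S"
    t0 = datetime.strptime(f"{year} {start}", fmt)
    t1 = datetime.strptime(f"{year} {end}", fmt)
    return t1 - t0


# bender: config change (seconds assumed to be :00) -> first crash
bender = elapsed("Jan 25 22:45:00", "Jan 29 22:48:09")

# titanius: CARP BACKUP -> MASTER (first user queries) -> first crash
titanius = elapsed("Jan 12 19:30:37", "Jan 13 06:21:56")

if __name__ == "__main__":
    print("bender:  ", bender)
    print("titanius:", titanius)
```

This gives roughly 4 days for bender and just under 11 hours for
titanius, which matches the "4 days" and "within 12 hours" figures
above.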
best regards,
Christoph