[dnsdist] Latency based routing

Thu Jun 26 11:02:23 UTC 2025

Hello,

I have two questions for the dnsdist users here.

By default, dnsdist uses the leastOutstanding load-balancing policy, which in certain cases takes the lowest measured latency into account, based on the last 128 queries answered by the downstream.

My first (somewhat simple) question: is there a way to make health checks count towards the latency measurements? Currently the health check queries don't seem to be taken into account in the latency metric. While I understand not everyone may want this, I wonder if there's some way (even if it takes custom Lua) to make that happen.

My second question is more about a custom policy in Lua.
Since purely latency-based load balancing isn't currently a built-in policy, it can be implemented in Lua, so that the selected downstream server is the lowest-latency server that is online.
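
To make the idea concrete, here is a rough sketch of the kind of policy I mean. It is untested as written: the policy name is made up, and falling back to leastOutstanding when no server is up is only my assumption of a sensible default.

```lua
-- Sketch only: pick the online server with the lowest reported latency.
-- "lowestLatency" is a hypothetical name chosen for this example.
local function lowestLatency(servers, dq)
  local best, bestLatency = nil, math.huge
  for _, server in ipairs(servers) do
    if server:isUp() then
      local latency = server:getLatency()
      if latency < bestLatency then
        best, bestLatency = server, latency
      end
    end
  end
  if best == nil then
    -- Nothing is up: defer to the built-in policy (my assumption, not a requirement).
    return leastOutstanding.policy(servers, dq)
  end
  return best
end

setServerPolicyLua("lowestLatency", lowestLatency)
```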

This can be done by looping over the available servers, checking whether each one is up using :isUp(), and then using :getLatency() to find the one with the lowest latency. This works great most of the time, however:
1: If dnsdist restarts, the reported latency across all nodes will be very low, because it seems to use a fixed-size list where every "empty" value is `0`. As a result, when the average is calculated across 128 values (many of which are still zero initially), the routing decisions can be wrong until the list has filled up.

I wonder if there's a way to get (from Lua) the number of queries sent to each downstream, e.g. as exposed in `showServers()` for each individual server. I see there's a :getDrops() method available, but seemingly no :getQueries(). Is there another way we can get this number, while still being fast enough to execute on every query that goes to a downstream (when the load balancing takes place)?
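
To illustrate the effect, here is a small standalone Lua model of the averaging as I understand it. This is not dnsdist code; the 128-slot zero-filled window is my reading of the behaviour, and the latency values are invented for the example.

```lua
-- Model only: mean over a fixed 128-slot window whose unused slots are 0.
local WINDOW = 128

local function windowMean(samples)
  local sum = 0
  -- Only the most recent WINDOW samples count; missing slots contribute 0.
  local first = math.max(1, #samples - WINDOW + 1)
  for i = first, #samples do
    sum = sum + samples[i]
  end
  return sum / WINDOW
end

-- Helper: a list holding the same value n times.
local function repeated(value, n)
  local t = {}
  for i = 1, n do
    t[i] = value
  end
  return t
end

-- Nearby server whose window is full of 5 ms (5000 usec) answers: mean is 5000.
print(windowMean(repeated(5000, 128)))
-- Distant server with only 10 answers of 20 ms since the restart: mean is 1562.5.
print(windowMean(repeated(20000, 10)))
```

In this model the distant server looks more than three times faster than the nearby one until its window fills up, which is why knowing the number of queries per server would help: the policy could ignore the latency of a server that has answered fewer than 128 queries.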

2: A bit related to the first question: if we always select the lowest-latency server, the other downstreams no longer get queries, so we no longer get updated latency metrics for them. As you know, routing on the internet sometimes changes, and this may affect the latency. If we could e.g. take the health check measurements into account, this would be resolved at the same time, since we'd effectively always have fresh data.
The alternative, from my understanding, is to take a random sample of queries (e.g. 1-5% of cache misses) and send them to a random downstream, so that latency measurements keep taking place for every server.
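
A rough sketch of that workaround could look like the following. Again untested: the policy name and the 2% probe rate are placeholders I picked, and the fallback to leastOutstanding when nothing is up is my own assumption.

```lua
-- Sketch only: lowest-latency selection with a small share of random probes.
local EXPLORE_RATE = 0.02  -- hypothetical value: roughly 2% of cache misses

-- Collect the servers that are currently marked up.
local function upServers(servers)
  local up = {}
  for _, server in ipairs(servers) do
    if server:isUp() then
      up[#up + 1] = server
    end
  end
  return up
end

local function lowestLatencyWithProbe(servers, dq)
  local up = upServers(servers)
  if #up == 0 then
    -- Nothing is up: defer to the built-in policy (my assumption).
    return leastOutstanding.policy(servers, dq)
  end
  if math.random() < EXPLORE_RATE then
    -- Probe: send this query to a random online server to refresh its latency.
    return up[math.random(#up)]
  end
  local best = up[1]
  for i = 2, #up do
    if up[i]:getLatency() < best:getLatency() then
      best = up[i]
    end
  end
  return best
end

setServerPolicyLua("lowestLatencyWithProbe", lowestLatencyWithProbe)
```

The obvious downside is that the probed queries pay the latency of whichever server they land on, which is why I'd prefer to reuse the health check measurements instead.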

The use case here is that there's a bunch of distributed dnsdist servers, and behind those there's a bunch of downstreams located on various continents. In the case of a cache miss in dnsdist, I'd like to send the query to the downstream that's as close as possible to the dnsdist system (provided it's up).

Thanks in advance for any pointers or ideas shared!

Best Regards,
Lucas Rolff