[dnsdist] Latency based routing

Thu Jun 26 12:25:35 UTC 2025

Hello Remi,

Thanks for your reply!

> Not without patching the code. It would only take a few lines of C++ to make it available, though.

I'll take a look at the code and see whether it's something I could do myself and submit a PR for. Without having looked, I'd assume it's effectively the same as the getDrops() one, just using a different counter.
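To make the use case for such a counter concrete, here is a rough model of the selection logic. The real thing would be a Lua policy; Python is used only so the sketch runs on its own. The `Server` class, its `queries` field and `MIN_SAMPLES` are made up for illustration, and `queries` stands in for whatever a getQueries()-style accessor would return if it were added.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical threshold: how many answered queries we want to have seen
# before trusting a server's latency figure. Not a dnsdist setting.
MIN_SAMPLES = 128


@dataclass
class Server:
    """Stand-in for a dnsdist backend; all fields are illustrative."""
    name: str
    up: bool        # what :isUp() would report
    latency: float  # what :getLatency() would report
    queries: int    # what a (not yet existing) :getQueries() would report


def pick_server(servers: Sequence[Server]) -> Optional[Server]:
    """Return the lowest-latency server that is up, or None if all are down.

    Servers with fewer than MIN_SAMPLES answered queries are only used
    when no warmed-up server is available, because their latency figure
    is still biased towards zero shortly after a restart.
    """
    up = [s for s in servers if s.up]
    if not up:
        return None
    warmed = [s for s in up if s.queries >= MIN_SAMPLES]
    candidates = warmed or up
    return min(candidates, key=lambda s: s.latency)
```

On its own this starves servers that are not yet warmed up, since they never receive the queries they need to get there. That is where counting some health checks would help.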

> The reason I'm not very fond of this idea is that health-check queries are often not representative of actual traffic, and would thus skew the latency metrics in many cases. But I get your point, while most deployments get a lot of traffic and therefore don't really care about the short time it takes to get useful metrics, it might be different for low-traffic deployments or for backup servers. Do you think it might work if dnsdist were to update the latency from health-check queries if, and only if, there was no "regular" query processed by the server in a fixed interval (let's say 60 seconds? I have not really thought about it). The first health-check query would then of course automatically update the latency unless a "regular" query was processed before the health-check succeeded.

I think this could work. Obviously, in high(er)-traffic environments you'd likely still see plenty of downstream traffic and thus have the data available anyway. But I could also see it being beneficial to "sample" the health checks so that some of them become part of the measurement (even if that means setting some special flag). E.g. if we do health checks every second, 1 in 20 checks could count towards the latency measurement; that way somewhat idle downstreams still get measured periodically.
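A minimal sketch of the two variants discussed above, again in Python rather than actual dnsdist code. Everything here is hypothetical: the class, the method names and the constants are illustrative only. The smoothing is a simple moving average weighted over 128 samples, and seeding it with the first sample instead of zero is a choice made for this sketch, not necessarily what dnsdist does.

```python
from typing import Optional

IDLE_INTERVAL = 60.0  # seconds without regular traffic (the example value from the quote)
SAMPLE_EVERY = 20     # count 1 in 20 health checks (the example value above)
WEIGHT = 128          # smoothing window, assumed from the "last 128 queries" figure


class LatencyTracker:
    """Tracks one backend's smoothed latency (illustrative only)."""

    def __init__(self) -> None:
        self.latency: Optional[float] = None       # no measurement yet
        self.last_regular: Optional[float] = None  # time of last regular query
        self.checks_seen = 0

    def _update(self, sample: float) -> None:
        if self.latency is None:
            self.latency = sample  # seed with the first sample
        else:
            self.latency += (sample - self.latency) / WEIGHT

    def on_regular_query(self, sample: float, now: float) -> None:
        """Regular traffic always counts."""
        self.last_regular = now
        self._update(sample)

    def on_health_check_idle_only(self, sample: float, now: float) -> bool:
        """Variant 1: count a health check only if no regular query was
        seen within IDLE_INTERVAL (or ever). Returns True if counted."""
        idle = (self.last_regular is None
                or now - self.last_regular >= IDLE_INTERVAL)
        if idle:
            self._update(sample)
        return idle

    def on_health_check_sampled(self, sample: float) -> bool:
        """Variant 2: count every SAMPLE_EVERY-th health check,
        regardless of traffic. Returns True if counted."""
        self.checks_seen += 1
        if self.checks_seen % SAMPLE_EVERY == 0:
            self._update(sample)
            return True
        return False
```

The first variant keeps health checks out of the metric entirely while real traffic is flowing. The second always mixes in a small, fixed share of health-check samples, which is simpler but skews the metric slightly even on busy servers.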

Best Regards,
Lucas Rolff

> On 26 Jun 2025, at 14:18, Remi Gacogne via dnsdist <dnsdist at mailman.powerdns.com> wrote:
> 
> Hi Lucas,
> 
> On 6/26/25 13:02, Lucas Rolff via dnsdist wrote:
>> dnsdist by default uses leastOutstanding load balancing policy which in certain cases takes the lowest measured latency into account based on the last 128 queries answered by the downstream
> 
> Correct, if several servers have the same number of outstanding queries their latency is used to break the tie.
> 
>> My first (somewhat simple) question is, is there a way to make health checks count towards the latency measurements, currently it doesn't seem to take the health check queries into account in the latency metric. While I understand not everyone may want this, I wonder if there's some way (even if custom Lua) to make that happen.
> 
> Not without patching the code, I'm afraid.
> 
>> My second question, is more about a custom policy in Lua
>> Since latency based load balancing isn't currently a thing, this can be implemented into Lua, so that the selected downstream server will be the lowest latency (online) server.
>> This can be done by looping over the servers available, checking if the server is up using :isUp() and then using the :getLatency() to figure out the latency, this works great most of the time, however:
>> 1: If dnsdist restarts, the latency across all nodes will be super low, because it seems to use a fixed size list, where every "empty" value is `0`. As a result when the average is calculated across 128 values (many of which are zero initially), this may cause some weird routing.
> 
> True, it takes a few queries for the value to become useful.
> 
>> I wonder if there's a way to get (currently in Lua) the number of downstream queries (e.g. as exposed in `showServers()` for each individual server. I see there's a :getDrops() method available, but seemingly no :getQueries() - is there another way we can somehow get these, while still being fast enough to execute on every upstream query (when the load balancing takes place).
> 
> Not without patching the code. It would only take a few lines of C++ to make it available, though.
> 
>> 2: A bit related to the first question, if we then decide to select the lowest latency server, because the other downstreams no longer get queries, we also don't get updated latency metrics, as you know sometimes routing on the interwebs change, and this may affect the latency. Thus if we could e.g. take the health checking measurements into account, this would at the same time be resolved, since we'd always have fresh data effectively.
> 
> The reason I'm not very fond of this idea is that health-check queries are often not representative of actual traffic, and would thus skew the latency metrics in many cases. But I get your point, while most deployments get a lot of traffic and therefore don't really care about the short time it takes to get useful metrics, it might be different for low-traffic deployments or for backup servers. Do you think it might work if dnsdist were to update the latency from health-check queries if, and only if, there was no "regular" query processed by the server in a fixed interval (let's say 60 seconds? I have not really thought about it). The first health-check query would then of course automatically update the latency unless a "regular" query was processed before the health-check succeeded.
> 
> Best regards,
> -- 
> Remi Gacogne
> PowerDNS.COM BV - https://www.powerdns.com/
> _______________________________________________
> dnsdist mailing list
> dnsdist at mailman.powerdns.com
> https://mailman.powerdns.com/mailman/listinfo/dnsdist