[dnsdist] dnsdist Drops, revisited

Fri Mar 6 15:17:08 UTC 2020

Hi Remi,

Thanks for your clarifications (see inline below).

> On 6 Mar 2020, at 14:26, Remi Gacogne via dnsdist <dnsdist at mailman.powerdns.com> wrote:
> 
> Hi,
> 
> On 3/6/20 8:09 AM, Fredrik Pettai via dnsdist wrote:
>>> On 6 Mar 2020, at 05:42, Michael Van Der Beek <michael.van at antlabs.com> wrote:
>>> Have you noticed this setting on dnsdist?
>>> setUDPTimeout(num)
>> 
>> Yes, I did, but I didn’t play around with that before I sent the email to the mailing list
>> 
>>> Set the maximum time dnsdist will wait for a response from a backend over UDP, in seconds. Defaults to 2
>>> I'm not sure if timeouts are classified as drops. My guess probably, because it didn't get a response in time.
>> 
>> Yes they are.
> 
> "Drops", as reported by dnsdist, are almost always caused by the backend
> not responding fast enough. On some setups, dealing with 100k+ qps, it
> might also be caused by dnsdist not processing the responses fast
> enough, but that's very easy to spot because at least one of the dnsdist
> threads will use ~100% of one core.
> 
>>> Since your backend is a recursor, there are times when the recursor cannot reach, or encounters, a non-responsive authoritative server. Unbound has an exponential backoff when querying such servers. I think it starts with 10s.
>>> https://nlnetlabs.nl/documentation/unbound/info-timeout/
>>> 
>>> I would suggest you set setUDPTimeout(10) in dnsdist; frankly, if Unbound cannot respond to you in < 10 seconds, most likely the target authoritative server is not responding.
>> 
>> Good point. While I didn’t turn to the Unbound documentation (thanks for the pointer), I played around with the UDP timeout setting yesterday,
>> first increasing it to setUDPTimeout(5), which yielded better results in terms of Drops (and increased the latency), and then later to 15, just to be sure that Unbound really should be done with the queries, and noticed that the Drops became a lot fewer (and latency increased again). But as you suggest, setUDPTimeout(10) is probably the best setting.
> 
> OK so that settles it, your backends are not responding fast enough to
> some queries. I would really advise you to try to understand why the
> backend is taking so long to respond, instead of tuning dnsdist via
> setUDPTimeout(), because a latency greater than 2s is going to cause a
> lot of issues anyway.

Right, in this case the #1 reason for queries that don’t complete in under 2s is the lookups generated by some MX servers and the software running on them.
A lot of crappy stuff out on the Internet is in contact with those servers/services, so broken reverse zones and badly set up domains that send spam are what I see in topSlow() all the time.

This brings back one of the (last) questions in my original email, which was:
Is there a simple way to move those long-tail queries / DNS clients into a "slow pool"?
Or maybe I should rephrase it as:
From a dnsdist point of view, would it be a good idea to move clients that ask lots of questions about badly functioning domains to their own worker pool?

I can’t seem to find any ready-to-use Rule/Action for sending clients that cause X amount of SERVFAILs (or timeouts) to a pool via PoolAction.
(I do see that it's possible to block clients with such a query pattern (SERVFAIL/s), but that’s not the right solution for this service.)
(I’m guessing "anything can be done" with some clever Lua scripting, but that’s not really the same as "simple".)
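
For illustration, the Lua route could be sketched roughly as below. This is a minimal, untested sketch rather than a ready-made rule: the backend addresses, the pool name "slow", the 60s window and the threshold of 50 are all placeholders, and the two function names are made up for the example. It only counts SERVFAIL responses; queries that time out never produce a response, so they would not be seen by the response rule.

```lua
-- Sketch only: addresses, pool name, window and threshold are assumptions.
local servfails = {}          -- client address -> SERVFAIL count in the current window
local windowStart = os.time()
local WINDOW = 60             -- length of the counting window, in seconds
local THRESHOLD = 50          -- SERVFAILs per window before a client is rerouted

newServer({address="192.0.2.1:53"})               -- backend in the default pool
newServer({address="192.0.2.2:53", pool="slow"})  -- backend reserved for slow clients

-- Response side: count SERVFAIL answers per client address.
local function countServfail(dr)
  local now = os.time()
  if now - windowStart >= WINDOW then
    servfails = {}            -- start a fresh window, forgetting old counts
    windowStart = now
  end
  local client = dr.remote:toString()
  servfails[client] = (servfails[client] or 0) + 1
  return DNSResponseAction.None, ""
end
addResponseAction(RCodeRule(DNSRCode.SERVFAIL), LuaResponseAction(countServfail))

-- Query side: send clients over the threshold to the "slow" pool.
local function routeSlowClients(dq)
  if (servfails[dq.remote:toString()] or 0) >= THRESHOLD then
    return DNSAction.Pool, "slow"
  end
  return DNSAction.None, ""   -- fall through to the remaining rules
end
addAction(AllRule(), LuaAction(routeSlowClients))
```

Note that clearing the table at the end of each window sends every client back to the default pool until it crosses the threshold again, and that running a Lua function on every query has a cost at high query rates.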

I thought of using an NMG (NetmaskGroup) to statically map such clients (the MX servers) into their own worker pool, but I didn’t get that to work :(
(perhaps I did it wrong, or I misunderstand the function of an NMG)
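
A static mapping of that kind might look roughly like the following sketch. It assumes a second backend assigned to a pool named "slow"; all addresses and the pool name are placeholders, not taken from the actual setup.

```lua
-- Sketch only: addresses and the pool name "slow" are placeholders.
newServer({address="192.0.2.1:53"})               -- backend in the default pool
newServer({address="192.0.2.2:53", pool="slow"})  -- backend reserved for the MX servers

-- Collect the client addresses (the MX servers) in a NetmaskGroup.
local mxClients = newNMG()
mxClients:addMask("198.51.100.10/32")
mxClients:addMask("198.51.100.11/32")

-- Match on the source address of the query and route it to the "slow" pool.
addAction(NetmaskGroupRule(mxClients), PoolAction("slow"))
```

Rules are evaluated in order, so an earlier rule that already selects a pool or answers the query would prevent this one from ever being reached.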

Re,
Fredrik