[dnsdist] dnsdist firstAvailable order - apparent bug

Remi Gacogne remi.gacogne at powerdns.com
Mon Dec 4 11:30:27 UTC 2017


Hi Frank,

On 11/23/2017 12:05 AM, Frank Even wrote:
> Seems despite sending as text, alignment is an issue.  If needing a
> better view of the data, I've tossed it in a gist as well:
> https://gist.github.com/dfjkl/1b45f83f8b0fd427191a8d63a0e6aaa5

Thank you for your detailed report! This is indeed a bug: the last added
server was always sorted as if it had an order of 0. I just opened a
pull request [1] to fix this issue, so it should be fixed in master soon.
We provide packages at [2] if you want to give master a try once the
fix has been merged.

[1]: https://github.com/PowerDNS/pdns/pull/6043
[2]: https://repo.powerdns.com
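
To illustrate the behaviour described above, here is a minimal Python
sketch. It is only a model of the symptom, not dnsdist's actual code: the
Server class and the sort_with_bug, sort_fixed and first_available helpers
are hypothetical names, the addresses and orders are taken from the quoted
report, and QPS limits are ignored. It assumes the servers are kept in a
list that is stable-sorted by order and that firstAvailable picks the first
server that is up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    address: str
    order: int
    up: bool = True


def sort_with_bug(servers: List[Server]) -> List[Server]:
    # Assumption: models the reported symptom, where the last added
    # server is sorted as if its order were 0.
    if not servers:
        return []
    last = servers[-1]
    # sorted() is stable, so ties keep their insertion order.
    return sorted(servers, key=lambda s: 0 if s is last else s.order)


def sort_fixed(servers: List[Server]) -> List[Server]:
    # Every server is sorted by its configured order.
    return sorted(servers, key=lambda s: s.order)


def first_available(ordered: List[Server]) -> Optional[Server]:
    # Simplified firstAvailable: first server that is up (QPS limits ignored).
    return next((s for s in ordered if s.up), None)


# Addresses and orders from the quoted report; the local named is stopped.
servers = [
    Server("127.0.0.1:53", 0, up=False),
    Server("10.3.5.13:53", 5),
    Server("10.6.3.1:53", 9),
    Server("10.8.3.1:53", 9),  # last added
]

buggy_pick = first_available(sort_with_bug(servers))
fixed_pick = first_available(sort_fixed(servers))

# Workaround from the report: a dummy entry that is down, added last.
with_dummy = servers + [Server("127.0.0.255:53", 99, up=False)]
dummy_pick = first_available(sort_with_bug(with_dummy))

print(buggy_pick.address, fixed_pick.address, dummy_pick.address)
```

In this model the last added server jumps ahead of every server except
the local one, which matches what the report shows. A dummy entry that is
down and added last takes the order-0 slot itself, so selection falls
through to the correctly ordered servers.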

Best regards,

Remi

> On Wed, Nov 22, 2017 at 3:51 PM, Frank Even
> <lists+powerdns.com at elitists.org> wrote:
>> To Whomever May Be Concerned,
>>
>> In testing dnsdist (version 1.2.0) out on a new system configured with
>> the ServerPolicy(firstAvailable), we noticed what seems like a pretty
>> big bug.  We've got a lot of nodes servicing anycast addresses,
>> converting from named listening on those addresses to just listening
>> on the local addresses and then letting dnsdist handle listening on
>> the anycast addresses.  In this case, we've got a group of 24 servers
>> configured as backends to dnsdist in geographically diverse areas in
>> an ordered config serving DNS requests from
>> localhost/localcluster/remote systems.  On a local node, my test was
>> running a "dig +short @anycastaddr google.com" in a loop.  What we end
>> up seeing is that when we kill named on the local system, queries jump
>> to the last system in the ordered list.  It does not matter what
>> system is there or how latent it is (we tried changing up the
>> configuration to different systems), or the order number configured
>> (these were tested at 100, 90, and now 9 just to ensure it wasn't an
>> error in sorting numbers).  If we set the last system in the list to
>> administratively DOWN, then the ordering works as expected.  When the
>> final server in the list is put back in service, queries jump back to
>> the very last system in the list until the local named instance is
>> brought back up and then queries return there.  Some data below
>> demonstrating this:
>>
>> # Queries going to localhost, first host in the ordered list.
>>
>>> showServers()
>> #   Name                 Address                       State     Qps  Qlim Ord Wt    Queries   Drops Drate   Lat Outstanding Pools
>> 0                        127.0.0.1:53                     up     1.0     0   0  1         68       0   0.0   0.6           0
>> 1                        10.3.5.13:53                     up     0.0     0   5  1          0       0   0.0   0.0           0
>> 2                        10.3.5.14:53                     up     0.0     0   5  1          0       0   0.0   0.0           0
>> 3                        10.6.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
>> 4                        10.6.3.2:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
>> 5                        10.6.3.3:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
>> 6                        10.6.3.65:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 7                        10.6.3.66:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 8                        10.6.3.67:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 9                        10.3.8.27:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 10                       10.3.8.47:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 11                       10.2.7.15:53                   down     0.0     0   9  1          0       0   0.0   0.0           0
>> 12                       10.2.7.16:53                   down     0.0     0   9  1          0       0   0.0   0.0           0
>> 13                       10.2.7.17:53                   down     0.0     0   9  1          0       0   0.0   0.0           0
>> 14                       10.2.7.18:53                   down     0.0     0   9  1          0       0   0.0   0.0           0
>> 15                       10.2.7.19:53                   down     0.0     0   9  1          0       0   0.0   0.0           0
>> 16                       10.2.7.20:53                   down     0.0     0   9  1          0       0   0.0   0.0           0
>> 17                       10.8.3.2:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
>> 18                       10.8.3.65:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 19                       10.8.3.66:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 20                       10.4.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
>> 21                       10.4.3.2:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
>> 22                       10.4.3.66:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 23                       10.4.3.65:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 24                       10.8.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
>> All                                                              0.0                      68       0
>>
>> # Dropping local named instance
>>
>> ~]# service named stop ; dnsdist -c
>> Redirecting to /bin/systemctl stop named.service
>>> showServers()
>> #   Name                 Address                       State     Qps  Qlim Ord Wt    Queries   Drops Drate   Lat Outstanding Pools
>> 0                        127.0.0.1:53                     up     1.1     0   0  1        106       0   0.0   0.5           1
>> 1                        10.3.5.13:53                     up     0.0     0   5  1          0       0   0.0   0.0           0
>> 2                        10.3.5.14:53                     up     0.0     0   5  1          0       0   0.0   0.0           0
>> 3                        10.6.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
>> <snip>
>> 22                       10.4.3.66:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 23                       10.4.3.65:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 24                       10.8.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
>> All                                                              1.0                     106       0
>>
>> # dnsdist drops localhost and local system IP for this system out of rotation.
>> # NOTE - queries are now diverted to the last node in the list.  This
>> # system is ordered higher than node 1, still up and receiving
>> # requests happily.  It's also of course less latent since it's one
>> # hop away.  Yet, we're crossing an ocean here for resolution.
>>
>>> showServers()
>> #   Name                 Address                       State     Qps  Qlim Ord Wt    Queries   Drops Drate   Lat Outstanding Pools
>> 0                        127.0.0.1:53                   down     0.0     0   0  1        107       2   0.0   0.5           0
>> 1                        10.3.5.13:53                     up     0.0     0   5  1          0       0   0.0   0.0           0
>> 2                        10.3.5.14:53                   down     0.0     0   5  1          0       0   0.0   0.0           0
>> 3                        10.6.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
>> <snip>
>> 22                       10.4.3.66:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 23                       10.4.3.65:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 24                       10.8.3.1:53                      up     0.8     0   9  1         20       0   0.0  24.7           0
>> All                                                              0.0                     127       2
>>
>> # Forcing down the last server in the list
>>
>>> getServer(24):setDown()
>>> showServers()
>> #   Name                 Address                       State     Qps  Qlim Ord Wt    Queries   Drops Drate   Lat Outstanding Pools
>> 0                        127.0.0.1:53                   down     0.0     0   0  1        107       2   0.0   0.5           0
>> 1                        10.3.5.13:53                     up     0.0     0   5  1          2       0   0.0   0.0           0
>> 2                        10.3.5.14:53                   down     0.0     0   5  1          0       0   0.0   0.0           0
>> 3                        10.6.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
>> <snip>
>> 22                       10.4.3.66:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 23                       10.4.3.65:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 24                       10.8.3.1:53                    DOWN     0.8     0   9  1         34       0   0.0  39.8           0
>> All                                                              0.0                     143       2
>>
>> # Traffic shifts to the next lowest ordered system (#1), as it should.
>>
>>> showServers()
>> #   Name                 Address                       State     Qps  Qlim Ord Wt    Queries   Drops Drate   Lat Outstanding Pools
>> 0                        127.0.0.1:53                   down     0.0     0   0  1        107       2   0.0   0.5           0
>> 1                        10.3.5.13:53                     up     1.0     0   5  1         71       0   0.0   0.6           0
>> 2                        10.3.5.14:53                   down     0.0     0   5  1          0       0   0.0   0.0           0
>> 3                        10.6.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
>> <snip>
>> 22                       10.4.3.66:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 23                       10.4.3.65:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 24                       10.8.3.1:53                    DOWN     0.0     0   9  1         34       0   0.0  39.8           0
>> All                                                              0.0                     212       2
>>
>> # Putting last system in list (#24) back in active state, and dnsdist
>> # starts sending traffic to it again?!
>>
>>> getServer(24):setAuto()
>>> showServers()
>> #   Name                 Address                       State     Qps  Qlim Ord Wt    Queries   Drops Drate   Lat Outstanding Pools
>> 0                        127.0.0.1:53                   down     0.0     0   0  1        107       2   0.0   0.5           0
>> 1                        10.3.5.13:53                     up     1.1     0   5  1         86       0   0.0   0.5           0
>> 2                        10.3.5.14:53                   down     0.0     0   5  1          0       0   0.0   0.0           0
>> 3                        10.6.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
>> <snip>
>> 22                       10.4.3.66:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 23                       10.4.3.65:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 24                       10.8.3.1:53                      up     0.0     0   9  1         36       0   0.0  41.8           0
>> All                                                              1.0                     229       2
>>
>> # ...and traffic keeps getting sent to it, despite high latency and
>> # higher numerical order in the active systems list.
>>
>>> showServers()
>> #   Name                 Address                       State     Qps  Qlim Ord Wt    Queries   Drops Drate   Lat Outstanding Pools
>> 0                        127.0.0.1:53                   down     0.0     0   0  1        107       2   0.0   0.5           0
>> 1                        10.3.5.13:53                     up     0.0     0   5  1         86       0   0.0   0.5           0
>> 2                        10.3.5.14:53                   down     0.0     0   5  1          0       0   0.0   0.0           0
>> 3                        10.6.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
>> <snip>
>> 24                       10.8.3.1:53                      up     0.8     0   9  1        172       0   0.0 125.9           0
>> All                                                              0.0                     365       2
>>
>> # If I add a dummy entry at the end of the list w/ a higher priority,
>> # things work as they're supposed to (although, I'm not
>> # completely convinced it's taking latency in consideration when it
>> # fails over to all the same weighted systems, it seems to
>> # jump towards the end of that list regardless of latency).
>>
>>> showServers()
>> #   Name                 Address                       State     Qps  Qlim Ord Wt    Queries   Drops Drate   Lat Outstanding Pools
>> 0                        127.0.0.1:53                   down     0.0     0   0  1         35       1   0.0   1.4           0
>> 1                        10.3.5.13:53                   down     0.0     0   5  1         31       2   0.0   0.6           0
>> 2                        10.3.5.14:53                   down     0.0     0   5  1          0       0   0.0   0.0           0
>> 3                        10.6.3.1:53                    DOWN     0.0     0   6  1         32       0   0.0   0.4           0
>> 4                        10.3.8.27:53                     up     1.0     0   7  1        239       0   0.0   0.4           0
>> <snip>
>> 21                       10.4.3.2:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
>> 22                       10.4.3.66:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
>> 23                       10.4.3.65:53                     up     0.0     0   9  1        431       0   0.0 148.6           0
>> 24                       10.8.3.1:53                      up     0.0     0  19  1          0       0   0.0   0.0           0
>> 25                       127.0.0.255:53                 down     0.0     0  99  1          0       0   0.0   0.0           0
>> All                                                              0.0                     768       3


-- 
Remi Gacogne
PowerDNS.COM BV - https://www.powerdns.com/
