[dnsdist] dnsdist firstAvailable order - apparent bug
Frank Even
lists+powerdns.com at elitists.org
Wed Nov 22 23:05:42 UTC 2017
It seems that, despite sending as plain text, alignment is an issue. If
you need a better view of the data, I've put it in a gist as well:
https://gist.github.com/dfjkl/1b45f83f8b0fd427191a8d63a0e6aaa5
On Wed, Nov 22, 2017 at 3:51 PM, Frank Even
<lists+powerdns.com at elitists.org> wrote:
> To Whomever May Be Concerned,
>
> In testing dnsdist (version 1.2.0) on a new system configured with
> setServerPolicy(firstAvailable), we noticed what seems like a pretty
> big bug. We've got a lot of nodes serving anycast addresses, and we are
> converting them from named listening on those addresses to named
> listening only on the local addresses, with dnsdist handling the
> anycast addresses. In this case, we've got a group of 24 servers in
> geographically diverse areas configured as backends to dnsdist in an
> ordered config, serving DNS requests from
> localhost/localcluster/remote systems.
>
> On a local node, my test was running "dig +short @anycastaddr
> google.com" in a loop. What we end up seeing is that when we kill named
> on the local system, queries jump to the last system in the ordered
> list. It does not matter which system is there, how high its latency is
> (we tried changing the configuration to use different systems), or
> which order number is configured (these were tested at 100, 90, and now
> 9, just to ensure it wasn't an error in sorting numbers). If we set the
> last system in the list administratively DOWN, the ordering works as
> expected. When that final server is put back in service, queries jump
> back to it and stay there until the local named instance is brought
> back up, at which point queries return to the local instance. Some data
> demonstrating this is below:
>
> # Queries going to localhost, first host in the ordered list.
>
>> showServers()
> #   Name  Address          State   Qps  Qlim  Ord  Wt  Queries  Drops  Drate    Lat  Outstanding  Pools
> 0         127.0.0.1:53     up      1.0     0    0   1       68      0    0.0    0.6            0
> 1         10.3.5.13:53     up      0.0     0    5   1        0      0    0.0    0.0            0
> 2         10.3.5.14:53     up      0.0     0    5   1        0      0    0.0    0.0            0
> 3         10.6.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
> 4         10.6.3.2:53      up      0.0     0    9   1        0      0    0.0    0.0            0
> 5         10.6.3.3:53      up      0.0     0    9   1        0      0    0.0    0.0            0
> 6         10.6.3.65:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 7         10.6.3.66:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 8         10.6.3.67:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 9         10.3.8.27:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 10        10.3.8.47:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 11        10.2.7.15:53     down    0.0     0    9   1        0      0    0.0    0.0            0
> 12        10.2.7.16:53     down    0.0     0    9   1        0      0    0.0    0.0            0
> 13        10.2.7.17:53     down    0.0     0    9   1        0      0    0.0    0.0            0
> 14        10.2.7.18:53     down    0.0     0    9   1        0      0    0.0    0.0            0
> 15        10.2.7.19:53     down    0.0     0    9   1        0      0    0.0    0.0            0
> 16        10.2.7.20:53     down    0.0     0    9   1        0      0    0.0    0.0            0
> 17        10.8.3.2:53      up      0.0     0    9   1        0      0    0.0    0.0            0
> 18        10.8.3.65:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 19        10.8.3.66:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 20        10.4.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
> 21        10.4.3.2:53      up      0.0     0    9   1        0      0    0.0    0.0            0
> 22        10.4.3.66:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 23        10.4.3.65:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 24        10.8.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
> All                                0.0                      68      0
>
> # Dropping local named instance
>
> ~]# service named stop ; dnsdist -c
> Redirecting to /bin/systemctl stop named.service
>> showServers()
> #   Name  Address          State   Qps  Qlim  Ord  Wt  Queries  Drops  Drate    Lat  Outstanding  Pools
> 0         127.0.0.1:53     up      1.1     0    0   1      106      0    0.0    0.5            1
> 1         10.3.5.13:53     up      0.0     0    5   1        0      0    0.0    0.0            0
> 2         10.3.5.14:53     up      0.0     0    5   1        0      0    0.0    0.0            0
> 3         10.6.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
> <snip>
> 22        10.4.3.66:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 23        10.4.3.65:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 24        10.8.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
> All                                1.0                     106      0
>
> # dnsdist drops localhost and the local system IP for this system out of rotation.
> # NOTE - queries are now diverted to the last node in the list. That
> # node has a higher order value than node 1, which is still up and
> # happily accepting requests. Node 1 is also, of course, lower latency
> # since it's one hop away. Yet we're crossing an ocean here for resolution.
>
>> showServers()
> #   Name  Address          State   Qps  Qlim  Ord  Wt  Queries  Drops  Drate    Lat  Outstanding  Pools
> 0         127.0.0.1:53     down    0.0     0    0   1      107      2    0.0    0.5            0
> 1         10.3.5.13:53     up      0.0     0    5   1        0      0    0.0    0.0            0
> 2         10.3.5.14:53     down    0.0     0    5   1        0      0    0.0    0.0            0
> 3         10.6.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
> <snip>
> 22        10.4.3.66:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 23        10.4.3.65:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 24        10.8.3.1:53      up      0.8     0    9   1       20      0    0.0   24.7            0
> All                                0.0                     127      2
>
> # Forcing down the last server in the list
>
>> getServer(24):setDown()
>> showServers()
> #   Name  Address          State   Qps  Qlim  Ord  Wt  Queries  Drops  Drate    Lat  Outstanding  Pools
> 0         127.0.0.1:53     down    0.0     0    0   1      107      2    0.0    0.5            0
> 1         10.3.5.13:53     up      0.0     0    5   1        2      0    0.0    0.0            0
> 2         10.3.5.14:53     down    0.0     0    5   1        0      0    0.0    0.0            0
> 3         10.6.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
> <snip>
> 22        10.4.3.66:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 23        10.4.3.65:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 24        10.8.3.1:53      DOWN    0.8     0    9   1       34      0    0.0   39.8            0
> All                                0.0                     143      2
>
> # Traffic shifts to the next lowest ordered system (#1), as it should.
>
>> showServers()
> #   Name  Address          State   Qps  Qlim  Ord  Wt  Queries  Drops  Drate    Lat  Outstanding  Pools
> 0         127.0.0.1:53     down    0.0     0    0   1      107      2    0.0    0.5            0
> 1         10.3.5.13:53     up      1.0     0    5   1       71      0    0.0    0.6            0
> 2         10.3.5.14:53     down    0.0     0    5   1        0      0    0.0    0.0            0
> 3         10.6.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
> <snip>
> 22        10.4.3.66:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 23        10.4.3.65:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 24        10.8.3.1:53      DOWN    0.0     0    9   1       34      0    0.0   39.8            0
> All                                0.0                     212      2
>
> # Putting the last system in the list (#24) back in the active state, and
> # dnsdist starts sending traffic to it again?!
>
>> getServer(24):setAuto()
>> showServers()
> #   Name  Address          State   Qps  Qlim  Ord  Wt  Queries  Drops  Drate    Lat  Outstanding  Pools
> 0         127.0.0.1:53     down    0.0     0    0   1      107      2    0.0    0.5            0
> 1         10.3.5.13:53     up      1.1     0    5   1       86      0    0.0    0.5            0
> 2         10.3.5.14:53     down    0.0     0    5   1        0      0    0.0    0.0            0
> 3         10.6.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
> <snip>
> 22        10.4.3.66:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 23        10.4.3.65:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 24        10.8.3.1:53      up      0.0     0    9   1       36      0    0.0   41.8            0
> All                                1.0                     229      2
>
> # ...and traffic keeps getting sent to it, despite its high latency and
> # its higher order number among the active systems.
>
>> showServers()
> #   Name  Address          State   Qps  Qlim  Ord  Wt  Queries  Drops  Drate    Lat  Outstanding  Pools
> 0         127.0.0.1:53     down    0.0     0    0   1      107      2    0.0    0.5            0
> 1         10.3.5.13:53     up      0.0     0    5   1       86      0    0.0    0.5            0
> 2         10.3.5.14:53     down    0.0     0    5   1        0      0    0.0    0.0            0
> 3         10.6.3.1:53      up      0.0     0    9   1        0      0    0.0    0.0            0
> <snip>
> 24        10.8.3.1:53      up      0.8     0    9   1      172      0    0.0  125.9            0
> All                                0.0                     365      2
>
> # If I add a dummy entry at the end of the list with a higher order value,
> # things work as they're supposed to (although I'm not completely
> # convinced it's taking latency into consideration when it fails over
> # among the systems with the same order and weight; it seems to jump
> # towards the end of that list regardless of latency).
>
>> showServers()
> #   Name  Address          State   Qps  Qlim  Ord  Wt  Queries  Drops  Drate    Lat  Outstanding  Pools
> 0         127.0.0.1:53     down    0.0     0    0   1       35      1    0.0    1.4            0
> 1         10.3.5.13:53     down    0.0     0    5   1       31      2    0.0    0.6            0
> 2         10.3.5.14:53     down    0.0     0    5   1        0      0    0.0    0.0            0
> 3         10.6.3.1:53      DOWN    0.0     0    6   1       32      0    0.0    0.4            0
> 4         10.3.8.27:53     up      1.0     0    7   1      239      0    0.0    0.4            0
> <snip>
> 21        10.4.3.2:53      up      0.0     0    9   1        0      0    0.0    0.0            0
> 22        10.4.3.66:53     up      0.0     0    9   1        0      0    0.0    0.0            0
> 23        10.4.3.65:53     up      0.0     0    9   1      431      0    0.0  148.6            0
> 24        10.8.3.1:53      up      0.0     0   19   1        0      0    0.0    0.0            0
> 25        127.0.0.255:53   down    0.0     0   99   1        0      0    0.0    0.0            0
> All                                0.0                     768      3
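
For anyone trying to reproduce this, here is a minimal sketch of the kind
of dnsdist configuration described in the quoted message. It is not the
actual configuration from the affected system. The listen address
192.0.2.53 is a placeholder standing in for the anycast address, only four
of the backends from the showServers() output are included, and the order
values are copied from the Ord column of the first listing.

```lua
-- Sketch only, NOT the real config from the affected system.
-- 192.0.2.53 is a placeholder (documentation range) for the anycast address.
setLocal("192.0.2.53:53")

-- firstAvailable is documented as picking the first server, sorted by
-- 'order', that has not exceeded its QPS limit.
setServerPolicy(firstAvailable)

-- Backends and order values as they appear in the showServers() output.
newServer({address="127.0.0.1:53", order=0})  -- local named instance
newServer({address="10.3.5.13:53", order=5})
newServer({address="10.3.5.14:53", order=5})
newServer({address="10.8.3.1:53", order=9})   -- last server in the list
```

With something like this loaded, stopping named on the local system and
running showServers() from the console (dnsdist -c) should show which
backend picks up the queries.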