[dnsdist] dnsdist firstAvailable order - apparent bug

Frank Even lists+powerdns.com at elitists.org
Wed Nov 22 22:51:58 UTC 2017


To Whomever May Be Concerned,

While testing dnsdist (version 1.2.0) on a new system configured with
setServerPolicy(firstAvailable), we noticed what looks like a pretty
big bug.  We've got a lot of nodes servicing anycast addresses, and we
are converting them from having named listen on those addresses to
having named listen only on the local addresses, with dnsdist handling
the anycast addresses.  In this case, dnsdist has a group of 24 servers
in geographically diverse areas configured as backends, in an ordered
config (localhost, then local cluster, then remote systems).

On a local node, my test was running
"dig +short @anycastaddr google.com" in a loop.  What we see is that
when we kill named on the local system, queries jump to the last system
in the ordered list.  It does not matter which system is there or how
high its latency is (we tried putting different systems in that
position), or what order number is configured (we tested 100, 90, and
now 9, just to make sure it wasn't a number-sorting error).  If we set
the last system in the list administratively DOWN, the ordering works
as expected.  When that final server is put back in service, queries
jump back to it and stay there until the local named instance is
brought back up, at which point queries return to localhost.  Some data
below demonstrating this:
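
Before the console output, here is a minimal sketch of the kind of
configuration being described.  This is not the actual config from
these systems: the backend addresses and order values are copied from
the showServers() output below, the listen address is a placeholder,
and most of the order-9 backends are left out.

```lua
-- Sketch only; the real configuration was not posted.
-- 192.0.2.53 is a placeholder for the (unstated) anycast address.
addLocal("192.0.2.53:53")

-- Send each query to the first usable backend in ascending "order".
setServerPolicy(firstAvailable)

newServer({address="127.0.0.1:53", order=0})  -- local named
newServer({address="10.3.5.13:53", order=5})  -- local cluster
newServer({address="10.3.5.14:53", order=5})
newServer({address="10.6.3.1:53",  order=9})  -- remote systems
-- ... remaining remote backends, all with order=9 ...
newServer({address="10.8.3.1:53",  order=9})  -- last entry (#24)
```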

# Queries going to localhost, first host in the ordered list.

> showServers()
#   Name                 Address                       State     Qps  Qlim Ord Wt    Queries   Drops Drate   Lat Outstanding Pools
0                        127.0.0.1:53                     up     1.0     0   0  1         68       0   0.0   0.6           0
1                        10.3.5.13:53                     up     0.0     0   5  1          0       0   0.0   0.0           0
2                        10.3.5.14:53                     up     0.0     0   5  1          0       0   0.0   0.0           0
3                        10.6.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
4                        10.6.3.2:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
5                        10.6.3.3:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
6                        10.6.3.65:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
7                        10.6.3.66:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
8                        10.6.3.67:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
9                        10.3.8.27:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
10                       10.3.8.47:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
11                       10.2.7.15:53                   down     0.0     0   9  1          0       0   0.0   0.0           0
12                       10.2.7.16:53                   down     0.0     0   9  1          0       0   0.0   0.0           0
13                       10.2.7.17:53                   down     0.0     0   9  1          0       0   0.0   0.0           0
14                       10.2.7.18:53                   down     0.0     0   9  1          0       0   0.0   0.0           0
15                       10.2.7.19:53                   down     0.0     0   9  1          0       0   0.0   0.0           0
16                       10.2.7.20:53                   down     0.0     0   9  1          0       0   0.0   0.0           0
17                       10.8.3.2:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
18                       10.8.3.65:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
19                       10.8.3.66:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
20                       10.4.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
21                       10.4.3.2:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
22                       10.4.3.66:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
23                       10.4.3.65:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
24                       10.8.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
All                                                              0.0                      68       0

# Dropping local named instance

~]# service named stop ; dnsdist -c
Redirecting to /bin/systemctl stop named.service
> showServers()
#   Name                 Address                       State     Qps  Qlim Ord Wt    Queries   Drops Drate   Lat Outstanding Pools
0                        127.0.0.1:53                     up     1.1     0   0  1        106       0   0.0   0.5           1
1                        10.3.5.13:53                     up     0.0     0   5  1          0       0   0.0   0.0           0
2                        10.3.5.14:53                     up     0.0     0   5  1          0       0   0.0   0.0           0
3                        10.6.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
<snip>
22                       10.4.3.66:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
23                       10.4.3.65:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
24                       10.8.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
All                                                              1.0                     106       0

# dnsdist drops localhost and this system's local IP out of rotation.
# NOTE - queries are now diverted to the last node in the list (#24,
# order 9), even though node 1 (order 5) is still up and would happily
# take requests.  Node 1 is also lower latency, since it's one hop
# away.  Instead, we're crossing an ocean for resolution.

> showServers()
#   Name                 Address                       State     Qps  Qlim Ord Wt    Queries   Drops Drate   Lat Outstanding Pools
0                        127.0.0.1:53                   down     0.0     0   0  1        107       2   0.0   0.5           0
1                        10.3.5.13:53                     up     0.0     0   5  1          0       0   0.0   0.0           0
2                        10.3.5.14:53                   down     0.0     0   5  1          0       0   0.0   0.0           0
3                        10.6.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
<snip>
22                       10.4.3.66:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
23                       10.4.3.65:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
24                       10.8.3.1:53                      up     0.8     0   9  1         20       0   0.0  24.7           0
All                                                              0.0                     127       2

# Forcing down the last server in the list

> getServer(24):setDown()
> showServers()
#   Name                 Address                       State     Qps  Qlim Ord Wt    Queries   Drops Drate   Lat Outstanding Pools
0                        127.0.0.1:53                   down     0.0     0   0  1        107       2   0.0   0.5           0
1                        10.3.5.13:53                     up     0.0     0   5  1          2       0   0.0   0.0           0
2                        10.3.5.14:53                   down     0.0     0   5  1          0       0   0.0   0.0           0
3                        10.6.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
<snip>
22                       10.4.3.66:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
23                       10.4.3.65:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
24                       10.8.3.1:53                    DOWN     0.8     0   9  1         34       0   0.0  39.8           0
All                                                              0.0                     143       2

# Traffic shifts to the next lowest ordered system (#1), as it should.

> showServers()
#   Name                 Address                       State     Qps  Qlim Ord Wt    Queries   Drops Drate   Lat Outstanding Pools
0                        127.0.0.1:53                   down     0.0     0   0  1        107       2   0.0   0.5           0
1                        10.3.5.13:53                     up     1.0     0   5  1         71       0   0.0   0.6           0
2                        10.3.5.14:53                   down     0.0     0   5  1          0       0   0.0   0.0           0
3                        10.6.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
<snip>
22                       10.4.3.66:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
23                       10.4.3.65:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
24                       10.8.3.1:53                    DOWN     0.0     0   9  1         34       0   0.0  39.8           0
All                                                              0.0                     212       2

# Putting the last system in the list (#24) back in active state, and
# dnsdist starts sending traffic to it again?!

> getServer(24):setAuto()
> showServers()
#   Name                 Address                       State     Qps  Qlim Ord Wt    Queries   Drops Drate   Lat Outstanding Pools
0                        127.0.0.1:53                   down     0.0     0   0  1        107       2   0.0   0.5           0
1                        10.3.5.13:53                     up     1.1     0   5  1         86       0   0.0   0.5           0
2                        10.3.5.14:53                   down     0.0     0   5  1          0       0   0.0   0.0           0
3                        10.6.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
<snip>
22                       10.4.3.66:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
23                       10.4.3.65:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
24                       10.8.3.1:53                      up     0.0     0   9  1         36       0   0.0  41.8           0
All                                                              1.0                     229       2

# ...and traffic keeps getting sent to it, despite its high latency and
# its higher order number among the active systems.

> showServers()
#   Name                 Address                       State     Qps  Qlim Ord Wt    Queries   Drops Drate   Lat Outstanding Pools
0                        127.0.0.1:53                   down     0.0     0   0  1        107       2   0.0   0.5           0
1                        10.3.5.13:53                     up     0.0     0   5  1         86       0   0.0   0.5           0
2                        10.3.5.14:53                   down     0.0     0   5  1          0       0   0.0   0.0           0
3                        10.6.3.1:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
<snip>
24                       10.8.3.1:53                      up     0.8     0   9  1        172       0   0.0 125.9           0
All                                                              0.0                     365       2

# If I add a dummy entry at the end of the list with a higher order
# number, things work as they're supposed to (although I'm not
# completely convinced it takes latency into consideration when it
# fails over among systems with the same order and weight; it seems
# to jump towards the end of that list regardless of latency).

> showServers()
#   Name                 Address                       State     Qps  Qlim Ord Wt    Queries   Drops Drate   Lat Outstanding Pools
0                        127.0.0.1:53                   down     0.0     0   0  1         35       1   0.0   1.4           0
1                        10.3.5.13:53                   down     0.0     0   5  1         31       2   0.0   0.6           0
2                        10.3.5.14:53                   down     0.0     0   5  1          0       0   0.0   0.0           0
3                        10.6.3.1:53                    DOWN     0.0     0   6  1         32       0   0.0   0.4           0
4                        10.3.8.27:53                     up     1.0     0   7  1        239       0   0.0   0.4           0
<snip>
21                       10.4.3.2:53                      up     0.0     0   9  1          0       0   0.0   0.0           0
22                       10.4.3.66:53                     up     0.0     0   9  1          0       0   0.0   0.0           0
23                       10.4.3.65:53                     up     0.0     0   9  1        431       0   0.0 148.6           0
24                       10.8.3.1:53                      up     0.0     0  19  1          0       0   0.0   0.0           0
25                       127.0.0.255:53                 down     0.0     0  99  1          0       0   0.0   0.0           0
All                                                              0.0                     768       3
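
Besides the dummy entry, another possible stopgap would be a custom
Lua policy that does the selection explicitly.  The following is a
rough, untested sketch, not the built-in firstAvailable code: the
function name lowestOrderUp is made up for illustration, it assumes
the Server objects expose isUp() and an "order" attribute as in the
dnsdist Lua documentation, and unlike firstAvailable it ignores any
QPS limits.  It also does nothing special when no backend is up.

```lua
-- Hypothetical workaround policy: pick the backend that is up and has
-- the lowest "order" value.  Ties keep the earliest entry in the list.
function lowestOrderUp(servers, dq)
  local best = nil
  for _, s in ipairs(servers) do
    if s:isUp() and (best == nil or s.order < best.order) then
      best = s
    end
  end
  return best  -- nil if every backend is down (not handled here)
end

setServerPolicyLua("lowestOrderUp", lowestOrderUp)
```

With the state shown in the earlier output (localhost down, 10.3.5.13
up at order 5, 10.8.3.1 up at order 9), a policy like this would be
expected to choose 10.3.5.13 rather than the last entry.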

