[Pdns-users] Multiple masters

Sun Dec 16 23:22:22 UTC 2018

Hi Brian, all,

Now that I'm in front of the machine in question ... yes, specifically:

# rpm -qa | grep pdns
pdns-4.1.5-1pdns.el7.x86_64
pdns-backend-sqlite-4.1.5-1pdns.el7.x86_64

Here's an example of the behaviour I'm seeing. First I'll set up a test 
domain:

    # systemctl start pdns
    # sqlite3 /var/pdns/pdns.sqlite3 <<'EOF'
    begin;

    insert into domains (name, master, type) values (
             'foo.example',
             '10.200.200.109, 10.201.201.109',
             'SLAVE');

    insert into domainmetadata (domain_id, kind, content) values (
             (select id from domains where name='foo.example'),
             'AXFR-MASTER-TSIG',
             'lshaklnsm001-lshaklnss002');

    commit;
    EOF

Note that 10.200.200.109 is responding. 10.201.201.109 does not actually 
exist and can therefore never respond. I've not enabled IXFR for this 
example, but I have tried it and the behaviour is the same.

The domain loads correctly (but see below). Then, with tcpdump running 
and the log being tailed, I go to the BIND master, bump the serial and 
reload. I see this (the pdns log messages are the lines starting with 
"Dec 17"):

    11:05:53.303682 IP (tos 0x0, ttl 64, id 19847, offset 0, flags
    [none], proto UDP (17), length 231)
         10.200.200.109.10157 > 10.200.200.111.domain: [udp sum ok]
    46400 notify [b2&3=0x2400] [1a] [1au] SOA? foo.example. foo.example.
    [0s] SOA ns1.foo.example. soa.foo.example. 4 3600 600 3600000 300
    ar: lshaklnsm001-lshaklnss002. ANY [0s] TSIG hmac-sha512. fudge=300
    maclen=64 origid=46400 error=0 otherlen=0 (203)

    11:05:53.304484 IP (tos 0x0, ttl 64, id 32302, offset 0, flags [DF],
    proto UDP (17), length 187)
         10.200.200.111.domain > 10.200.200.109.10157: [bad udp cksum
    0xa725 -> 0x3175!] 46400 notify*- q: SOA? foo.example. 0/0/1 ar:
    lshaklnsm001-lshaklnss002. ANY [0s] TSIG hmac-sha512. fudge=300
    maclen=64 origid=46400 error=0 otherlen=0 (159)

    Dec 17 11:05:53 LSHAKLNSS002 pdns[28646]: Received secure NOTIFY for
    foo.example from 10.200.200.109, allowed by TSIG key
    'lshaklnsm001-lshaklnss002'

    11:05:53.981005 IP (tos 0x0, ttl 64, id 18800, offset 0, flags
    [DF], proto UDP (17), length 198)
         10.200.200.111.15760 > 10.201.201.109.domain: [bad udp cksum
    0xa831 -> 0x4a0e!] 9591 [2au] SOA? foo.example. ar: . OPT
    UDPsize=2800 DO, lshaklnsm001-lshaklnss002. ANY [0s] TSIG
    hmac-sha512. fudge=300 maclen=64 origid=9591 error=0 otherlen=0 (170)

    Dec 17 11:05:53 LSHAKLNSS002 pdns[28646]: 1 slave domain needs
    checking, 0 queued for AXFR
    Dec 17 11:05:56 LSHAKLNSS002 pdns[28646]: Received serial number
    updates for 0 zones, had 1 timeout
    Dec 17 11:05:56 LSHAKLNSS002 pdns[28646]: Unable to retrieve SOA for
    foo.example, this was the first time. NOTE: For every subsequent
    failed SOA check the domain will be suspended from freshness checks
    for 'num-errors x 10 seconds', with a maximum of 60 seconds.
    Skipping SOA checks until 1544997966

That is, I see the NOTIFY from the functional master, which pdns 
accepts and responds to correctly (ignore the bad checksums, that's just 
an artefact of hardware IP checksum offload). It then immediately 
requests an SOA from the /non-functional/ master and complains that it 
never got a response. And that's the last we hear from it until the next 
refresh interval. There's absolutely no attempt to query the other 
(functional) master, or otherwise act on the (TSIG-signed) NOTIFY.

Note that I see this behaviour when initially loading the domain as 
well. It seems to be a coin toss as to which master it queries, when 
really, it should be querying both every time, and tracking failures on 
a per master (or per master per domain) basis.

Pulling the SOA refresh and retry numbers down seems to help a bit 
(lowering the retry alone wasn't sufficient, which surprises me), but I 
still don't get an immediate response, and it comes at the cost of more 
useless SOA query traffic. Also, it often seems to get stuck querying 
the non-functional master for extended periods, and seems to pretty much 
always go to the non-functional master after a NOTIFY from the 
functional master. In short, it seems to do multiple master support 
wrong in every possible way. Notably:

  * it should always query the master that sent it a NOTIFY;
  * it should always query both masters on a refresh poll or on zone
    creation;
  * it should maintain a per-master (or per master per domain) retry
    algorithm on failure to respond; and
  * it should have separate TSIG keys per master (or per master per domain).
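
To make the first three points concrete, here is a rough sketch of the 
bookkeeping I have in mind. It is illustrative Python, not pdns code: 
the class and method names are made up, and the 10-second step and 
60-second cap are simply borrowed from the log message above.

```python
class MasterState:
    """Failure bookkeeping for one master (hypothetical, not pdns code)."""

    def __init__(self):
        self.failures = 0
        self.skip_until = 0.0


class MasterPool:
    """Tracks back-off per master instead of per domain."""

    def __init__(self, masters, step=10, cap=60):
        self.state = {m: MasterState() for m in masters}
        self.step = step  # seconds of back-off added per consecutive failure
        self.cap = cap    # upper bound on the back-off, in seconds

    def candidates(self, now, notified_by=None):
        """Masters to query: the NOTIFY sender first, then every other
        master that is not currently in back-off."""
        order = []
        if notified_by in self.state:
            order.append(notified_by)  # always ask whoever sent the NOTIFY
        for master, st in self.state.items():
            if master != notified_by and st.skip_until <= now:
                order.append(master)
        return order

    def record(self, master, ok, now):
        """Record the outcome of one SOA query against one master."""
        st = self.state[master]
        if ok:
            st.failures = 0
            st.skip_until = 0.0
        else:
            st.failures += 1
            st.skip_until = now + min(st.failures * self.step, self.cap)
```

With something like this, a timeout from 10.201.201.109 only suspends 
checks against 10.201.201.109; 10.200.200.109 is still polled on the 
next refresh and is always queried when it sends a NOTIFY.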

As far as I can see, it does none of those things. It just randomly 
polls one of the masters each time, and treats any failure to respond as 
a failure of all listed masters. That seems a bit pointless when you're 
using multiple masters to protect against a master becoming unavailable. 
Polling the /other/ master after a NOTIFY is very wrong, because the 
other master may not yet have the update that triggered the NOTIFY, and 
the poll will reset the refresh timer. (Even if it doesn't, the NOTIFY 
is essentially being ignored.)

Is multiple master support /really/ that badly broken, or am I missing 
something? Should I go to v4.2 and would it help? (I'm not a huge fan of 
using the latest bleeding edge, but this is /almost/ a show-stopper for 
using pdns in our application.)

I could do something utterly disgusting like externally monitoring the 
availability of the masters and updating the DNS and removing the 
non-functional masters from all zones and re-adding them when they 
become available. But really, pdns should be doing this automatically, 
and anyway, that doesn't solve the problem of querying the other master 
before it receives its update.
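
For the record, the disgusting workaround would look roughly like the 
following. This is only a sketch under assumptions: the table and column 
names are the ones from the sqlite setup above, the function names are 
made up, and a plain TCP connect to port 53 is a crude stand-in for a 
real health check (a signed SOA query would be better). I haven't 
checked whether pdns picks up the change without being poked.

```python
import socket
import sqlite3


def tcp53_up(host, timeout=2.0):
    """Crude liveness probe: can we open a TCP connection to port 53?"""
    try:
        with socket.create_connection((host, 53), timeout=timeout):
            return True
    except OSError:
        return False


def prune_masters(db, zone, all_masters, probe=tcp53_up):
    """Rewrite domains.master for `zone` so it lists only the masters
    that currently answer the probe. Returns the new value, or None if
    nothing was reachable (in which case the row is left untouched)."""
    live = [m for m in all_masters if probe(m)]
    if not live:
        return None
    value = ", ".join(live)
    with db:  # commits on success, rolls back on error
        db.execute(
            "update domains set master = ? where name = ?",
            (value, zone),
        )
    return value
```

Run from cron against /var/pdns/pdns.sqlite3 with the full list of 
masters for each zone, this would drop 10.201.201.109 from foo.example 
while it is down and put it back once it answers again.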

-- don

On 15/12/18 10:10 PM, Brian Candler wrote:
> On 15/12/2018 08:48, Don Stokes wrote:
>>
>> This is with the latest Centos 7 RPMs on the 4.1 branch.
> For the benefit of anyone looking at the list archive in future: I 
> *think* the OP is talking about version 4.1.5.

-- 
Don Stokes, don at nz.net <mailto:don at nz.net>, 021 796 072