<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    Hi Brian, all,<br>

    <br>

    Now that I'm in front of the machine in question ... yes,

    specifically:<br>

    <br>

    # rpm -qa | grep pdns<br>

    pdns-4.1.5-1pdns.el7.x86_64<br>

    pdns-backend-sqlite-4.1.5-1pdns.el7.x86_64<br>

    <br>

    <br>

    Here's an example of the behaviour I'm seeing. First I'll set up a

    test domain:<br>

    <blockquote><tt># systemctl start pdns</tt><tt><br>

      </tt><tt># sqlite3 /var/pdns/pdns.sqlite3 &lt;&lt;'EOF'</tt><tt><br>

      </tt><tt>begin;</tt><tt><br>

      </tt><tt><br>

      </tt><tt>insert into domains (name, master, type) values (</tt><tt><br>

      </tt><tt>        'foo.example',</tt><tt><br>

      </tt><tt>        '10.200.200.109, 10.201.201.109',</tt><tt><br>

      </tt><tt>        'SLAVE');</tt><tt><br>

      </tt><tt><br>

      </tt><tt>insert into domainmetadata (domain_id, kind, content)

        values (</tt><tt><br>

      </tt><tt>        (select id from domains where

        name='foo.example'),</tt><tt><br>

      </tt><tt>        'AXFR-MASTER-TSIG',</tt><tt><br>

      </tt><tt>        'lshaklnsm001-lshaklnss002');</tt><tt><br>

      </tt><tt><br>

      </tt><tt>commit;</tt><tt><br>

      </tt><tt>EOF</tt><tt><br>

      </tt></blockquote>

    Note that 10.200.200.109 is up and responding, whereas 10.201.201.109

    does not actually exist and so can never respond. I've not enabled IXFR

    for this example, but I have tried it with IXFR enabled and the behaviour is the same.<br>

    <br>

    The domain loads correctly (but see below). Then, with tcpdump running

    and the log being tailed, I go to the BIND master, bump the serial and

    reload. I see this (log messages in bold):<br>

    <blockquote><tt>11:05:53.303682 IP (tos 0x0, ttl 64, id 19847,

        offset 0, flags [none], proto UDP (17), length 231)<br>

            10.200.200.109.10157 > 10.200.200.111.domain: [udp sum

        ok] 46400 notify [b2&3=0x2400] [1a] [1au] SOA? foo.example.

        foo.example. [0s] SOA ns1.foo.example. soa.foo.example. 4 3600

        600 3600000 300 ar: lshaklnsm001-lshaklnss002. ANY [0s] TSIG

        hmac-sha512. fudge=300 maclen=64 origid=46400 error=0 otherlen=0

        (203)<br>

        <br>

        11:05:53.304484 IP (tos 0x0, ttl 64, id 32302, offset 0, flags

        [DF], proto UDP (17), length 187)<br>

            10.200.200.111.domain > 10.200.200.109.10157: [bad udp

        cksum 0xa725 -> 0x3175!] 46400 notify*- q: SOA? foo.example.

        0/0/1 ar: lshaklnsm001-lshaklnss002. ANY [0s] TSIG hmac-sha512.

        fudge=300 maclen=64 origid=46400 error=0 otherlen=0 (159)<b><br>

          <br>

          Dec 17 11:05:53 LSHAKLNSS002 pdns[28646]: Received secure

          NOTIFY for foo.example from 10.200.200.109, allowed by TSIG

          key 'lshaklnsm001-lshaklnss002'<br>

          <br>

        </b>11:05:53.981005 IP (tos 0x0, ttl 64, id 18800, offset 0,

        flags [DF], proto UDP (17), length 198)<br>

            10.200.200.111.15760 > 10.201.201.109.domain: [bad udp

        cksum 0xa831 -> 0x4a0e!] 9591 [2au] SOA? foo.example. ar: .

        OPT UDPsize=2800 DO, lshaklnsm001-lshaklnss002. ANY [0s] TSIG

        hmac-sha512. fudge=300 maclen=64 origid=9591 error=0 otherlen=0

        (170)<b><br>

          <br>

          Dec 17 11:05:53 LSHAKLNSS002 pdns[28646]: 1 slave domain needs

          checking, 0 queued for AXFR<br>

          Dec 17 11:05:56 LSHAKLNSS002 pdns[28646]: Received serial

          number updates for 0 zones, had 1 timeout<br>

          Dec 17 11:05:56 LSHAKLNSS002 pdns[28646]: Unable to retrieve

          SOA for foo.example, this was the first time. NOTE: For every

          subsequent failed SOA check the domain will be suspended from

          freshness checks for 'num-errors x 10 seconds', with a maximum

          of 60 seconds. Skipping SOA checks until 1544997966<br>

        </b><br>

      </tt></blockquote>

    That is, I see the NOTIFY from the functional master, which pdns

    accepts and responds to correctly (ignore the bad checksums; they're

    just an artefact of hardware IP checksum offload). It then

    immediately requests an SOA from the <i>non-functional</i> master

    and complains that it never got a response. And that's the last we

    hear from it until the next refresh interval. There's absolutely no

    attempt to query the other (functional) master, or otherwise act on

    the (TSIG-signed) NOTIFY.<br>

    <br>
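    To make the timing in that last log line concrete, here is a minimal
    sketch of the back-off arithmetic as the log message itself describes it
    ('num-errors x 10 seconds', capped at 60). The function name and the
    UTC+13 offset are my own assumptions for illustration (the slave logs in
    local time and I'm assuming NZDT); this is not pdns code.<br>

```python
from datetime import datetime, timedelta, timezone


def suspension_seconds(num_errors, step=10, cap=60):
    """Back-off as worded in the log: num-errors x 10 seconds, maximum 60."""
    return min(num_errors * step, cap)


# Assumption: the log timestamps are local time at UTC+13 (NZDT).
NZDT = timezone(timedelta(hours=13))

# Time of the "Unable to retrieve SOA" message, and the epoch value it printed.
logged = datetime(2018, 12, 17, 11, 5, 56, tzinfo=NZDT)
resume = datetime.fromtimestamp(1544997966, tz=NZDT)

# First failure, so the gap should equal suspension_seconds(1).
gap = (resume - logged).total_seconds()
print(resume.time(), gap, suspension_seconds(1))
```

    Under those assumptions the "skipping until" value is just ten seconds
    after the failure, so the back-off itself is short; the problem is that
    nothing then goes back and asks the master that sent the NOTIFY.<br>
    <br>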

    Note that I see this behaviour when initially loading the domain as

    well. It seems to be a coin toss as to which master it queries, when

    really, it should be querying both every time, and tracking failures

    on a per-master (or per-master-per-domain) basis.<br>

    <br>

    Pulling the SOA refresh and retry numbers down seems to help a bit

    (the retry alone wasn't sufficient, which surprises me), but I still

    don't get an immediate response, and it comes at the cost of more

    useless SOA query traffic. Also, it often seems to get stuck querying the

    non-functional master for extended periods, and seems to pretty much

    always go to the non-functional master after a NOTIFY from the

    functional master. In short, it seems to do multiple master support

    wrong in every possible way. Notably:<br>

    <ul>

      <li>it should always query the master that sent it a NOTIFY;</li>

      <li>it should always query both masters on a refresh poll or on

        zone creation;<br>

      </li>

      <li>it should maintain a per-master (or per master per domain)

        retry algorithm on failure to respond; and<br>

      </li>

      <li>it should have separate TSIG keys per master (or per master

        per domain).<br>

      </li>

    </ul>

    As far as I can see, it does none of those things. It just randomly

    polls one of the masters each time, and treats any failure to

    respond as a failure of all listed masters, which seems a bit

    pointless when you're using multiple masters to protect against a

    master becoming unavailable. Polling the <i>other</i> master after

    a NOTIFY is very wrong, because the other master may not have the

    update that triggered the NOTIFY yet, and the poll will reset the

    refresh timer. (Even if it doesn't, the NOTIFY is essentially being

    ignored.)<br>

    <br>
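    To be clear about what I mean by per-master tracking, here is a rough
    sketch of the selection logic I'd expect. It is purely illustrative:
    the class and method names are made up, it does no network I/O, and it
    borrows the 10-second step and 60-second cap from the log message above.
    It is not how pdns is implemented.<br>

```python
from dataclasses import dataclass, field


@dataclass
class MasterState:
    failures: int = 0
    skip_until: float = 0.0  # epoch seconds; 0 means "not backed off"


@dataclass
class SlaveZone:
    name: str
    masters: list
    state: dict = field(default_factory=dict)

    def _state(self, master):
        return self.state.setdefault(master, MasterState())

    def candidates(self, now, notified_by=None):
        """Masters to poll: the NOTIFY sender if there is one, else all live ones."""
        if notified_by in self.masters:
            # A NOTIFY is acted on directly, regardless of any back-off.
            return [notified_by]
        return [m for m in self.masters if now >= self._state(m).skip_until]

    def record(self, master, ok, now):
        """Track success or failure for this master only."""
        st = self._state(master)
        if ok:
            st.failures, st.skip_until = 0, 0.0
        else:
            st.failures += 1
            st.skip_until = now + min(st.failures * 10, 60)


zone = SlaveZone("foo.example", ["10.200.200.109", "10.201.201.109"])
zone.record("10.201.201.109", ok=False, now=100.0)
print(zone.candidates(now=101.0))
print(zone.candidates(now=101.0, notified_by="10.200.200.109"))
```

    The point is only that a timeout from 10.201.201.109 should back off
    10.201.201.109, and should never stop the slave from asking
    10.200.200.109, least of all straight after 10.200.200.109 sent a NOTIFY.<br>
    <br>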

    Is multiple master support <i>really</i> that badly broken, or am I

    missing something? Should I go to v4.2 and would it help? (I'm not a

    huge fan of using the latest bleeding edge, but this is <i>almost</i>

    a show-stopper for using pdns in our application.)<br>

    <br>

    <br>

    I could do something utterly disgusting like externally monitoring

    the availability of the masters, then updating the DNS to remove

    the non-functional masters from all zones and re-add them when

    they become available. But really, pdns should be doing this

    automatically, and anyway, that doesn't solve the problem of

    querying the other master before it receives its update.<br>

    <br>
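    For what it's worth, the disgusting workaround would look something like
    the sketch below. The function names and the TCP-connect probe are my own
    invention for illustration; the only things taken from the real setup are
    the <tt>domains</tt> table and its <tt>name</tt> and <tt>master</tt>
    columns, as used in the insert at the top of this message. I'm also
    assuming, without having checked, that pdns picks up a changed
    <tt>master</tt> value without further prodding.<br>

```python
import socket
import sqlite3


def reachable(host, port=53, timeout=2.0):
    """Crude liveness probe: can we open a TCP connection to the DNS port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def sync_masters(conn, zone, configured, probe=reachable):
    """Rewrite domains.master for one zone to list only masters that answer.

    'configured' is the full master list, which has to be kept somewhere
    outside the domains table so dead masters can be re-added later.
    """
    live = [m for m in configured if probe(m)]
    if not live:
        return None  # nothing answers: leave the row alone rather than empty it
    value = ", ".join(live)
    conn.execute("update domains set master = ? where name = ?", (value, zone))
    conn.commit()
    return value
```

    Run from cron against /var/pdns/pdns.sqlite3, that would keep dead
    masters out of the list, but it does nothing about the wrong master
    being polled after a NOTIFY.<br>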

    <br>

    -- don<br>

    <br>

    <br>

    <br>

    <br>

    <br>

    <br>

    <div class="moz-cite-prefix">On 15/12/18 10:10 PM, Brian Candler

      wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:b214a274-5df6-e704-c361-bc64dcc0d7f2@pobox.com">On

      15/12/2018 08:48, Don Stokes wrote:

      <br>

      <blockquote type="cite">

        <br>

        This is with the latest Centos 7 RPMs on the 4.1 branch.

        <br>

      </blockquote>

      For the benefit of anyone looking at the list archive in future: I

      *think* the OP is talking about version 4.1.5.

      <br>

    </blockquote>

    <br>

    <div class="moz-signature">-- <br>

      Don Stokes, <a href="mailto:don@nz.net">don@nz.net</a>, 021 796

      072</div>

  </body>

</html>