<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hi Brian, all,<br>
<br>
Now that I'm in front of the machine in question ... yes,
specifically:<br>
<br>
# rpm -qa | grep pdns<br>
pdns-4.1.5-1pdns.el7.x86_64<br>
pdns-backend-sqlite-4.1.5-1pdns.el7.x86_64<br>
<br>
<br>
Here's an example of the behaviour I'm seeing. First I'll set up a
test domain:<br>
<blockquote><tt># systemctl start pdns</tt><tt><br>
</tt><tt># sqlite3 /var/pdns/pdns.sqlite3 &lt;&lt;'EOF'</tt><tt><br>
</tt><tt>begin;</tt><tt><br>
</tt><tt><br>
</tt><tt>insert into domains (name, master, type) values (</tt><tt><br>
</tt><tt> 'foo.example',</tt><tt><br>
</tt><tt> '10.200.200.109, 10.201.201.109',</tt><tt><br>
</tt><tt> 'SLAVE');</tt><tt><br>
</tt><tt><br>
</tt><tt>insert into domainmetadata (domain_id, kind, content)
values (</tt><tt><br>
</tt><tt> (select id from domains where
name='foo.example'),</tt><tt><br>
</tt><tt> 'AXFR-MASTER-TSIG',</tt><tt><br>
</tt><tt> 'lshaklnsm001-lshaklnss002');</tt><tt><br>
</tt><tt><br>
</tt><tt>commit;</tt><tt><br>
</tt><tt>EOF</tt><tt><br>
</tt></blockquote>
Note that 10.200.200.109 is responding. 10.201.201.109 does not
actually exist and can therefore never respond. I've not enabled
IXFR for this example, but I have tried it with IXFR enabled and the
behaviour is the same.<br>
<br>
The domain loads correctly (but see below). Then, with tcpdump
running and the log being tailed, I go to the BIND master, bump the
serial and reload. I see this (log messages in bold):<br>
<blockquote><tt>11:05:53.303682 IP (tos 0x0, ttl 64, id 19847,
offset 0, flags [none], proto UDP (17), length 231)<br>
10.200.200.109.10157 > 10.200.200.111.domain: [udp sum
ok] 46400 notify [b2&3=0x2400] [1a] [1au] SOA? foo.example.
foo.example. [0s] SOA ns1.foo.example. soa.foo.example. 4 3600
600 3600000 300 ar: lshaklnsm001-lshaklnss002. ANY [0s] TSIG
hmac-sha512. fudge=300 maclen=64 origid=46400 error=0 otherlen=0
(203)<br>
<br>
11:05:53.304484 IP (tos 0x0, ttl 64, id 32302, offset 0, flags
[DF], proto UDP (17), length 187)<br>
10.200.200.111.domain > 10.200.200.109.10157: [bad udp
cksum 0xa725 -> 0x3175!] 46400 notify*- q: SOA? foo.example.
0/0/1 ar: lshaklnsm001-lshaklnss002. ANY [0s] TSIG hmac-sha512.
fudge=300 maclen=64 origid=46400 error=0 otherlen=0 (159)<b><br>
<br>
Dec 17 11:05:53 LSHAKLNSS002 pdns[28646]: Received secure
NOTIFY for foo.example from 10.200.200.109, allowed by TSIG
key 'lshaklnsm001-lshaklnss002'<br>
<br>
</b>11:05:53.981005 IP (tos 0x0, ttl 64, id 18800, offset 0,
flags [DF], proto UDP (17), length 198)<br>
10.200.200.111.15760 > 10.201.201.109.domain: [bad udp
cksum 0xa831 -> 0x4a0e!] 9591 [2au] SOA? foo.example. ar: .
OPT UDPsize=2800 DO, lshaklnsm001-lshaklnss002. ANY [0s] TSIG
hmac-sha512. fudge=300 maclen=64 origid=9591 error=0 otherlen=0
(170)<b><br>
<br>
Dec 17 11:05:53 LSHAKLNSS002 pdns[28646]: 1 slave domain needs
checking, 0 queued for AXFR<br>
Dec 17 11:05:56 LSHAKLNSS002 pdns[28646]: Received serial
number updates for 0 zones, had 1 timeout<br>
Dec 17 11:05:56 LSHAKLNSS002 pdns[28646]: Unable to retrieve
SOA for foo.example, this was the first time. NOTE: For every
subsequent failed SOA check the domain will be suspended from
freshness checks for 'num-errors x 10 seconds', with a maximum
of 60 seconds. Skipping SOA checks until 1544997966<br>
</b><br>
</tt></blockquote>
That is, I see the NOTIFY from the functional master, which pdns
accepts and responds to correctly (ignore the bad checksums, that's
just an artefact of hardware IP checksum offload). It then
immediately requests an SOA from the <i>non-functional</i> master
and complains that it never got a response. And that's the last we
hear from it until the next refresh interval. There's absolutely no
attempt to query the other (functional) master, or otherwise act on
the (TSIG-signed) NOTIFY.<br>
<br>
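As an aside, the suspension rule quoted in the last log message above
can be sketched as follows. This only illustrates the wording in the
log ("num-errors x 10 seconds", with a maximum of 60 seconds); it is
not taken from the pdns source, and the function name
<tt>suspension_seconds</tt> is hypothetical. If the logging host is on
UTC+13 (an assumption), the "until" timestamp decodes to 11:06:06
local time, i.e. 10 seconds after the 11:05:56 log line, which matches
a first failure.<br>
<pre>
```python
from datetime import datetime, timezone


def suspension_seconds(num_errors):
    """Seconds a zone is skipped after num_errors failed SOA checks,
    following the log wording: num-errors x 10 seconds, capped at 60."""
    return min(num_errors * 10, 60)


# "Skipping SOA checks until 1544997966" from the log, decoded as UTC.
skip_until = datetime.fromtimestamp(1544997966, tz=timezone.utc)
print(skip_until.isoformat())

# Suspension after the 1st, 2nd, 6th and 7th consecutive failure.
for n in (1, 2, 6, 7):
    print(n, suspension_seconds(n))
```
</pre>
<br>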
Note that I see this behaviour when initially loading the domain as
well. It seems to be a coin toss as to which master it queries, when
really, it should be querying both every time, and tracking failures
on a per master (or per master per domain) basis. <br>
<br>
Pulling the SOA refresh and retry numbers down seems to help a bit
(the retry alone wasn't sufficient, which surprises me), but I still
don't get an immediate response, and it comes at the cost of more
useless SOA query traffic. Also, it often seems to get stuck querying
the non-functional master for extended periods, and seems to pretty
much always go to the non-functional master after a NOTIFY from the
functional master. In short, it seems to do multiple master support
wrong in every possible way. Notably:<br>
<ul>
<li>it should always query the master that sent it a NOTIFY;</li>
<li>it should always query both masters on a refresh poll or on
zone creation;<br>
</li>
<li>it should maintain a per-master (or per master per domain)
retry algorithm on failure to respond; and<br>
</li>
<li>it should have separate TSIG keys per master (or per master
per domain).<br>
</li>
</ul>
As far as I can see, it does none of those things. It just randomly
polls one of the masters each time, and treats any failure to
respond as a failure of all listed masters, which seems a bit
pointless when you're using multiple masters to protect against a
master becoming unavailable. Polling the <i>other</i> master after
a NOTIFY is very wrong, because the other master may not have the
update that triggered the NOTIFY yet, and the poll will reset the
refresh timer. (Even if it doesn't, the NOTIFY is essentially being
ignored.)<br>
<br>
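To make the first three points in the list above concrete, here is a
minimal sketch of the selection logic I have in mind. It is not how
pdns is implemented; <tt>MasterState</tt> and
<tt>masters_to_query</tt> are hypothetical names, and the backoff
simply reuses the "failures x 10 seconds, capped at 60" shape from the
log message, tracked per master rather than per zone.<br>
<pre>
```python
import time


class MasterState:
    """Failure bookkeeping for one master of one zone (hypothetical)."""

    def __init__(self):
        self.failures = 0
        self.skip_until = 0.0

    def record_failure(self, now):
        self.failures += 1
        # Back off this master only: failures x 10 s, capped at 60 s.
        self.skip_until = now + min(self.failures * 10, 60)

    def record_success(self):
        self.failures = 0
        self.skip_until = 0.0


def masters_to_query(masters, states, now, notify_from=None):
    """Return the masters that should receive an SOA query, in order."""
    if notify_from in masters:
        # A NOTIFY identifies the master that has the new data: ask it,
        # regardless of any backoff on the other masters.
        return [notify_from]
    # Refresh poll or zone creation: ask every master not backed off.
    return [m for m in masters if now >= states[m].skip_until]


masters = ["10.200.200.109", "10.201.201.109"]
states = {m: MasterState() for m in masters}
now = time.time()

# The non-existent master times out once.
states["10.201.201.109"].record_failure(now)

after_notify = masters_to_query(masters, states, now,
                                notify_from="10.200.200.109")
on_refresh = masters_to_query(masters, states, now)
print(after_notify)
print(on_refresh)
```
</pre>
With this bookkeeping, a timeout from 10.201.201.109 never suspends
checks against 10.200.200.109, and a NOTIFY is always answered by
querying its sender.<br>
<br>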
Is multiple master support <i>really</i> that badly broken, or am I
missing something? Should I go to v4.2 and would it help? (I'm not a
huge fan of using the latest bleeding edge, but this is <i>almost</i>
a show-stopper for using pdns in our application.)<br>
<br>
<br>
I could do something utterly disgusting like externally monitoring
the availability of the masters and updating the DNS and removing
the non-functional masters from all zones and re-adding them when
they become available. But really, pdns should be doing this
automatically, and anyway, that doesn't solve the problem of
querying the other master before it receives its update.<br>
<br>
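For what it's worth, the external workaround described above might
look roughly like this. It is a sketch under several assumptions: the
liveness probe is just a TCP connect to port 53 (not a real SOA
query), the function names <tt>tcp_reachable</tt> and
<tt>set_live_masters</tt> are hypothetical, and the full list of
configured masters has to be kept somewhere outside the
<tt>domains</tt> table, because the <tt>master</tt> column is
overwritten. The demo runs against an in-memory database with a fake
probe, so it touches neither the network nor
/var/pdns/pdns.sqlite3.<br>
<pre>
```python
import socket
import sqlite3


def tcp_reachable(host, port=53, timeout=2.0):
    """Crude liveness probe: can a TCP connection to the DNS port open?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def set_live_masters(conn, zone, configured, probe=tcp_reachable):
    """Rewrite domains.master for zone to the configured masters that answer.

    The row is left untouched if no master answers, so the zone never
    ends up with an empty master list.
    """
    live = [m for m in configured if probe(m)]
    if live:
        conn.execute(
            "update domains set master = ? where name = ?",
            (", ".join(live), zone),
        )
        conn.commit()
    return live


# Demo: in-memory table with only the columns used in the example above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table domains "
    "(id integer primary key, name text, master text, type text)"
)
configured = ["10.200.200.109", "10.201.201.109"]
conn.execute(
    "insert into domains (name, master, type) values (?, ?, 'SLAVE')",
    ("foo.example", ", ".join(configured)),
)

# Fake probe standing in for tcp_reachable: only the first master is up.
live = set_live_masters(conn, "foo.example", configured,
                        probe=lambda host: host == "10.200.200.109")
row = conn.execute(
    "select master from domains where name = 'foo.example'"
).fetchone()
print(live, row[0])
```
</pre>
<br>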
<br>
-- don<br>
<br>
<br>
<br>
<br>
<br>
<br>
<div class="moz-cite-prefix">On 15/12/18 10:10 PM, Brian Candler
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:b214a274-5df6-e704-c361-bc64dcc0d7f2@pobox.com">On
15/12/2018 08:48, Don Stokes wrote:
<br>
<blockquote type="cite">
<br>
This is with the latest Centos 7 RPMs on the 4.1 branch.
<br>
</blockquote>
For the benefit of anyone looking at the list archive in future: I
*think* the OP is talking about version 4.1.5.
<br>
</blockquote>
<br>
<div class="moz-signature">-- <br>
Don Stokes, <a href="mailto:don@nz.net">don@nz.net</a>, 021 796
072</div>
</body>
</html>