[Pdns-users] Multiple masters
Don Stokes
don at nz.net
Sun Dec 16 23:22:22 UTC 2018
Hi Brian, all,
Now that I'm in front of the machine in question ... yes, specifically:
# rpm -qa | grep pdns
pdns-4.1.5-1pdns.el7.x86_64
pdns-backend-sqlite-4.1.5-1pdns.el7.x86_64
Here's an example of the behaviour I'm seeing. First I'll set up a test
domain:
# systemctl start pdns
# sqlite3 /var/pdns/pdns.sqlite3 <<'EOF'
begin;
insert into domains (name, master, type) values (
'foo.example',
'10.200.200.109, 10.201.201.109',
'SLAVE');
insert into domainmetadata (domain_id, kind, content) values (
(select id from domains where name='foo.example'),
'AXFR-MASTER-TSIG',
'lshaklnsm001-lshaklnss002');
commit;
EOF
Note that 10.200.200.109 is responding. 10.201.201.109 does not actually
exist and can therefore never respond. I've not enabled IXFR for this
example, but I have tried it with IXFR enabled and the behaviour is the same.
The domain loads correctly (but see below). Then, with a TCPdump and
tailing the log, I go to the BIND master, bump the serial and reload. I
see this (the pdns log messages are the lines beginning with "Dec 17"):
11:05:53.303682 IP (tos 0x0, ttl 64, id 19847, offset 0, flags
[none], proto UDP (17), length 231)
10.200.200.109.10157 > 10.200.200.111.domain: [udp sum ok]
46400 notify [b2&3=0x2400] [1a] [1au] SOA? foo.example. foo.example.
[0s] SOA ns1.foo.example. soa.foo.example. 4 3600 600 3600000 300
ar: lshaklnsm001-lshaklnss002. ANY [0s] TSIG hmac-sha512. fudge=300
maclen=64 origid=46400 error=0 otherlen=0 (203)
11:05:53.304484 IP (tos 0x0, ttl 64, id 32302, offset 0, flags [DF],
proto UDP (17), length 187)
10.200.200.111.domain > 10.200.200.109.10157: [bad udp cksum
0xa725 -> 0x3175!] 46400 notify*- q: SOA? foo.example. 0/0/1 ar:
lshaklnsm001-lshaklnss002. ANY [0s] TSIG hmac-sha512. fudge=300
maclen=64 origid=46400 error=0 otherlen=0 (159)
Dec 17 11:05:53 LSHAKLNSS002 pdns[28646]: Received secure NOTIFY for
foo.example from 10.200.200.109, allowed by TSIG key
'lshaklnsm001-lshaklnss002'
11:05:53.981005 IP (tos 0x0, ttl 64, id 18800, offset 0, flags
[DF], proto UDP (17), length 198)
10.200.200.111.15760 > 10.201.201.109.domain: [bad udp cksum
0xa831 -> 0x4a0e!] 9591 [2au] SOA? foo.example. ar: . OPT
UDPsize=2800 DO, lshaklnsm001-lshaklnss002. ANY [0s] TSIG
hmac-sha512. fudge=300 maclen=64 origid=9591 error=0 otherlen=0 (170)
Dec 17 11:05:53 LSHAKLNSS002 pdns[28646]: 1 slave domain needs
checking, 0 queued for AXFR
Dec 17 11:05:56 LSHAKLNSS002 pdns[28646]: Received serial number
updates for 0 zones, had 1 timeout
Dec 17 11:05:56 LSHAKLNSS002 pdns[28646]: Unable to retrieve SOA for
foo.example, this was the first time. NOTE: For every subsequent
failed SOA check the domain will be suspended from freshness checks
for 'num-errors x 10 seconds', with a maximum of 60 seconds.
Skipping SOA checks until 1544997966
That is, I see the NOTIFY from the functional master, which pdns
accepts and answers correctly (ignore the bad checksums, that's just an
artefact of hardware IP checksum offload). It then immediately requests
an SOA from the /non-functional/ master and complains that it never got
a response. And that's the last we hear from it until the next refresh
interval. There's absolutely no attempt to query the other (functional)
master, or otherwise act on the (TSIG-signed) NOTIFY.
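For anyone reading the log above later, the suspension arithmetic can be checked by hand. This is only a minimal Python sketch: it assumes the rule is exactly what the log message states ('num-errors x 10 seconds', capped at 60) and that the syslog timestamps are local time at UTC+13. `suspension_seconds` is a made-up helper name, not a PowerDNS function.

```python
from datetime import datetime, timedelta, timezone


def suspension_seconds(num_errors: int) -> int:
    """Back-off as worded in the log message: num-errors x 10 s, max 60 s."""
    return min(num_errors * 10, 60)


# Assumption: the syslog timestamps are local time at UTC+13.
LOCAL = timezone(timedelta(hours=13))

# Time of the "Unable to retrieve SOA ... first time" message.
failed_at = datetime(2018, 12, 17, 11, 5, 56, tzinfo=LOCAL)
resume_at = failed_at + timedelta(seconds=suspension_seconds(1))

# Epoch value from the log line "Skipping SOA checks until 1544997966".
logged = datetime.fromtimestamp(1544997966, tz=LOCAL)

print(resume_at.isoformat(), logged.isoformat())
```

Under that time-zone assumption the two values agree: the first failure suspends checks for 10 seconds, until 11:06:06 local time.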
Note that I see this behaviour when initially loading the domain as
well. It seems to be a coin toss as to which master it queries, when
really, it should be querying both every time, and tracking failures on
a per master (or per master per domain) basis.
Pulling the SOA refresh and retry numbers down seems to help a bit (the
retry alone wasn't sufficient, which surprises me), but I still don't
get an immediate response, and it costs more useless SOA query traffic.
Also, it often seems to get stuck on querying the non-functional master
for extended periods, and seems to pretty much always go to the
non-functional master after a NOTIFY from the functional master. In
short, it seems to do multiple master support wrong in every possible
way. Notably:
* it should always query the master that sent it a NOTIFY;
* it should always query both masters on a refresh poll or on zone
creation;
* it should maintain a per-master (or per master per domain) retry
algorithm on failure to respond; and
* it should have separate TSIG keys per master (or per master per domain).
As far as I can see, it does none of those things. Just randomly polls
one of the masters each time, and treats any failure to respond as a
failure of all listed masters. Which seems a bit pointless when you're
using multiple masters to protect against a master becoming unavailable.
Polling the /other/ master after a NOTIFY is very wrong, because the
other master may not have the update that triggered the NOTIFY yet, and
the poll will reset the refresh timer. (Even if it doesn't, the notify
is essentially being ignored.)
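To make the four points listed above concrete, here is a rough Python sketch of the bookkeeping I have in mind. It is not PowerDNS code and does not describe how pdns works internally; the class and method names are invented, the second TSIG key name in the usage note is a placeholder, and the back-off rule (failures x 10 seconds, capped at 60) is simply borrowed from the log message for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MasterState:
    tsig_key: str               # separate TSIG key per master
    failures: int = 0           # consecutive failed SOA checks
    retry_after: float = 0.0    # epoch seconds; skip this master until then


@dataclass
class SlaveZone:
    name: str
    masters: Dict[str, MasterState] = field(default_factory=dict)

    def targets_for_notify(self, sender: str) -> List[str]:
        """After a NOTIFY, query the master that sent it, and only that one."""
        return [sender] if sender in self.masters else []

    def targets_for_refresh(self, now: float) -> List[str]:
        """On a refresh poll or zone creation, query every master
        that is not currently backed off."""
        return [m for m, st in self.masters.items() if st.retry_after <= now]

    def record_failure(self, master: str, now: float) -> None:
        """Back off this master only; the others are unaffected."""
        st = self.masters[master]
        st.failures += 1
        st.retry_after = now + min(st.failures * 10, 60)

    def record_success(self, master: str) -> None:
        """A good answer clears the failure history for that master."""
        st = self.masters[master]
        st.failures = 0
        st.retry_after = 0.0
```

With something like `SlaveZone("foo.example", {"10.200.200.109": MasterState("lshaklnsm001-lshaklnss002"), "10.201.201.109": MasterState("placeholder-key")})`, a timeout from 10.201.201.109 would only delay the next query to 10.201.201.109, and a NOTIFY from 10.200.200.109 would always be followed up with 10.200.200.109 itself.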
Is multiple master support /really/ that badly broken, or am I missing
something? Should I go to v4.2 and would it help? (I'm not a huge fan of
using the latest bleeding edge, but this is /almost/ a show-stopper for
using pdns in our application.)
I could do something utterly disgusting like externally monitoring the
availability of the masters and updating the DNS and removing the
non-functional masters from all zones and re-adding them when they
become available. But really, pdns should be doing this automatically,
and anyway, that doesn't solve the problem of querying the other master
before it receives its update.
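For the record, the workaround would look something like the following Python sketch. It assumes only the `domains` columns used in the example earlier in this mail (`name`, `master`, `type`); `rewrite_masters`, the `configured` mapping and the `is_alive` probe are hypothetical names, and how a master is actually probed is left to the caller.

```python
import sqlite3
from typing import Callable, Dict, List


def rewrite_masters(conn: sqlite3.Connection,
                    configured: Dict[str, List[str]],
                    is_alive: Callable[[str], bool]) -> None:
    """Rewrite domains.master so each slave zone lists only reachable masters.

    `configured` maps zone name -> full list of masters, kept outside the
    database so that masters can be re-added once they come back.
    """
    for zone, masters in configured.items():
        alive = [m for m in masters if is_alive(m)]
        # If nothing answers, keep the full list rather than leave
        # the zone with no master at all.
        chosen = alive or masters
        conn.execute(
            "UPDATE domains SET master = ? WHERE name = ? AND type = 'SLAVE'",
            (", ".join(chosen), zone),
        )
    conn.commit()
```

It would have to run from cron or similar against the pdns database, which is exactly the sort of external machinery I'd rather not need.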
-- don
On 15/12/18 10:10 PM, Brian Candler wrote:
> On 15/12/2018 08:48, Don Stokes wrote:
>>
>> This is with the latest Centos 7 RPMs on the 4.1 branch.
> For the benefit of anyone looking at the list archive in future: I
> *think* the OP is talking about version 4.1.5.
--
Don Stokes, don at nz.net, 021 796 072