[Pdns-dev] PDNS Recursor functionality request re:SERVFAIL outages of today
John Todd
jtodd at loligo.com
Fri Oct 21 23:53:35 UTC 2016
As most of you know by now, today’s DynDNS outage due to a DDoS attack
caused fairly widespread outages across a large number of domains.
Authoritative nameservers seem to be a particularly interesting target for
attackers, as they are often smaller in scope (IP address range, transit
size of the authoritative nameserver networks) than a full service offering
by a provider of multiple other services like HTTP. It seems that there may
be some reasonable ways to respond to outages like this which at a
minimum will result in failures that are less “bad” than having no
replies at all, and which can be implemented by DNS recursors.
I’d like to propose an extension to PowerDNS Recursor for partially
mitigating events like today’s, in which major authoritative
nameservers were put out of commission. This might be a particularly
foolish or error-prone method - it only took me a few minutes to think
up. But I’d at least like to hear a discussion as to why this isn’t
a good idea. The comment of “But this might end up giving out the
wrong answer!” is true, but I view a wrong answer as better than no
answer. What would a domain operator USUALLY want to get? They’d want
to get the inbound connection, rather than having users completely
offline. This seems to be particularly valuable for TLD and other
low-churn zones which may come under attack for various political
reasons but which contain a significant number of NS records.
Having done plenty of OSS work, I’m sure the next comment will be
“patches welcome.” ;-) I would be happy to pay someone a small amount
to write this, but I have little budget, high hopes, and no coders on
staff at this level yet; otherwise I would do just that.
PowerDNS Recursor proposed feature extensions:
servfail-ttl-override
* Integer
* Default: 180
The recursor keeps all records for this number of seconds after TTL
expiration. Once the authoritative-provided TTL has expired, a lookup
for the query is performed in the normal way. If that lookup fails with
a SERVFAIL, the TTL timer on the “old” record is set back to
zero and the “old” record is provided as the response. If an
authoritative server is marked as “down” due to repeated SERVFAIL
responses (see packetcache-servfail-ttl), the “old” record is
handed back immediately without a new query attempt, and the TTL timer
is set back to zero. This keeps the answer in a state of perpetual
validity as long as queries keep arriving within the
servfail-ttl-override interval and the authoritative server keeps
returning SERVFAIL. (packetcache-servfail-ttl is on a rotating timer and
will retry every X seconds, so a single query is delayed during each
retry attempt while other queries are answered immediately with
the “old” answer.) An NXDOMAIN response from an authoritative server
clears “old” records from memory immediately.
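To make the proposed lookup flow concrete, here is a rough Python sketch of how I read it. It is a toy model, not PowerDNS code: the StaleCache class, the resolve callback and its (rcode, value, ttl) return shape are all made up for illustration, “set back to zero” is interpreted as restarting the override window from the current time, and the “server marked down” fast path is left out.

```python
import time
from dataclasses import dataclass

SERVFAIL_TTL_OVERRIDE = 180  # proposed default, in seconds


@dataclass
class Entry:
    value: str
    expires_at: float  # moment the authoritative TTL runs out


class StaleCache:
    """Toy model of the proposed behaviour; not PowerDNS code."""

    def __init__(self, resolve, override=SERVFAIL_TTL_OVERRIDE,
                 clock=time.monotonic):
        self.resolve = resolve    # hypothetical upstream: name -> (rcode, value, ttl)
        self.override = override  # seconds an expired record may still be served
        self.clock = clock
        self.entries = {}

    def lookup(self, name):
        now = self.clock()
        entry = self.entries.get(name)
        if entry and now < entry.expires_at:
            return "NOERROR", entry.value  # still fresh, no upstream query

        rcode, value, ttl = self.resolve(name)
        if rcode == "NOERROR":
            self.entries[name] = Entry(value, now + ttl)
            return rcode, value
        if rcode == "NXDOMAIN":
            self.entries.pop(name, None)   # explicit "gone": drop the old record
            return rcode, None

        # SERVFAIL: serve the old record if it is inside the override window.
        if entry and now < entry.expires_at + self.override:
            entry.expires_at = now         # restart the window ("timer back to zero")
            return "NOERROR", entry.value
        self.entries.pop(name, None)
        return "SERVFAIL", None
```

Because every stale hit restarts the window, a name that keeps getting queried stays answerable for as long as the upstream keeps failing, while a name nobody asks about ages out after servfail-ttl-override seconds.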
This timer method is useful in situations where authoritative
nameservers are being DDoS’ed and cannot provide responses, with the
intent that some answer is better than no answer. If a domain operator
wishes to stop traffic to their site, then replies with NXDOMAIN negate
this behavior. Only a nameserver being unreachable will result in this
cache being used as a last resort, and a timer limits how long these
old records are kept. With a low value, heavily trafficked names will
typically still resolve even if the authoritative nameservers are
unreachable due to attack or network disconnect, but less frequently
queried domains may be removed from the cache, leading to query
failures. Setting this value high may lead to unexpected results for
infrequently used domains which have dynamic results.
servfail-ttl-override-domain-exceptions
* Domains, comma separated
List of domains for which the servfail-ttl-override method is never used.
servfail-ttl-override-server-exceptions
* IP addresses, comma separated
List of authoritative servers for which the servfail-ttl-override
method is never used.
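The two exception lists could be checked along the lines of the Python sketch below. The function names are invented for illustration, and treating a listed domain as also covering its subdomains is my assumption; the proposal does not say whether matching should be exact or by suffix.

```python
import ipaddress


def parse_list(raw):
    """Split a comma-separated setting into trimmed, non-empty items."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def domain_exempt(qname, raw_domains):
    """True if qname equals, or is a subdomain of, any listed domain.

    Subdomain matching is an assumption, not part of the proposal text.
    """
    name = qname.rstrip(".").lower()
    for domain in parse_list(raw_domains):
        domain = domain.rstrip(".").lower()
        if name == domain or name.endswith("." + domain):
            return True
    return False


def server_exempt(server_ip, raw_servers):
    """True if the authoritative server's address is on the list."""
    addr = ipaddress.ip_address(server_ip)
    return any(addr == ipaddress.ip_address(s) for s in parse_list(raw_servers))
```

A lookup would only fall back to the “old” record when neither check returns True for the query name and for the authoritative server that failed.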
JT