[Pdns-dev] PDNS Recursor functionality request re:SERVFAIL outages of today
John Todd
jtodd at loligo.com
Sat Oct 22 16:51:23 UTC 2016
On 21 Oct 2016, at 16:53, John Todd wrote:
>
> As most of you know by now, today’s DynDNS outage due to DDoS attack
> caused fairly widespread outages across a large number of domains.
> Authoritative nameservers seem to be a particularly interesting target
> for attackers as they are often smaller in scope (IP address range,
> transit size of authoritative nameserver networks) than a full service
> offering by a provider of multiple other services like HTTP. It seems
> that there may be some reasonable ways to respond to outages like this
> which at a minimum will result in failures that are less “bad”
> than having no replies at all, and which can be implemented by DNS
> recursors.
>
> I’d like to propose an extension to PowerDNS Recursor for mitigating
> (partially) events like we had today where major authoritative
> nameservers were put out of commission. This might be a particularly
> foolish or error-prone method - it only took me a few minutes to think
> up. But I’d at least like to hear a discussion as to why this
> isn’t a good idea. The comment of “But this might end up giving
> out the wrong answer!” is true, but I view a wrong answer as better
> than no answer. What would a domain operator USUALLY want to get?
> They’d want to get the inbound connection, rather than having users
> completely offline. This seems to be particularly valuable for TLD and
> other low-churn zones which may come under attack for various
> political reasons but which contain a significant number of NS
> records.
>
> Having done plenty of OSS work, I’m sure the next comment will be
> “patches welcome.” ;-) I would be happy to pay some small amount
> of dollars to someone to write this, but I have little budget, high
> hopes, and no coders on staff at this level yet; otherwise I would do
> just that.
>
> PowerDNS Recursor proposed feature extensions:
>
> servfail-ttl-override
> * Integer
> * Default: 180
>
> The recursor keeps all records for this amount of seconds after TTL
> expiration. If the authoritative-provided TTL has expired, then lookup
> is performed on the query in a normal way. If that query fails due to
> a SERVFAIL, then the TTL timer on this “old” record is set back to
> zero and the “old” record is provided as a response. If an
> authoritative server is marked as “down” due to repeated SERVFAIL
> responses (see packetcache-servfail-ttl) then the “old” record is
> handed back immediately without a new query attempt, and the TTL timer
> is set back to zero to keep the answer in a state of perpetual
> validity as long as there are active queries occurring within the
> servfail-ttl-override interval and the authoritative server is
> returning SERVFAIL. (packetcache-servfail-ttl is on a rotating
> timer, and will retry every X seconds, leading to one single query
> being delayed during the next attempt cycle - other queries are
> immediately answered with the “old” record.) An NXDOMAIN
> response from an authoritative server clears “old” records in
> memory immediately.
> This timer method is useful in situations where authoritative
> nameservers are being DDoS’ed and cannot provide responses, with the
> intent that some answer is better than no answer. If a domain operator
> wishes to stop traffic to their site, then replies with NXDOMAIN
> negate this behavior. Only a nameserver being unreachable will result
> in this cache being used as a last resort, and there is a timer for
> maximum duration of these old records being kept. Setting this value
> low will mean that heavily trafficked websites will typically still
> resolve to a result even if the authoritative nameservers are
> unreachable due to attack or network disconnect, but less
> often-queried domains may be removed from the cache leading to query
> failures. Setting this value high may lead to unexpected results for
> infrequently-used domains which have dynamic results.
>
> servfail-ttl-override-domain-exceptions
> * Domains, comma separated
>
> List of domains on which we never use the servfail-ttl-override method.
>
> servfail-ttl-override-server-exceptions
> * IP addresses, comma separated
>
> List of authoritative servers on which we never use the
> servfail-ttl-override method.
>
> JT
>
After some thought in the shower this morning, I think I need to update
my original proposal. Instead of the refreshed timer being the TTL of
the original record, the new TTL should be set to be
packetcache-servfail-ttl. This means that a refreshed record will only
stay in the cache as long as the authoritative server is unreachable.
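
To make the logic easier to argue about, here is a minimal Python sketch of
the behaviour described above, including the updated TTL rule. It is only an
illustration under stated assumptions, not PowerDNS Recursor code: the
StaleCache class, the upstream callable and its return shape, and the value
60 for packetcache-servfail-ttl are all hypothetical, and the two exception
lists are left out for brevity.

```python
import time
from dataclasses import dataclass

# Hypothetical values; only the default of 180 comes from the proposal.
SERVFAIL_TTL_OVERRIDE = 180     # seconds a record is kept past TTL expiry
PACKETCACHE_SERVFAIL_TTL = 60   # assumed value, used as the refreshed TTL


@dataclass
class Entry:
    answer: str
    expires: float  # time at which the current TTL runs out


class StaleCache:
    """Toy cache that serves expired records while upstream is failing."""

    def __init__(self, upstream, clock=time.monotonic):
        # upstream is a hypothetical callable: name -> (rcode, answer, ttl)
        self.upstream = upstream
        self.clock = clock
        self.entries = {}

    def resolve(self, name):
        now = self.clock()
        entry = self.entries.get(name)
        if entry and now < entry.expires:
            return entry.answer  # TTL still valid, no upstream query

        rcode, answer, ttl = self.upstream(name)
        if rcode == "NOERROR":
            self.entries[name] = Entry(answer, now + ttl)
            return answer
        if rcode == "NXDOMAIN":
            # The operator says the name is gone: drop the old record.
            self.entries.pop(name, None)
            return None

        # SERVFAIL or unreachable: fall back to the old record if it is
        # still inside the override window, and refresh it for one
        # packetcache-servfail-ttl interval (the updated proposal).
        if entry and now < entry.expires + SERVFAIL_TTL_OVERRIDE:
            entry.expires = now + PACKETCACHE_SERVFAIL_TTL
            return entry.answer
        self.entries.pop(name, None)
        return None
```

With this shape, a name that keeps being queried stays answerable for as long
as the upstream keeps failing, while a name nobody asks about for more than
servfail-ttl-override seconds past its last expiry falls out of the cache.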
JT