[Pdns-dev] PDNS Recursor functionality request re:SERVFAIL outages of today
    John Todd 
    jtodd at loligo.com
       
    Sat Oct 22 16:51:23 UTC 2016
    
    
  
On 21 Oct 2016, at 16:53, John Todd wrote:
>
> As most of you know by now, today’s DynDNS outage, caused by a DDoS
> attack, led to fairly widespread outages across a large number of
> domains. Authoritative nameservers seem to be a particularly
> interesting target for attackers, as they are often smaller in scope
> (IP address range, transit size of the authoritative nameserver
> networks) than a full-service offering from a provider of multiple
> other services like HTTP. It seems that there may be some reasonable
> ways for DNS recursors to respond to outages like this which, at a
> minimum, will result in failures that are less “bad” than having no
> replies at all.
>
> I’d like to propose an extension to PowerDNS Recursor for mitigating 
> (partially) events like we had today where major authoritative 
> nameservers were put out of commission. This might be a particularly 
> foolish or error-prone method - it only took me a few minutes to think 
> up. But I’d at least like to hear a discussion as to why this 
> isn’t a good idea. The comment of “But this might end up giving 
> out the wrong answer!” is true, but I view a wrong answer as better 
> than no answer. What would a domain operator USUALLY want to get? 
> They’d want to get the inbound connection, rather than having users 
> completely offline. This seems to be particularly valuable for TLD and 
> other low-churn zones which may come under attack for various 
> political reasons but which contain a significant number of NS 
> records.
>
> Having done plenty of OSS work, I’m sure the next comment will be 
> “patches welcome.” ;-) I would be happy to pay some small amount 
> of dollars to someone to write this, but I have little budget, high
> hopes, and no coders on staff at this level yet; otherwise I would do
> just that.
>
> PowerDNS Recursor proposed feature extensions:
>
> servfail-ttl-override
> * Integer
> * Default: 180
>
> The recursor keeps all records for this many seconds after TTL
> expiration. If the authoritative-provided TTL has expired, then a
> lookup is performed for the query in the normal way. If that lookup
> fails with a SERVFAIL, then the TTL timer on the “old” record is
> reset to zero and the “old” record is provided as the response. If an
> authoritative server is marked as “down” due to repeated SERVFAIL
> responses (see packetcache-servfail-ttl) then the “old” record is
> handed back immediately without a new query attempt, and the TTL timer
> is reset to zero. This keeps the answer in a state of perpetual
> validity as long as there are active queries occurring within the
> servfail-ttl-override interval and the authoritative server keeps
> returning SERVFAIL. (packetcache-servfail-ttl is on a rotating
> timer and will retry every X seconds, so a single query is delayed
> during each retry cycle while other queries are answered immediately
> with the “old” answer.) An NXDOMAIN response from an authoritative
> server clears “old” records from memory immediately.
>
> This timer method is useful in situations where authoritative
> nameservers are being DDoS’ed and cannot provide responses, the
> intent being that some answer is better than no answer. If a domain
> operator wishes to stop traffic to their site, replying with NXDOMAIN
> negates this behavior. Only an unreachable nameserver will result in
> this cache being used as a last resort, and there is a timer for the
> maximum duration that these old records are kept. Setting this value
> low will mean that queries for heavily trafficked websites will
> typically still get a result even if the authoritative nameservers
> are unreachable due to attack or network disconnect, but less
> frequently queried domains may be removed from the cache, leading to
> query failures. Setting this value high may lead to unexpected
> results for infrequently used domains which have dynamic results.
>
> servfail-ttl-override-domain-exceptions
> * Domains, comma separated
>
> List of domains for which the servfail-ttl-override method is never
> used.
>
> servfail-ttl-override-server-exceptions
> * IP addresses, comma separated
>
> List of authoritative servers for which the servfail-ttl-override
> method is never used.
>
> JT
>
After some thought in the shower this morning, I think I need to update
my original proposal. Instead of the refreshed timer being the TTL of
the original record, the new TTL should be set to
packetcache-servfail-ttl. This means that a refreshed record will only
stay in the cache as long as the authoritative server is unreachable.
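
To make the revised behavior concrete, here is a rough sketch in Python
(not Recursor code, and not a patch). It assumes a toy cache keyed by
name only. The names StaleCache, Entry, Rcode and the upstream callable
are all made up for illustration, the value of 60 for
packetcache-servfail-ttl is just a placeholder, and the per-server
exception list is left out to keep it short.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Iterable, Optional, Tuple


class Rcode(Enum):
    NOERROR = 0
    SERVFAIL = 2
    NXDOMAIN = 3


@dataclass
class Entry:
    answer: str
    expires_at: float  # time at which the current TTL runs out


# Hypothetical upstream lookup: name -> (rcode, answer, ttl in seconds).
Upstream = Callable[[str], Tuple[Rcode, Optional[str], int]]


class StaleCache:
    """Toy model of the proposed servfail-ttl-override behavior."""

    def __init__(self, upstream: Upstream,
                 servfail_ttl_override: int = 180,
                 packetcache_servfail_ttl: int = 60,  # placeholder value
                 domain_exceptions: Iterable[str] = ()):
        self.upstream = upstream
        self.servfail_ttl_override = servfail_ttl_override
        self.packetcache_servfail_ttl = packetcache_servfail_ttl
        self.domain_exceptions = [d.rstrip(".").lower() for d in domain_exceptions]
        self.cache: Dict[str, Entry] = {}

    def _excepted(self, name: str) -> bool:
        # True if name is, or sits under, a domain on the exception list.
        name = name.rstrip(".").lower()
        return any(name == d or name.endswith("." + d)
                   for d in self.domain_exceptions)

    def resolve(self, name: str, now: float) -> Tuple[Rcode, Optional[str]]:
        entry = self.cache.get(name)
        if entry is not None and now < entry.expires_at:
            # Still within its TTL (original or refreshed): no upstream query.
            return Rcode.NOERROR, entry.answer

        rcode, answer, ttl = self.upstream(name)
        if rcode is Rcode.NOERROR:
            self.cache[name] = Entry(answer, now + ttl)
            return rcode, answer
        if rcode is Rcode.NXDOMAIN:
            # NXDOMAIN clears any "old" record immediately.
            self.cache.pop(name, None)
            return rcode, None

        # SERVFAIL: fall back to the "old" record if it is within the
        # override window and the domain is not on the exception list.
        in_grace = (entry is not None and
                    now < entry.expires_at + self.servfail_ttl_override)
        if in_grace and not self._excepted(name):
            # Revised proposal: refresh with packetcache-servfail-ttl,
            # not with the original record's TTL.
            entry.expires_at = now + self.packetcache_servfail_ttl
            return Rcode.NOERROR, entry.answer

        self.cache.pop(name, None)
        return Rcode.SERVFAIL, None
```

In this sketch the "server marked down" case needs no separate state:
once a record has been refreshed, it counts as valid for
packetcache-servfail-ttl seconds, so queries in that window are answered
without touching the upstream, and the first query after the window
triggers the retry.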
JT
    
    