[Pdns-users] Odd Recursor problems

Fri Jan 20 19:51:55 UTC 2012

Hello all,

We're having some odd intermittent problems with our recursors, and I'm
not sure whether I should be concerned about them.  Intermittently, when
we query our recursors for a name that is a CNAME record, we don't get a
proper response.  I am going to be detailed about the problem, so this
will be a long message, and I apologize in advance for that.  However,
I've about reached my wits' end trying to diagnose this issue.

The problem began when we started getting reports from our clients that
their CSS files were intermittently not loading.  CSS files are stored
with static images on the Edgecast and Level 3 CDN systems, and
troubleshooting the chain led us to do a bunch of DNS tests, which is
where things started getting suspicious.

We're running 6 recursors, all behind a Foundry load balancer, with
virtual IPs funneling traffic from on-site machines to the recursors.
All recursors are running the x86_64 RPM of pdns-recursor 3.3, downloaded
directly from the web site, on CentOS 5.x.  Until now we haven't seen
any issues with this setup, and it's been in production for over 3
years.

Edgecast/Level 3 have us set up the CDN by creating a CNAME record which
points at their systems, e.g.

  cdn.domain.com 43200 IN CNAME wpc.1737.edgecastcdn.net.

As part of our troubleshooting, we set up a number of checks in our
Nagios monitoring software to monitor the resolution of these entries.
Using the Nagios "check_dig" plugin, we run resolution checks against
all 6 of our DNS servers once per minute.  Essentially, we have the
plugin running these commands every minute:

  dig @{nameserver-ip} any cdn.domain.com
  dig @{nameserver-ip} a cdn.domain.com
  dig @{nameserver-ip} cname cdn.domain.com
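
For anyone who wants to reproduce the checks outside Nagios, here is a
rough shell sketch of the same loop.  The function names, the +tries and
+time values, and the log line format are placeholders of mine, not
anything check_dig itself does.

```shell
#!/bin/sh
# Sketch only: function names, timeouts, and log format are assumptions.

# has_answer: read dig output on stdin; succeed if it has an answer section.
has_answer() {
    grep -q '^;; ANSWER SECTION:'
}

# check_once NAME SERVER...: run the three query types against each server
# and print one timestamped line for every query with no answer section.
check_once() {
    name=$1
    shift
    for ns in "$@"; do
        for qtype in any a cname; do
            if ! dig +tries=1 +time=5 "@$ns" "$qtype" "$name" | has_answer; then
                echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') FAIL ns=$ns type=$qtype name=$name"
            fi
        done
    done
}
```

It would be called once a minute (from cron, say) as
"check_once cdn.domain.com {nameserver-ip} ...", so that failures can be
lined up by timestamp across all 6 servers.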

With these tests in place and firing every minute, we see intermittent
failures (No ANSWER SECTION found) when querying our recursors for A or
ANY, never for CNAME.  When a check fails, the next check one minute
later passes.  We have a couple of machines that run their own BIND
caching nameserver, and the same tests against them show no issues.
Also, as a test, we set up a dummy record with a CNAME pointing to a
host on a totally separate, lightly used authoritative server, and those
tests have never shown failures either.

The failures appear to be totally random - you might see 2 or 3 failures 
within 15 minutes, and then you might not see another failure for over an
hour.

The syslogs for the recursors also show nothing out of the ordinary.

Right now, I am working on the theory that the recursor occasionally
does not get a timely response from the Edgecast/Level 3 authoritative
servers, and is therefore failing.  However, it does seem odd that I
wouldn't see the problem with our standalone BIND servers.  One other
thing I have done for testing is to disable load-balanced traffic to one
of our 6 nameservers and turn on the recursor trace mode on that
nameserver.  However, even with only a few checks every minute addressed
to it, piecing together the trace logs is still not easy.
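
A possibly simpler way to test the slow-response theory, without the
trace logs, would be to time queries sent straight to the CDN's
authoritative servers.  This is only a sketch: the function names, the
2-second timeout, and the output format are assumptions of mine, and the
authoritative server addresses would have to be filled in.

```shell
#!/bin/sh
# Sketch only: function names, timeout, and output format are assumptions.

# query_time_ms: read dig output on stdin; print the reported query time.
query_time_ms() {
    sed -n 's/^;; Query time: \([0-9][0-9]*\) msec.*$/\1/p'
}

# time_auth NAME SERVER...: ask each server directly, without recursion,
# and print the query time, or "none" when dig reported no query time
# (for example, because the query timed out).
time_auth() {
    name=$1
    shift
    for ns in "$@"; do
        ms=$(dig +norecurse +tries=1 +time=2 "@$ns" a "$name" | query_time_ms)
        echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') ns=$ns ms=${ms:-none}"
    done
}
```

If lines with ms=none or unusually large times show up in the same
minutes as the check failures, that would support the theory; if they
never do, the problem is more likely on the recursor side.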

Does anyone else have any thoughts on this?

Thanks for any assistance you can give me!

Jeremy Utley