[Pdns-users] Cache Problems with upgrade to Recursor 3.3

Wed Dec 1 18:40:40 UTC 2010

Good afternoon,

We've been working on upgrading our recursors from 
pdns-recursor-3.1.7.1-1 to pdns-recursor-3.3-1, and have seen some 
oddities I wanted to ask the list about.  First, a basic rundown of our 
environment:

Our existing production servers are running pdns-recursor-3.1.7.1-1 
installed via RPMs downloaded from your website.  The recursor itself is 
run within a Xen PV virtual machine on a CentOS 5.5 base.  To ensure we 
utilize all 4 cores of the processors in those machines, 2 instances of 
the recursor are launched simultaneously, listening on different IP 
addresses, and we utilize the fork option.  We have a total of 6 
machines configured this way, behind a Foundry load balancer which 
handles sharing the load between them.  This implementation has been in 
place for about a year with no issues.  We also use Cacti graphs for 
collecting performance data, by extending SNMP with output from the 
rec_control command.

The new test server is pdns-recursor-3.3-1 installed via RPM downloaded 
from your website, and also running within a Xen PV virtual machine on a 
CentOS 5.5 base.  Rather than launching multiple instances, we are 
launching 4 recursor threads (machines have 4 CPU cores).  Most other 
settings are configured identically between old and new servers.  This 
test server was added to the load balancer on Monday afternoon, taking a 
fraction of the traffic that would have gone to the 6 old machines.

The problem I'm seeing is the caching does not seem to be working 
properly, which is causing a performance hit.  To document this effect, 
the following graph images were taken a little while ago from our Cacti 
installation:

http://www.jutley.org/DNS

Looking at the 4th graph down, which is the cache statistics on the old 
version recursor, you will see that around 90% of all questions are 
cache hits, with around 10% as cache misses.  And, looking at the third 
graph (showing how fast queries are answered), you'll see that over 90% 
of all queries are answered in less than 1 ms.

However, looking at the bottom graph, which is the cache statistics on 
the new recursor, the statistics are totally different.  Only 1.1% of 
the total questions are cache hits, while 6.8% are cache misses.  That 
makes no sense to me, since I would expect every question to be either 
a cache hit or a cache miss, yet the two together account for only 7.9% 
of questions.  And, looking at the 7th graph (answer speed on the new 
recursor version), most queries are taking more than 10 ms to answer.

Just as additional info, the data collected by Cacti to generate these 
graphs comes from the following command:

/usr/bin/rec_control get questions cache-entries cache-hits cache-misses \
concurrent-queries resource-limits unauthorized-tcp unauthorized-udp \
spoof-prevents answers-slow client-parse-errors answers0-1 answers1-10 \
answers10-100 answers100-1000 qa-latency
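
For anyone wanting to reproduce the numbers outside of Cacti, the 
following is a rough Python sketch of how that output can be turned into 
named counters.  It assumes rec_control prints one integer per requested 
statistic, in the order requested; parse_stats and fetch_stats are 
hypothetical helper names, not part of our actual SNMP script, and the 
sample values in the usage note are made up.

```python
import subprocess


def parse_stats(names, output):
    """Pair each requested statistic name with the value rec_control printed.

    Assumes one whitespace-separated integer per name, in request order.
    """
    values = output.split()
    if len(values) != len(names):
        raise ValueError(
            "expected %d values, got %d" % (len(names), len(values))
        )
    return dict(zip(names, (int(v) for v in values)))


def fetch_stats(names, rec_control="/usr/bin/rec_control"):
    """Run 'rec_control get <names...>' and return a name -> value dict."""
    result = subprocess.run(
        [rec_control, "get"] + list(names),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_stats(names, result.stdout)
```

With made-up output such as "1000\n11\n68\n" for questions, cache-hits 
and cache-misses, parse_stats returns a dictionary keyed by those three 
names, which makes it easy to compare hits plus misses against questions.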

Am I misinterpreting this, or is there definitely something going on?

Thanks for your time,

Jeremy