[Pdns-users] Cache Problems with upgrade to Recursor 3.3

Wed Dec 1 19:08:30 UTC 2010

On Wed, Dec 01, 2010 at 12:40:40PM -0600, Jeremy Utley wrote:
> Good afternoon,
>
> We've been working on upgrading our recursors from pdns-recursor-3.1.7.1-1 
> to pdns-recursor-3.3-1, and have seen some oddities I wanted to ask the 
> list about.  First, a basic rundown of our environment:
>
> Our existing production servers are running pdns-recursor-3.1.7.1-1 
> installed via RPMs downloaded from your website.  The recursor itself is 
> run within a Xen PV virtual machine on a CentOS 5.5 base.  To ensure we 
> utilize all 4 cores of the processors in those machines, 2 instances of the 
> recursor are launched simultaneously, listening on different IP addresses, 
> and we utilize the fork option.  We have a total of 6 machines configured 
> this way, behind a Foundry load balancer which handles sharing the load 
> between them.  This implementation has been in place for about a year with 
> no issues.  We also use Cacti graphs for collecting performance data, by 
> extending SNMP with output from the rec_control command.
>
> The new test server is pdns-recursor-3.3-1 installed via RPM downloaded 
> from your website, and also running within a Xen PV virtual machine on a 
> CentOS 5.5 base.  Rather than launching multiple instances, we are 
> launching 4 recursor threads (machines have 4 CPU cores).  Most other 
> settings are configured identically between old and new servers.  This test 
> server was added to the load balancer on Monday afternoon, taking a 
> fraction of the traffic that would have gone to the 6 old machines.
>
> The problem I'm seeing is the caching does not seem to be working properly, 
> which is causing a performance hit.  To document this effect, the following 
> graph images were taken a little while ago from our Cacti installation:
>
> http://www.jutley.org/DNS
>
> Looking at the 4th graph down, which is the cache statistics on the old 
> version recursor, you will see that around 90% of all questions are cache 
> hits, with around 10% as cache misses.  And, looking at the third graph 
> (showing how fast queries are answered), you'll see that over 90% of all 
> queries are answered in less than 1 ms.
>
> However, looking at the bottom graph, which is the cache statistics on the 
> new recursor, the statistics are totally different.  Only 1.1% of the total 
> questions are cache hits, while 6.8% are cache misses, which to me makes no 
> sense, since a question *HAS* to be either a cache hit or cache miss.  And, 
> looking at the 7th graph (answer speed on the new recursor version), most 
> queries are taking more than 10ms to answer.
>
> Just as additional info, the data collected by cacti to generate these 
> graphs comes from the following command:
>
> /usr/bin/rec_control get questions cache-entries cache-hits cache-misses 
> concurrent-queries resource-limits unauthorized-tcp unauthorized-udp 
> spoof-prevents answers-slow client-parse-errors answers0-1 answers1-10 
> answers10-100 answers100-1000 qa-latency
>
> Am I misinterpreting this, or is there something definitely going on?
>
> Thanks for your time,
>
> Jeremy

Hi Jeremy,

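You are not including the statistics for packetcache-hits and
packetcache-misses. If a query hits there, the recursor will not check
the regular cache at all, so that query shows up as neither a cache-hit
nor a cache-miss. I would bet that your packetcache-hits are pretty
substantial. Ours are almost 3X the cache-hits.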

Cheers,
Ken
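
A rough sketch of how the graphing script could fold the packet cache
into its percentages. It assumes rec_control prints one value per line
in the order the names were requested, and that packetcache-hits and
packetcache-misses are added to the list of names. The helper names
(parse_rec_control, hit_rates) are hypothetical, and the sample numbers
are invented to line up with the 1.1% / 6.8% figures quoted above; they
are not measured data.

```python
# Statistic names to request, in order (the two packetcache-* names are
# the ones missing from the original rec_control command line).
NAMES = ["questions", "packetcache-hits", "packetcache-misses",
         "cache-hits", "cache-misses"]

# Invented sample output, one value per line. On a live box this text
# would come from something like: rec_control get <names...>
SAMPLE = "1000\n921\n79\n11\n68\n"


def parse_rec_control(names, output):
    """Pair each requested name with the value on the matching line."""
    values = [int(token) for token in output.split()]
    if len(values) != len(names):
        raise ValueError("expected one value per requested statistic")
    return dict(zip(names, values))


def hit_rates(stats):
    """Express each counter as a percentage of all questions."""
    questions = stats["questions"]
    if questions == 0:
        raise ValueError("no questions recorded yet")
    packet_hits = stats["packetcache-hits"]
    record_hits = stats["cache-hits"]
    return {
        "packetcache_hit_pct": 100.0 * packet_hits / questions,
        "cache_hit_pct": 100.0 * record_hits / questions,
        "cache_miss_pct": 100.0 * stats["cache-misses"] / questions,
        # Answered without going out to the network: either cache hit.
        "answered_from_cache_pct":
            100.0 * (packet_hits + record_hits) / questions,
    }


if __name__ == "__main__":
    rates = hit_rates(parse_rec_control(NAMES, SAMPLE))
    for name, pct in rates.items():
        print(f"{name}: {pct:.1f}%")
```

With numbers shaped like these, the small cache-hits and cache-misses
percentages stop looking contradictory: they only describe the
questions that got past the packet cache.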