[Pdns-users] Cache Problems with upgrade to Recursor 3.3
Jeremy Utley
pdns at gammanetworking.com
Wed Dec 1 18:40:40 UTC 2010
Good afternoon,
We've been working on upgrading our recursors from
pdns-recursor-3.1.7.1-1 to pdns-recursor-3.3-1, and have seen some
oddities I wanted to ask the list about. First, a basic rundown of our
environment:
Our existing production servers are running pdns-recursor-3.1.7.1-1
installed via RPMs downloaded from your website. The recursor itself is
run within a Xen PV virtual machine on a CentOS 5.5 base. To ensure we
utilize all 4 cores of the processors in those machines, 2 instances of
the recursor are launched simultaneously, listening on different IP
addresses, and we utilize the fork option. We have a total of 6
machines configured this way, behind a Foundry load balancer which
handles sharing the load between them. This implementation has been in
place for about a year with no issues. We also use Cacti graphs for
collecting performance data, by extending SNMP with output from the
rec_control command.
The new test server is pdns-recursor-3.3-1 installed via RPM downloaded
from your website, and also running within a Xen PV virtual machine on a
CentOS 5.5 base. Rather than launching multiple instances, we are
launching 4 recursor threads (machines have 4 CPU cores). Most other
settings are configured identically between old and new servers. This
test server was added to the load balancer on Monday afternoon, taking a
fraction of the traffic that would have gone to the 6 old machines.
The problem I'm seeing is the caching does not seem to be working
properly, which is causing a performance hit. To document this effect,
the following graph images were taken a little while ago from our Cacti
installation:
http://www.jutley.org/DNS
Looking at the 4th graph down, which is the cache statistics on the old
version recursor, you will see that around 90% of all questions are
cache hits, with around 10% as cache misses. And, looking at the third
graph (showing how fast queries are answered), you'll see that over 90%
of all queries are answered in less than 1 ms.
However, looking at the bottom graph, which is the cache statistics on
the new recursor, the statistics are totally different. Only 1.1% of
the total questions are cache hits, while 6.8% are cache misses, which
to me makes no sense, since a question *HAS* to be either a cache hit or
cache miss. And, looking at the 7th graph (answer speed on the new
recursor version), most queries are taking more than 10ms to answer.
Just as additional info, the data collected by Cacti to generate these
graphs comes from the following command:
/usr/bin/rec_control get questions cache-entries cache-hits cache-misses \
    concurrent-queries resource-limits unauthorized-tcp unauthorized-udp \
    spoof-prevents answers-slow client-parse-errors answers0-1 answers1-10 \
    answers10-100 answers100-1000 qa-latency
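For anyone wanting to reproduce the numbers, below is a rough Python sketch of how the output of the command above could be turned into named counters. It assumes that rec_control prints one value per requested name, in the order requested; the helper names parse_values and read_counters are hypothetical, and mapping non-numeric values to None is a choice made for this sketch. This is not the script our Cacti setup actually uses.

```python
import subprocess


def parse_values(names, output):
    """Pair each requested counter name with the value printed for it."""
    values = output.split()
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} values, got {len(values)}")
    # Assumption: counters are plain non-negative integers; anything else becomes None.
    return {name: int(value) if value.isdigit() else None
            for name, value in zip(names, values)}


def read_counters(names, binary="/usr/bin/rec_control"):
    """Run 'rec_control get <names...>' and return a dict of counters."""
    result = subprocess.run([binary, "get", *names],
                            capture_output=True, text=True, check=True)
    return parse_values(names, result.stdout)


# Example with made-up output instead of a live recursor.
sample = parse_values(["questions", "cache-hits", "cache-misses"], "100000\n1100\n6800\n")
print(sample)
```

On a machine with a running recursor, read_counters(["questions", "cache-hits", "cache-misses"]) would return the live values in the same shape.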
Am I misinterpreting this, or is there definitely something going on?
Thanks for your time,
Jeremy