[Pdns-users] test driving LMDB backend

Bart Mortelmans powerdns at bart.bim.be
Mon May 6 08:02:24 UTC 2019


Hi,

I've been test-driving the new PowerDNS LMDB backend. Even though my
tests are very basic, I thought some of you might be interested in my
findings.

TL;DR: It's easy to set up (at least as a slave). In my basic setup, it
could handle about 7 times the load the MySQL back-end could handle. And
it starts incredibly fast.

I used a cheap 1-CPU, 1 GB RAM VPS for my tests.

For this I compiled the latest source available from GitHub. If anybody
is interested in instructions on how to get this compiled on CentOS 7,
let me know.

To enable lmdb, I simply put this in pdns.conf:
launch=lmdb
lmdb-filename=/var/pdns/pdns2.lmdb
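
The LMDB part really is just those two lines. To actually run as a
slave you of course also need the usual slave settings, which are not
LMDB-specific at all; roughly something like this (the address is just
a placeholder):

slave=yes
local-address=192.0.2.1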

After starting, PDNS uses about 60 MB of memory. As expected, this
stays the same as you start loading zones into the database; only the
disk cache (which the system takes care of) grows.
Once you actually start asking questions, PDNS memory usage does grow. I
guess at some point the query cache and packet cache might end up
cannibalizing the disk cache.
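
If that ever becomes a problem, the caches can be capped and tuned in
pdns.conf with the standard authoritative-server settings (nothing
LMDB-specific; I just ran with the defaults, so the values below are
only an illustration):

max-cache-entries=500000
cache-ttl=20
query-cache-ttl=20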

I loaded it with about 25000 slave zones, mostly small zones, which I
guess would be typical for shared DNS hosting (since that's where they
come from). The folder in which the LMDB database is kept only grew to
67 MB.

If you restart PDNS, it will be responding to requests again in less
than 1 second.
I actually found this out because systemd was restarting the service
every couple of minutes. It turns out that I should have put
"Type=simple" in the .service file instead of "Type=notify".
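
In case anybody runs into the same thing: the relevant part of the
.service file is simply something like this (the ExecStart path and
options depend on how you built and installed it):

[Service]
Type=simple
ExecStart=/usr/local/sbin/pdns_server --daemon=no --guardian=no
Restart=on-failure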

As a very basic test to make sure the zone transfers went okay, I
compared the output of "dig -t AXFR" against the test server and the
master server for 1000 of the domain names; all of them were identical.
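
The check itself was nothing fancy; roughly this kind of thing (a
rough sketch, not the exact script: the server addresses and the zone
list file are placeholders for my own setup):

#!/usr/bin/env python3
# Rough sketch: compare "dig -t AXFR" output between the master and the
# LMDB test server for a list of zones. Addresses and file name are placeholders.
import subprocess

MASTER = "192.0.2.10"   # placeholder: the master server
TEST = "192.0.2.20"     # placeholder: the LMDB test server

def axfr(server, zone):
    out = subprocess.run(
        ["dig", "+nocmd", "+nostats", "-t", "AXFR", zone, "@" + server],
        capture_output=True, text=True, check=True).stdout
    # Sort the records so a different ordering doesn't count as a mismatch
    return sorted(line for line in out.splitlines()
                  if line and not line.startswith(";"))

with open("zones.txt") as f:    # one zone name per line
    for zone in (line.strip() for line in f if line.strip()):
        if axfr(MASTER, zone) != axfr(TEST, zone):
            print("MISMATCH:", zone)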

LMDB is meant to be quick, so I wanted to do some tests on the load it 
could handle. I used dnsblast for this and kept an eye on the PowerDNS 
Metronome service.

While dnsblast was running, I also had a simple script requesting
ever-changing random subdomains from one of the zones. If it didn't
receive the correct answer within 1 second, it would print an error.
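
For anybody who wants to reproduce it, the script was basically just
this kind of loop (a rough sketch, not the exact script; the zone, the
server address and the expected answer are placeholders, and it assumes
the zone has a wildcard A record):

#!/usr/bin/env python3
# Rough sketch of the checker: ask for a never-repeating random subdomain
# and complain if the expected answer doesn't arrive within a second.
import random
import string
import subprocess
import time

SERVER = "192.0.2.20"     # placeholder: the LMDB test server
ZONE = "example.com"      # placeholder: a zone with a wildcard A record
EXPECTED = "192.0.2.80"   # placeholder: the address the wildcard points at

while True:
    label = "".join(random.choices(string.ascii_lowercase, k=12))
    name = label + "." + ZONE
    answer = subprocess.run(
        ["dig", "+short", "+time=1", "+tries=1", name, "@" + SERVER],
        capture_output=True, text=True).stdout.strip()
    if answer != EXPECTED:
        print("ERROR: no (correct) answer within 1 second for", name)
    time.sleep(0.1)       # roughly 10 queries per second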

dnsblast also requests mostly random subdomains (I changed it to request 
subdomains from a domain name in my DB). It should be a reasonably good 
test of what your DB can handle.

I could run dnsblast at 7500 requests per second before a line
appeared on the Metronome "DB queue" graph. All the requests from my
separate test script were still being answered. When I went up to 10000
requests per second from dnsblast, some of the requests from my test
script were not answered in time and the DB queue went up to about
2000.
Once the first requests need to be queued for the DB, you should be near
the maximum sustained load you'll be able to handle. In this case, that
seemed to be around 7500 requests per second.

I did the same test with a similar setup but using the MySQL back-end.
In that case, at about 1000 requests per second the DB queue had some
requests in it, and from 1500 requests per second onward not all the
requests from my separate script were being answered. Metronome also
started showing figures in the "Timedout queries" graph.
In this case the maximum sustained load seems to be around 1000 unique
requests per second.

The numbers are there more for comparison than for knowing exactly what
this back-end can handle. Again: all this was on very basic machines,
1 CPU and 1 GB RAM. And dnsblast was running on the same machine, using
about 30% of that one CPU...
At maximum load, the CPU was clearly the bottleneck. So if you want to
be able to handle a bigger load, adding extra CPU power should be the
first priority.
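
Related to that: if you do have more cores available, the number of
threads PowerDNS uses can be raised in pdns.conf. I just ran with the
defaults; the values below are only an illustration:

receiver-threads=2
distributor-threads=3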

What did confuse me was the reply ratio shown by dnsblast. From about
300 requests per second upward, the reply rate went below 100% and would
often be around 30% or even lower. This seemed to match the numbers
shown by Metronome as "UDP in-error/s". So it looked like most requests
were not being answered, but at the same time my separate script would
still get a reply within less than 1 second to every single one of its
requests. Can anybody shed some light on what "UDP in-error/s" means?
This was consistent with both the MySQL back-end and LMDB.

Regards,
Bart Mortelmans


