[Pdns-users] Enhancing pdns recursor observability

Mon May 13 10:11:46 UTC 2013

Dear PowerDNS Developers,

Every now and then one of our internal customers calls and says "this and 
that record doesn't resolve, whereas it works when using Google, OpenDNS 
or dig +trace".
And they are right :-( For example:

dig -x  194.95.67.2

pdns_recursor 3.3 sometimes reports only the CNAME (and a SERVFAIL), and 
sometimes delivers both the CNAME and the queried PTR record.

I have no idea why 8.8.8.8 always returns the PTR, while sometimes even 
dig +trace fails.

To be able to understand these problems in a live system I would like to 
have some sort of tracing facility in pdns_recursor which can be turned 
on and off without restarting the service.

Ideally pdns_recursor would provide some sort of CLI which can be used 
to create output channels and to create, list and delete filters.

pcli create output logfile1 '/var/tmp/logfile_servfail'

There should be two types of filters: simple filters matching only 
single log entries ("entry-filter") and filters that output the complete 
transaction if any of the log entries matches ("transaction-filter").
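
A minimal sketch of how the two filter types could differ, written in 
Python purely for illustration. This is not pdns_recursor code: the 
function names and the dict-based log entries are hypothetical, and the 
keys simply reuse the field names proposed in this mail.

```python
from collections import defaultdict


def entry_filter(entries, predicate):
    """Return only the individual log entries that match the predicate."""
    return [e for e in entries if predicate(e)]


def transaction_filter(entries, predicate):
    """Return all entries of every transaction in which any entry matches."""
    by_transaction = defaultdict(list)
    for e in entries:
        # traId is only unique within a thread, so the key needs both ids.
        by_transaction[(e["thrId"], e["traId"])].append(e)
    result = []
    for group in by_transaction.values():
        if any(predicate(e) for e in group):
            result.extend(group)
    return result


# Hypothetical sample data: two transactions, one ending in SERVFAIL.
entries = [
    {"thrId": 1, "traId": 123, "QD": "cQ", "Status": None},
    {"thrId": 1, "traId": 123, "QD": "cA", "Status": "SERVFAIL"},
    {"thrId": 1, "traId": 124, "QD": "cQ", "Status": None},
    {"thrId": 1, "traId": 124, "QD": "cA", "Status": "NOERROR"},
]


def is_servfail(entry):
    return entry["Status"] == "SERVFAIL"


print(entry_filter(entries, is_servfail))
print(transaction_filter(entries, is_servfail))
```

With the same predicate, the entry filter yields only the failing cA 
line, while the transaction filter also yields the cQ line that belongs 
to the same transaction.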

One should be able to create filters on every field:

pcli create entry-filter f1 as query='%67.95.194.in-addr.arpa' to 
logfile1, stdout
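
The meaning of '%' in the filter value is not spelled out here. 
Assuming it behaves like the SQL LIKE wildcard (any sequence of 
characters, including none), the matching could be sketched as follows. 
The helper name like_match is hypothetical.

```python
import re


def like_match(pattern, value):
    """Match value against a pattern in which '%' means 'any characters'.

    Assumption: '%' works like the SQL LIKE wildcard. DNS names are
    compared case-insensitively.
    """
    # Escape every literal part, then join the parts with ".*" for each '%'.
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, value, flags=re.IGNORECASE | re.DOTALL) is not None


pattern = "%67.95.194.in-addr.arpa"
print(like_match(pattern, "0.63.67.95.194.in-addr.arpa"))
print(like_match(pattern, "google.com"))
```

Under this assumption a leading '%' without a following dot would also 
match a name such as 167.95.194.in-addr.arpa, so a pattern like 
'%.67.95.194.in-addr.arpa' may be the safer choice for a single zone.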

Log entries contain the following information:

* traId: transaction Id: uniquely identifies a transaction within a thread
* thrId: thread Id
* Proto: TCP or UDP (shortened to P in example)
* Version: 4 or 6 (shortened to V in example)
* srcIP: no need to explain
* dstIP: no need to explain
* QueryDirection (QD):
   - cQ client query: query received by server from a client
   - sQ server query: query sent by the server to authoritative DNS Servers
   - PC lookup packet cache
   - QC lookup query cache
   - sA server answer: answer received by server
   - cA client answer: answer sent to client
* Ty: type of resource asked (A, PTR, RP, ...)
* Query: the question value as a string, e.g. 'google.com' or 
'0.63.67.95.194.in-addr.arpa'
* Status: NXDOMAIN, NOERROR, SERVFAIL...
* flags: qr, aa, rd, ra, ...
* Time: null for cQ and sQ; for sA, how long the individual query took; 
for cA, how long it took from receiving cQ until cA was constructed 
(including wait time in queues)
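
As an illustration only, the fields above could be modelled as a small 
record type. The Python names are hypothetical, the addresses in the 
example are documentation placeholders, and the unit of Time is assumed 
to be milliseconds, which is not stated above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# The six QueryDirection values from the list above.
QUERY_DIRECTIONS = ("cQ", "sQ", "PC", "QC", "sA", "cA")


@dataclass(frozen=True)
class LogEntry:
    tra_id: int               # traId: unique only within a thread
    thr_id: int               # thrId
    proto: str                # "TCP" or "UDP"
    version: int              # IP version: 4 or 6
    src_ip: str
    dst_ip: str
    qd: str                   # QueryDirection
    qtype: str                # Ty: A, PTR, RP, ...
    query: str
    status: Optional[str]     # NXDOMAIN, NOERROR, SERVFAIL, ...
    flags: Tuple[str, ...]    # qr, aa, rd, ra, ...
    time_ms: Optional[int]    # Time; milliseconds is an assumption

    def __post_init__(self):
        if self.qd not in QUERY_DIRECTIONS:
            raise ValueError(f"unknown QueryDirection: {self.qd}")
        if self.proto not in ("TCP", "UDP"):
            raise ValueError(f"unknown protocol: {self.proto}")
        if self.version not in (4, 6):
            raise ValueError(f"unknown IP version: {self.version}")
        # Time is null for queries; it is only set on answers.
        if self.qd in ("cQ", "sQ") and self.time_ms is not None:
            raise ValueError("Time must be null for cQ and sQ")


# Hypothetical client answer, with placeholder addresses.
example = LogEntry(
    tra_id=123, thr_id=123, proto="UDP", version=4,
    src_ip="192.0.2.53", dst_ip="192.0.2.1",
    qd="cA", qtype="A", query="google.com",
    status="NOERROR", flags=("qr", "rd", "ra"), time_ms=28,
)
print(example)
```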

traId thrId V srcIP   dstIP   P QD Ty  Query              St Fl     time
123   123   4 1.2.2.1 1.0.0.2 U cQ PTR 2.67.95.194.in-...    qr rd
123   123   4 4.0.0.3 2...:35 U sQ PTR 2.67.95.194.in-... NO qr rd
123   123   4 2...:35 4.0.0.3 U sA NS  a.in-addr....arpa. NO qr rd    27
123   123   4 1.0.0.2 1.2.2.1 U cA PTR 2.67.95.194.in-... NO qr rd ra 28

The example does not show the lookups in the packet cache and query 
cache, or the wait time in the receive queue. In an ideal world the time 
spent there would be shown as well.

This should be implemented in such a way that I could run
- entry-filter QD="cA" and status="SERVFAIL"
- entry-filter QD="sA" and time > 500

to send these log entries to a monitoring system where they can be 
aggregated and alarms can be generated.
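
A rough sketch of what the monitoring side might do with these two 
filters, again in Python and under stated assumptions: the entries are 
plain dicts keyed by the field names from this mail, the threshold of 
500 is taken from the filter above with no unit implied, and all 
function names are hypothetical.

```python
from collections import Counter


def client_servfail(entry):
    """entry-filter QD="cA" and status="SERVFAIL"."""
    return entry["QD"] == "cA" and entry["Status"] == "SERVFAIL"


def slow_server_answer(entry, threshold=500):
    """entry-filter QD="sA" and time > 500."""
    return (
        entry["QD"] == "sA"
        and entry["Time"] is not None
        and entry["Time"] > threshold
    )


FILTERS = {
    "client_servfail": client_servfail,
    "slow_server_answer": slow_server_answer,
}


def count_matches(entries):
    """Count how many entries each filter matched, e.g. per polling interval."""
    counts = Counter()
    for entry in entries:
        for name, predicate in FILTERS.items():
            if predicate(entry):
                counts[name] += 1
    return counts


# Hypothetical sample entries.
sample = [
    {"QD": "cA", "Status": "SERVFAIL", "Time": 28},
    {"QD": "sA", "Status": "NOERROR", "Time": 750},
    {"QD": "sA", "Status": "NOERROR", "Time": 27},
    {"QD": "cQ", "Status": None, "Time": None},
]
print(count_matches(sample))
```

An alarm could then be raised whenever one of the counters exceeds a 
limit within an interval.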

The transaction-filter will mainly be used to debug why things are 
happening.

Does anyone else sometimes need to dive deeply into how the recursor 
is working and which servers in the outside world are failing?

Is this idea worth opening a wishlist ticket?

Regards Thomas