[Pdns-users] Enhancing pdns recursor observability
miesi at pc-h.de
Mon May 13 10:11:46 UTC 2013
Dear PowerDNS Developers,
every now and then one of our internal customers calls and says "this and
that record doesn't resolve, whereas it works when using Google, OpenDNS
or dig +trace".
And they are right :-( For example:
dig -x 184.108.40.206
pdns_recursor 3.3 sometimes reports only the CNAME (and a SERVFAIL), and
sometimes delivers both the CNAME and the queried PTR record.
I have no idea why 220.127.116.11 always returns the PTR; sometimes even
dig +trace fails.
To be able to understand these problems in a live system I would like to
have some sort of tracing facility in pdns_recursor which can be turned
on and off without restarting the service.
Ideally pdns_recursor would provide some sort of CLI which can be used
to create output channels and to create, list and delete filters, for example:
pcli create output logfile1 '/var/tmp/logfile_servfail'
There should be two types of filters: simple filters matching only
single log entries ("entry-filter") and filters that output the complete
transaction if any of the log entries matches ("transaction-filter").
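The difference between the two filter types can be sketched in a few lines of Python. This is only an illustration of the proposed semantics, not anything pdns_recursor provides: the dictionary-based entries, the function names and the choice to group a transaction by the pair (thrId, traId) are assumptions made for the example.

```python
from collections import defaultdict


def entry_filter(entries, predicate):
    """Return only the individual log entries that match the predicate."""
    return [e for e in entries if predicate(e)]


def transaction_filter(entries, predicate):
    """Return all entries of every transaction with at least one match.

    Assumption: a transaction is identified by (thrId, traId), because the
    transaction Id is only unique within a thread.
    """
    by_transaction = defaultdict(list)
    for e in entries:
        by_transaction[(e["thrId"], e["traId"])].append(e)

    selected = []
    for group in by_transaction.values():
        if any(predicate(e) for e in group):
            selected.extend(group)
    return selected


# Hypothetical sample data: two transactions, one ending in SERVFAIL.
entries = [
    {"thrId": 1, "traId": 123, "QD": "cQ", "status": None},
    {"thrId": 1, "traId": 123, "QD": "cA", "status": "SERVFAIL"},
    {"thrId": 1, "traId": 124, "QD": "cQ", "status": None},
    {"thrId": 1, "traId": 124, "QD": "cA", "status": "NOERROR"},
]


def is_servfail(entry):
    return entry["status"] == "SERVFAIL"


matching_entries = entry_filter(entries, is_servfail)
matching_transactions = transaction_filter(entries, is_servfail)
```

With this sample data the entry-filter yields the single SERVFAIL entry, while the transaction-filter yields both entries of transaction 123. A live implementation would have to buffer the entries of a transaction until it completes, which this batch sketch ignores.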
It should be possible to create filters on every field:
pcli create entry-filter f1 as query='%67.95.194.in-addr.arpa' to
Log entries contain the following information:
* traId: transaction Id: uniquely identify a transaction within a thread
* thrId: thread Id
* Proto: TCP or UDP (shortened to P in example)
* Version: 4 or 6 (shortened to V in example)
* srcIP: no need to explain
* dstIP: no need to explain
* QueryDirection (QD):
- cQ client query: query received by server from a client
- sQ server query: query sent by the server to authoritative DNS Servers
- PC lookup packet cache
- QC lookup query cache
- sA server answer: answer received by server
- cA client answer: answer sent to client
* Ty: type of resource asked (A, PTR, RP, ...)
* Query: the question value as a string, e.g. 'google.com'
* Status: NXDOMAIN, NOERROR, SERVFAIL...
* flags: qr, aa, rd, ra, ...
* Time: null for cQ and sQ; for sA, how long the individual query took;
for cA, how long it took from receiving cQ until cA was constructed
(including wait time in queues)
traId thrId V srcIP         dstIP          P QD Ty  Query                 St Fl       time
123   123   4 18.104.22.168 22.214.171.124 U cQ PTR 126.96.36.199.in-...     qr rd
123   123   4 188.8.131.52  2...:35        U sQ PTR 184.108.40.206.in-... NO qr rd
123   123   4 2...:35       220.127.116.11 U sA NS  a.in-addr....arpa.    NO qr rd    27
123   123   4 18.104.22.168 22.214.171.124 U cA PTR 126.96.36.199.in-...  NO qr rd ra 28
The example does not show the lookups in the packet cache and query cache
or the wait time in the receive queue. In an ideal world, the time spent
there would be shown as well.
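The log entry described above could be modelled roughly as follows. This is a minimal Python sketch under assumptions: the class name, the snake_case field names and the validation rules are invented for illustration, the unit of the time field is left open because the proposal does not state one, and the addresses in the usage example are documentation addresses rather than real traffic.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# The six query directions named in the proposal.
QUERY_DIRECTIONS = {"cQ", "sQ", "PC", "QC", "sA", "cA"}


@dataclass(frozen=True)
class LogEntry:
    tra_id: int                   # traId: unique within a thread
    thr_id: int                   # thrId
    version: int                  # IP version: 4 or 6
    src_ip: str
    dst_ip: str
    proto: str                    # "TCP" or "UDP"
    qd: str                       # query direction, one of QUERY_DIRECTIONS
    qtype: str                    # Ty: A, PTR, RP, ...
    query: str                    # question value, e.g. "google.com"
    status: Optional[str] = None  # NXDOMAIN, NOERROR, SERVFAIL, ...
    flags: Tuple[str, ...] = ()   # qr, aa, rd, ra, ...
    time: Optional[int] = None    # null for cQ and sQ; unit not specified

    def __post_init__(self):
        # Hypothetical sanity checks; the proposal does not define any.
        if self.qd not in QUERY_DIRECTIONS:
            raise ValueError(f"unknown query direction: {self.qd!r}")
        if self.version not in (4, 6):
            raise ValueError(f"unknown IP version: {self.version!r}")
        if self.proto not in ("TCP", "UDP"):
            raise ValueError(f"unknown protocol: {self.proto!r}")


# Illustrative entry for a client answer (addresses are made up).
entry = LogEntry(
    tra_id=123, thr_id=123, version=4,
    src_ip="192.0.2.53", dst_ip="192.0.2.10", proto="UDP",
    qd="cA", qtype="PTR", query="10.2.0.192.in-addr.arpa.",
    status="NOERROR", flags=("qr", "rd", "ra"), time=28,
)
```

Making the record immutable keeps an entry safe to hand to several output channels at once, which seems to fit the idea of independent filters reading the same stream.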
This should be implemented in a fashion where I could run
- entry-filter QD="cA" and status="SERVFAIL"
- entry-filter QD="sA" and time > 500
to send these log entries to a monitoring system where they can be
aggregated and alarms can be generated.
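As a rough illustration of the two filters above, the following Python sketch expresses each one as a predicate and forwards matching entries to a sink. Everything here is hypothetical: the dictionary layout, the function names and the list standing in for the monitoring system are assumptions, and the threshold 500 is simply the number from the example, with no unit implied.

```python
def client_servfail(entry):
    """entry-filter QD="cA" and status="SERVFAIL"."""
    return entry["QD"] == "cA" and entry["status"] == "SERVFAIL"


def slow_server_answer(entry, threshold=500):
    """entry-filter QD="sA" and time > 500 (time may be null)."""
    return (
        entry["QD"] == "sA"
        and entry["time"] is not None
        and entry["time"] > threshold
    )


def forward_matches(entries, predicates, sink):
    """Pass every entry that matches at least one predicate to the sink."""
    for entry in entries:
        if any(predicate(entry) for predicate in predicates):
            sink(entry)


# A plain list stands in for the monitoring system in this sketch.
monitoring = []

# Made-up sample entries, reduced to the fields the filters look at.
sample = [
    {"QD": "cQ", "status": None, "time": None},
    {"QD": "sA", "status": "NOERROR", "time": 27},
    {"QD": "sA", "status": "NOERROR", "time": 750},
    {"QD": "cA", "status": "SERVFAIL", "time": 760},
    {"QD": "cA", "status": "NOERROR", "time": 28},
]

forward_matches(sample, [client_servfail, slow_server_answer], monitoring.append)
```

Only the slow server answer and the SERVFAIL client answer end up in the sink; the aggregation and alarming would then happen on the monitoring side.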
The transaction-filter will be mainly used to debug why things are
Does anyone else sometimes need to dive deeply into how the recursor
is working and which servers in the outside world are failing?
Is this idea worth opening a wishlist ticket?