[Pdns-users] clarification of powerdns recursor statistics, input for munin-plugin
Marc Haber
mh+pdns-users at zugschlus.de
Tue Feb 15 09:51:19 UTC 2011
Hi,
I have written a "small" (~800 lines of perl) plugin for munin to
visualize the statistics written out by the pdns recursor by means of
rec_control get-all. Example output of the plugin can be downloaded
for some short period of time from
http://q.bofh.de/~mh/stuff/munin-pdns-recursor.pdf
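For readers who want to experiment with the same data without the Perl plugin, here is a minimal Python sketch of turning the text printed by rec_control get-all into a dictionary. It assumes each line holds one statistic name and one integer value separated by whitespace; the helper name parse_get_all and the sample values are made up for illustration.

```python
def parse_get_all(text):
    """Parse "name value" lines into a dict of integers.

    Assumption: one statistic per line, name and integer value
    separated by whitespace. Lines that do not fit are skipped.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, value = parts
        try:
            stats[name] = int(value)
        except ValueError:
            # Not an integer value, ignore the line.
            continue
    return stats


# Made-up sample input, not real recursor output.
sample = "all-outqueries\t1200\ntcp-outqueries\t15\nnot a statistic line\n"
print(parse_get_all(sample))
```

In a real plugin, the text would come from running rec_control get-all and capturing its standard output.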
I would like to (a) solicit your feedback about how output could be
improved, and (b) ask for clarification of what some of the data points
mean. I am sure that some of the data points can be plotted in the same
graph, just as the "Outresponses" graphs consolidate the different
types of answers.
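To illustrate what consolidating several data points into one graph looks like on the munin side, here is a rough Python sketch of the text a plugin prints for its config and fetch calls. It is not the plugin itself (that is Perl). The function names are hypothetical, and the choice of the DERIVE type with a minimum of 0 for counters is my assumption about a sensible munin setup.

```python
import re


def munin_field(stat_name):
    """Map a statistic name to a munin-safe field name.

    Assumption: munin field names may only contain letters, digits
    and underscores, so every other character becomes "_".
    """
    return re.sub(r"[^A-Za-z0-9_]", "_", stat_name)


def munin_config(title, stat_names):
    """Build the text printed when the plugin is called with 'config'."""
    lines = ["graph_title " + title, "graph_vlabel per second"]
    for name in stat_names:
        field = munin_field(name)
        lines.append(f"{field}.label {name}")
        lines.append(f"{field}.type DERIVE")  # counter, plotted as a rate
        lines.append(f"{field}.min 0")        # suppress negative spikes
    return "\n".join(lines)


def munin_fetch(stats, stat_names):
    """Build the text printed when the plugin is called without arguments."""
    return "\n".join(
        f"{munin_field(name)}.value {stats[name]}" for name in stat_names
    )


names = ["noerror-answers", "nxdomain-answers", "servfail-answers"]
print(munin_config("Outresponses by type", names))
# Made-up values for illustration only.
print(munin_fetch({"noerror-answers": 90, "nxdomain-answers": 8,
                   "servfail-answers": 2}, names))
```

All three fields end up in one graph because they share a single graph_title block.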
On IRC, I learned the following terminology:
A Query is a DNS message containing a DNS query. Another word for this
would be "Question".
A Response is a DNS message containing a DNS response. Other words for
this are "Reply" and "Answer". An answer can contain multiple records.
Within UDP transport, both Queries and Responses are "Packets",
formally "UDP datagrams".
Within TCP transport, both Queries and Responses are "Messages", which
are actually Packets with a Length field.
An Inquery is a query that is received by the recursor from a client.
An Outquery is a query that is sent out by the recursor in order to
collect data necessary to answer an Inquery.
An Inresponse is a response that is received by the recursor as
response to an Outquery.
An Outresponse is a response that is sent out by the recursor as
response to an Inquery.
A Query that has been sent, but the associated Response has not been
received is called "Outstanding". If a Query is Outstanding for too
long, it is "Timed Out".
I have been trying to group the data points so that they make sense. My
documentation in the rest of this message may be wrong, and I'd like
you to correct me if I'm doing things wrong. Things written in
brackets [] are questions that have surfaced during my work. My
ultimate goal is to produce documentation that is suitable to improve
http://doc.powerdns.com/recursor-stats.html at some time in the
future.
All values (unless stated otherwise) are counters starting at zero
when the PowerDNS Recursor is started, and roll over after exceeding
the machine word size. Values labeled "Current" reflect the current
value of a non-counter value.
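Since counters only become meaningful as differences between two samples, a small Python sketch of the delta and rate computation with rollover handling may help. The word size is a parameter because it depends on the machine; the default of 32 bits is only an assumption, as are the function names.

```python
def counter_delta(previous, current, word_bits=32):
    """Difference between two counter samples, allowing one rollover.

    Assumption: the counter wraps to zero after 2**word_bits - 1.
    """
    if current >= previous:
        return current - previous
    # The counter wrapped around between the two samples.
    return current + (1 << word_bits) - previous


def rate_per_second(previous, current, interval_seconds, word_bits=32):
    """Average events per second between two samples."""
    return counter_delta(previous, current, word_bits) / interval_seconds


# Made-up samples taken 300 seconds apart.
print(rate_per_second(1000, 4000, 300))
print(counter_delta(2**32 - 10, 5))
```

A restart of the recursor also resets the counters to zero and would be mistaken for a rollover by this sketch; comparing uptime between samples is one way to tell the two cases apart.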
Outqueries:
Counts:
all-outqueries: outgoing UDP queries
this is kind of a misnomer and should be called udp-outqueries
tcp-outqueries: outgoing TCP queries
ipv6-outqueries: outgoing queries over IPv6
dont-outqueries: outgoing queries dropped because of 'dont-query' setting (since 3.3)
[Where is "ipv4-outqueries"? Do "all-outqueries" and
"tcp-outqueries" include outqueries sent via IPv6? I'd like to see
something more orthogonal like "ipv4-outqueries", "ipv6-outqueries",
"tcp-outqueries", "udp-outqueries", "all outqueries", or maybe even
more granular like "{tcp,udp}-ipv{4,6}-outqueries"]
outgoing-timeouts
counts the number of timeouts on outgoing UDP queries since starting
unexpected-packets
unexpected Inresponses (might point to spoofing)
[would it make sense to have data points that show the distribution
of the answer times our Outqueries experience?]
Outresponses
by Time and Source:
answers0-1: queries answered within 1 millisecond
answers1-10: queries answered within 10 milliseconds
answers10-100: queries answered within 100 milliseconds
answers100-1000: queries answered within 1 second
answers-slow: queries answered after more than 1 second
packetcache-hits: Packet cache hits
over-capacity-drops: Queries dropped because over maximum concurrent query limit
resource-limits: queries that could not be performed because of resource limits
questions: all End-user initiated queries with the RD bit set
[which fields sum up to questions, which fields are misplaced in this
graph?]
['answers' is a misnomer, maybe "response" values should be added
and the "answers" versions deprecated.]
[in the sample PDF, packetcache-hits, over-capacity-drops, questions
are still plotted independently]
[questions will be plotted as a line which should match the top line
of the areas]
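One way to investigate which fields add up to questions is to compute the gap for a candidate set of fields on real data. The Python sketch below does only that. It does not claim to know the right set; the candidate list, the function name, and the sample numbers are all made up for illustration.

```python
ANSWER_BUCKETS = [
    "answers0-1",
    "answers1-10",
    "answers10-100",
    "answers100-1000",
    "answers-slow",
]


def questions_gap(stats, candidate_fields):
    """Return questions minus the sum of the candidate fields.

    A result of zero means the candidates account for all questions;
    missing fields are treated as zero.
    """
    accounted = sum(stats.get(field, 0) for field in candidate_fields)
    return stats["questions"] - accounted


# Made-up numbers, not taken from a real recursor.
sample = {
    "questions": 1000,
    "answers0-1": 600,
    "answers1-10": 200,
    "answers10-100": 100,
    "answers100-1000": 40,
    "answers-slow": 10,
    "packetcache-hits": 50,
}
print(questions_gap(sample, ANSWER_BUCKETS))
print(questions_gap(sample, ANSWER_BUCKETS + ["packetcache-hits"]))
```

Running this against a few get-all snapshots with different candidate lists would show which combination, if any, tracks questions.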
by Type:
noerror-answers: sent NOERROR outresponses
nxdomain-answers: sent NXDOMAIN outresponses
servfail-answers: sent SERVFAIL outresponses
unauthorized-tcp: TCP queries denied because of allow-from restrictions
unauthorized-udp: UDP queries denied because of allow-from restrictions
Parse Errors:
server-parse-errors
server replied packets that could not be parsed
[Which server? Us or the servers we query? Are we talking about
Inqueries or Inresponses?]
Latency:
qa-latency
shows the current latency average, in microseconds
[is this the latency we experience with our Outqueries, or the
latency we produce for our Inqueries? Shouldn't there be a value
for the corresponding other latency, and what does the qa- prefix
mean?]
TCP:
tcp-client-overflow: times an IP address was denied TCP access because it already had too many connections
tcp-questions: incoming TCP queries
[do those fit anywhere else or do both values need their own graph;
where is udp-questions? is there also udp-client-overflow? The
"times an _IP_address_ was denied" is kind of unique, why
not simply count "queries that were denied TCP access
because...?"]
Records:
dlg-only-drops: records dropped because of delegation only setting
Query Cache:
Hit/Miss:
cache-hits: cache hits
cache-misses: cache misses
Bytes:
cache-bytes: Current size of the Query cache in bytes
Entries:
cache-entries: Current number of entries in the Query cache
Packet Cache:
Hit/Miss:
packetcache-hits: Packet cache hits
packetcache-misses: Packet cache misses
Bytes:
packetcache-bytes: Current size of the Packet cache in bytes
Entries:
packetcache-entries: Current number of entries in the Packet cache
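For either cache, the hit and miss counters can be combined into a hit ratio, which may be easier to read than two separate lines. A minimal Python sketch follows; the function name is hypothetical, and the inputs should be the differences between two samples rather than the lifetime totals.

```python
def hit_ratio(hits, misses):
    """Fraction of lookups served from the cache, or None if there
    were no lookups at all in the sampled interval."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total


# Made-up per-interval deltas for cache-hits/cache-misses and
# packetcache-hits/packetcache-misses.
print(hit_ratio(750, 250))
print(hit_ratio(0, 0))
```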
Negative Answer Cache:
negcache-entries: Current number of entries in the Negative answer cache
[shouldn't there be a negcache-size as well?]
MThreads:
concurrent-queries
Current number of MThreads running
max-mthread-stack
maximum amount of thread stack ever used
Process Stats:
CPU time for PowerDNS
sys-msec: CPU milliseconds spent in 'system' mode
user-msec: CPU milliseconds spent in 'user' mode
uptime
Wall-clock time in seconds since the recursor was started
[this is plotted wrong in the sample PDF]
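The two CPU counters become a percentage once their growth is divided by the wall-clock time between two samples. A small Python sketch, assuming both samples come from the same recursor process (no restart in between) and that the interval is known in seconds; the function name and sample values are illustrative only.

```python
def cpu_percent(previous, current, interval_seconds):
    """CPU usage in percent of one core between two samples.

    Both arguments are dicts holding "sys-msec" and "user-msec".
    """
    busy_msec = (
        (current["sys-msec"] - previous["sys-msec"])
        + (current["user-msec"] - previous["user-msec"])
    )
    return 100.0 * busy_msec / (interval_seconds * 1000.0)


# Made-up samples taken 60 seconds apart.
before = {"sys-msec": 10000, "user-msec": 40000}
after = {"sys-msec": 11500, "user-msec": 44500}
print(cpu_percent(before, after, 60))
```

The difference in uptime between the two samples could serve as interval_seconds, which also keeps the computation independent of the polling schedule.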
Throttle:
throttled-out: throttled outgoing UDP queries
throttle-entries: Current number of entries in the throttle map
[is there also a value for throttled outgoing TCP queries? How does
the magnitude of these values usually correspond? Does it make sense
to plot both into the same graph?]
Invalidations:
nsset-invalidations
nssets dropped because they stopped working
I need more explanation for these values:
chain-resends: number of queries chained to existing outstanding query
number of [in|out]queries chained (what?) to existing
outstanding [in|out]queries?
client-parse-errors: counts number of client packets that could not be parsed
How often is this supposed to happen, and in which circumstances
short of attacks or network errors?
EDNS:
edns_ping_matches:
edns_ping_mismatches: does this relate to
draft-hubert-ulevitch-edns-ping-01? I guess it counts outgoing
EDNS pings and the corresponding answers. When does the recursor
send out a ping?
noedns_outqueries: is that a special kind of outquery? Is "no" the
opposite of "yes" or short for "number of"?
nsspeeds:
nsspeeds-entries: entries in the NS speeds map
[this is not mentioned outside the statistics chapter at all]
spoof-prevents
number of times PowerDNS considered itself spoofed, and dropped the data
[does this relate to Inresponses?]
Thanks for reading through this, and thanks for your answers.
Greetings
Marc
--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany | lose things." Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature | How to make an American Quilt | Fax: *49 3221 2323190