[Pdns-users] clarification of powerdns recursor statistics, input for munin-plugin

Marc Haber mh+pdns-users at zugschlus.de
Tue Feb 15 09:51:19 UTC 2011


Hi,

I have written a "small" (~800 lines of perl) plugin for munin to
visualize the statistics written out by the pdns recursor by means of
rec_control get-all. Example output of the plugin can be downloaded
for some short period of time from
http://q.bofh.de/~mh/stuff/munin-pdns-recursor.pdf
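
To illustrate the input side, here is a minimal sketch in Python (not
the actual Perl plugin) of how the output of rec_control get-all could
be turned into a lookup table. It assumes that every line holds one
statistic name and one integer value separated by whitespace; the
function name parse_get_all and the sample values are made up for
illustration.

```python
def parse_get_all(text):
    """Parse 'name value' lines into a dict mapping name -> int.

    Assumption: one statistic per line, name and integer value
    separated by whitespace. Lines that do not fit are skipped.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # blank or unexpected line
        name, value = parts
        try:
            stats[name] = int(value)
        except ValueError:
            continue  # value is not an integer, ignore the line
    return stats


# Invented sample input, not real recursor output.
sample = "all-outqueries\t1200\ncache-hits\t300\ncache-misses\t100\n"
stats = parse_get_all(sample)
```

In a real plugin the text would come from running rec_control get-all
as a subprocess; that part is left out here.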

I would like to (a) solicit your feedback on how the output could be
improved, and (b) ask for clarification of what some of the data points
mean. I am sure that some of the data points can be plotted in the same
graph, the way the "Outresponses" graphs consolidate the different
types of answers.
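
As a rough sketch of what consolidating several data points into one
graph means on the munin side (Python here, not the actual Perl
plugin): munin field names may not contain hyphens, so the statistic
names are mapped to underscores, and the values are stacked by drawing
the first field as AREA and the rest as STACK. The helper names
munin_field, munin_config and munin_fetch are hypothetical.

```python
def munin_field(stat_name):
    """Map a recursor statistic name to a munin-safe field name."""
    return stat_name.replace("-", "_")


def munin_config(title, stat_names):
    """Build the 'config' output for one stacked graph of counters."""
    lines = ["graph_title " + title, "graph_args --base 1000 -l 0"]
    for index, name in enumerate(stat_names):
        field = munin_field(name)
        lines.append(field + ".label " + name)
        lines.append(field + ".type DERIVE")  # counter, plotted as a rate
        lines.append(field + ".min 0")        # no negative spike on restart
        lines.append(field + ".draw " + ("AREA" if index == 0 else "STACK"))
    return "\n".join(lines)


def munin_fetch(stats, stat_names):
    """Build the 'fetch' output from a dict of current values."""
    return "\n".join(
        munin_field(name) + ".value " + str(stats[name]) for name in stat_names
    )


names = ["noerror-answers", "nxdomain-answers", "servfail-answers"]
print(munin_config("Outresponses by type", names))
print(munin_fetch({"noerror-answers": 10, "nxdomain-answers": 2,
                   "servfail-answers": 1}, names))
```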

On IRC, I learned the following terminology:

A Query is a DNS message containing a DNS query. Another word for this
would be "Question".

A Response is a DNS message containing a DNS response. Other words for
this are "Reply" and "Answer". An answer can contain multiple records.

Within UDP transport, both Queries and Responses are "Packets",
formally "UDP datagrams".

Within TCP transport, both Queries and Responses are "Messages", which
are actually Packets with a Length field.

An Inquery is a query that is received by the recursor from a client.

An Outquery is a query that is sent out by the recursor in order to
collect data necessary to answer an Inquery.

An Inresponse is a response that is received by the recursor as
response to an Outquery.

An Outresponse is a response that is sent out by the recursor as
response to an Inquery.

A Query that has been sent, but whose associated Response has not yet
been received, is called "Outstanding". If a Query is Outstanding for
too long, it is "Timed Out".

I have been trying to group the data points so that they make sense. My
documentation in the rest of this message may be wrong, and I'd like
you to correct me if I'm doing things wrong. Things written in
brackets [] are questions that have surfaced during my work. My
ultimate goal is to produce documentation that is suitable to improve
http://doc.powerdns.com/recursor-stats.html at some time in the
future.

All values (unless stated otherwise) are counters starting at zero
when the PowerDNS Recursor is started, and roll over after exceeding
the machine word size. Values labeled "Current" reflect the current
value of a non-counter value.
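
Because the counters roll over, a consumer has to cope with a sample
that is smaller than the previous one. A minimal sketch, assuming the
counter wraps at 2**word_bits and wraps at most once between two
samples (the function names and the word_bits parameter are mine, not
something the recursor provides):

```python
def counter_delta(previous, current, word_bits=32):
    """Increase of a counter between two samples, allowing one rollover.

    Assumption: the counter wraps at 2**word_bits and at most once
    between the two samples.
    """
    modulus = 1 << word_bits
    return (current - previous) % modulus


def counter_rate(previous, current, seconds, word_bits=32):
    """Average increase per second over the sampling interval."""
    return counter_delta(previous, current, word_bits) / seconds


print(counter_delta(100, 150))          # no rollover
print(counter_delta(2**32 - 10, 5))     # one rollover in between
```

Note that a restart of the recursor also makes the new sample smaller
than the old one, and this arithmetic cannot tell the two cases apart;
comparing the uptime value between samples would be one way to detect
a restart.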

Outqueries:
  Counts:
    all-outqueries: outgoing UDP queries
      this is kind of a misnomer and should be called udp-outqueries
    tcp-outqueries: outgoing TCP queries
    ipv6-outqueries: outgoing queries over IPv6
    dont-outqueries: outgoing queries dropped because of 'dont-query' setting (since 3.3)
    [Where is "ipv4-outqueries"? Do "all-outqueries" and
     "tcp-outqueries" include outqueries sent via IPv6? I'd like to see
     something more orthogonal like "ipv4-outqueries", "ipv6-outqueries",
     "tcp-outqueries", "udp-outqueries", "all outqueries", or maybe even
     more granular like "{tcp,udp}-ipv{4,6}-outqueries"]

  outgoing-timeouts
    counts the number of timeouts on outgoing UDP queries since starting
  
  unexpected-packets 
    unexpected Inresponses (might point to spoofing)

  [would it make sense to have data points that show the distribution
   of the answer times our Outqueries experience?]

Outresponses
  by Time and Source:
    answers0-1: queries answered within 1 millisecond
    answers1-10: queries answered within 10 milliseconds
    answers10-100: queries answered within 100 milliseconds
    answers100-1000: queries answered within 1 second
    answers-slow: queries answered after > 1 second
    packetcache-hits: Packet cache hits
    over-capacity-drops: Queries dropped because over maximum concurrent query limit
    resource-limits: queries that could not be performed because of resource limits
    questions: all End-user initiated queries with the RD bit set
    [which fields sum up to questions, which fields are misplaced in this
     graph?]
    ['answers' is a misnomer, maybe "response" values should be added
     and the "answers" versions deprecated.]
    [in the sample PDF, packetcache-hits, over-capacity-drops, questions
     are still plotted independently]
    [questions will be plotted as a line which should match the top line
     of the areas]
  by Type:
    noerror-answers: sent NOERROR outresponses
    nxdomain-answers: sent NXDOMAIN outresponses
    servfail-answers: sent SERVFAIL outresponses
    unauthorized-tcp: TCP queries denied because of allow-from restrictions
    unauthorized-udp: UDP queries denied because of allow-from restrictions
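
One way to test guesses about which fields sum up to "questions" is to
add up a candidate set of fields and look at what is left over. The
sketch below does only that; which fields really belong in the sum is
exactly the open question, so the candidate list is an assumption and
the sample numbers are invented.

```python
# Candidate fields; whether these (and only these) add up to
# "questions" is an assumption to be checked against real data.
TIME_BUCKETS = [
    "answers0-1",
    "answers1-10",
    "answers10-100",
    "answers100-1000",
    "answers-slow",
]


def unexplained_questions(stats, extra=()):
    """Return questions minus the sum of the candidate fields."""
    candidates = list(TIME_BUCKETS) + list(extra)
    accounted = sum(stats.get(name, 0) for name in candidates)
    return stats["questions"] - accounted


# Invented numbers for illustration only.
sample = {
    "questions": 1000,
    "answers0-1": 400,
    "answers1-10": 200,
    "answers10-100": 150,
    "answers100-1000": 40,
    "answers-slow": 10,
    "packetcache-hits": 200,
}
print(unexplained_questions(sample))
print(unexplained_questions(sample, extra=["packetcache-hits"]))
```

A remainder that stays close to zero over many samples would support
a given set of fields; a growing remainder would point to a missing or
misplaced field.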

Parse Errors:
  server-parse-errors
    server replied packets that could not be parsed
    [Which server? We or the servers we query? Are we talking about
     Inqueries or Inresponses?]

Latency:
  qa-latency
    shows the current latency average, in microseconds
    [is this the latency we experience with our Outqueries, or the
     latency we produce for our Inqueries? Shouldn't there be a value
     for the corresponding other latency, and what does the qa- prefix
     mean?]

TCP:
  tcp-client-overflow: times an IP address was denied TCP access because it already had too many connections
  tcp-questions: incoming TCP queries
  [do those fit anywhere else or do both values need their own graph;
  where is udp-questions? is there also udp-client-overflow? The
  "times an _IP_address_ was denied" is kind of unique, why
  not simply count "queries that were denied TCP access
  because...?"]

Records:
  dlg-only-drops: records dropped because of delegation only setting

Query Cache:
  Hit/Miss:
    cache-hits: cache hits
    cache-misses: cache misses
  Bytes:
    cache-bytes: Current size of the Query cache in bytes
  Entries:
    cache-entries: Current number of entries in the Query cache

Packet Cache:
  Hit/Miss:
    packetcache-hits: Packet cache hits
    packetcache-misses: Packet cache misses
  Bytes:
    packetcache-bytes: Current size of the Packet cache in bytes
  Entries:
    packetcache-entries: Current number of entries in the Packet cache
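
For the Hit/Miss pairs of both caches, a derived hit ratio may be
easier to read than two raw counters. A minimal sketch, with a
hypothetical helper name; it should be fed the increase of both
counters over the same interval, not the absolute values, if the ratio
is meant to reflect current behaviour:

```python
def hit_ratio(hits, misses):
    """Fraction of lookups served from the cache, or None if no lookups."""
    total = hits + misses
    if total == 0:
        return None  # nothing was looked up in this interval
    return hits / total


print(hit_ratio(300, 100))
print(hit_ratio(0, 0))
```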

Negative Answer Cache:
  negcache-entries: Current number of entries in the Negative answer cache
  [shouldn't there be a negcache-size as well?]

MThreads:
  concurrent-queries
    Current number of MThreads running
  max-mthread-stack
    maximum amount of thread stack ever used

Process Stats:
  CPU time for PowerDNS
    sys-msec: CPU milliseconds spent in 'system' mode
    user-msec: CPU milliseconds spent in 'user' mode
  uptime
    Wall Time seconds since the recursor was started
    [this is plotted wrong in the sample PDF]
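
The three process values can be combined into a CPU utilisation
figure. A sketch under the assumption that sys-msec and user-msec are
cumulative milliseconds and uptime is in seconds, as described above;
the function name and the sample numbers are invented:

```python
def cpu_fraction(before, after):
    """CPU time used divided by wall time elapsed between two samples.

    'before' and 'after' are dicts holding sys-msec, user-msec and
    uptime. Returns None if no wall time has passed.
    """
    cpu_msec = (after["sys-msec"] - before["sys-msec"]) + (
        after["user-msec"] - before["user-msec"]
    )
    wall_msec = (after["uptime"] - before["uptime"]) * 1000
    if wall_msec <= 0:
        return None  # same sample twice, or the recursor was restarted
    return cpu_msec / wall_msec


first = {"sys-msec": 100, "user-msec": 300, "uptime": 60}
second = {"sys-msec": 200, "user-msec": 600, "uptime": 64}
print(cpu_fraction(first, second))
```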

Throttle:
  throttled-out: throttled outgoing UDP queries
  throttle-entries: Current number of entries in the throttle map
  [is there also a value for throttled outgoing TCP queries? How does
  the magnitude of these values usually correspond? Does it make sense
  to plot both into the same graph?]

Invalidations:
  nsset-invalidations
    nssets dropped because they stopped working



I need more explanation for these values:
  chain-resends: number of queries chained to existing outstanding query
    number of [in|out]queries chained (what?) to existing
    outstanding [in|out]queries?
  
  client-parse-errors: counts number of client packets that could not be parsed
    How often is this supposed to happen, and in which circumstances
    short of attacks or network errors?

  EDNS:
    edns_ping_matches:
    edns_ping_mismatches: does this relate to
      draft-hubert-ulevitch-edns-ping-01? I guess it counts outgoing
      EDNS pings and the corresponding answers. When does the recursor
      send out a ping?
    noedns_outqueries: is that a special kind of outquery? Is "no" the
      opposite of "yes" or short for "number of"?

  nsspeeds:
    nsspeeds-entries: entries in the NS speeds map
    [this is not mentioned outside the statistics chapter at all]

  spoof-prevents
    number of times PowerDNS considered itself spoofed, and dropped the data
    [does this relate to Inresponses?]



Thanks for reading through this, and thanks for your answers.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 3221 2323190


