[Pdns-users] pdns_recursor stops getting queries on Solaris 10 sparc

Alex Kiernan alex.kiernan at gmail.com
Wed Sep 12 20:35:27 UTC 2007


On 12/09/2007, Jan Gyselinck <pdns-users at lists.b0rken.net> wrote:
> On Wed, Sep 12, 2007 at 02:18:59PM +0100, Alex Kiernan wrote:
> > I ran into this problem on a live box, so I ended up backing out and
> > going back to bind, but I've grabbed a set of queries which reproduce
> > the problem (eventually).
> >
> > When it stops doing stuff, it looks like it's not getting new queries:
> >
> > port_getn(7, 0x0012D108, 1024, 1, 0xFFBEF44C)   = 0 [62]
> > port_getn(7, 0x0012D108, 1024, 1, 0xFFBEF44C)   = 0 [62]
> > port_getn(7, 0x0012D108, 1024, 1, 0xFFBEF44C)   = 0 [62]
> > port_getn(7, 0x0012D108, 1024, 1, 0xFFBEF44C)   = 0 [62]
> > port_getn(7, 0x0012D108, 1024, 1, 0xFFBEF44C)   = 0 [62]
> > port_getn(7, 0x0012D108, 1024, 1, 0xFFBEF44C)   = 0 [62]
> > port_getn(7, 0x0012D108, 1024, 1, 0xFFBEF44C)   = 0 [62]
> >
> > But it seems like it only happens after ~250K queries. I'm pushing my
> > queries at it using UDP (using perl ParaDNS). Once it has given up,
> > it's only the UDP queries which break - TCP still works.
> >
> > Any pointers where to start looking?
>
> I've bugged Sun about it, all they did was point to Bert though ;-)
> I only see this happening when using the fork option, and then again
> only when getting a lot of queries.  It happens every couple of weeks,
> sometimes it runs for a couple of months even.  Restarting often doesn't
> change a thing; it looks very much like a race condition, so the fewer
> queries, the less chance you'll see it (in my experience).
>

I made these changes earlier today. It's been running for ~6 hours
now, has answered 5.3M queries, and still hasn't hung:

Index: portsmplexer.cc
===================================================================
RCS file: /cvsroot/upstream/pdns-recursor/portsmplexer.cc,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 portsmplexer.cc
--- portsmplexer.cc     12 Nov 2006 16:56:13 -0000      1.1.1.1
+++ portsmplexer.cc     12 Sep 2007 20:29:12 -0000
@@ -89,14 +89,14 @@
   unsigned int numevents=1;
   int ret= port_getn(d_portfd, d_pevents.get(), min(PORT_MAX_LIST, s_maxevents), &numevents, &timeout);

-  gettimeofday(now,0);
-
   if(ret < 0 && errno!=EINTR && errno!=ETIME)
     throw FDMultiplexerException("completion port_getn returned error: "+stringerror());

-  if((ret < 0 && errno==ETIME) || numevents==0) // nothing
+  if((ret < 0 && errno==EINTR) || numevents==0) // nothing
     return 0;

+  gettimeofday(now,0);
+
   d_inrun=true;

   for(unsigned int n=0; n < numevents; ++n) {

The gettimeofday move is clearly wrong, as it changes the API: now is
no longer set when we return early (I moved it just in case it was
stamping on errno). The other change is, I think, the significant one:
if we get EINTR, we want to do nothing and go back around the event
loop; if we get ETIME, then numevents is valid (and presumably could be
1), if my reading of the man pages (and some brief inspection of the
source behind them) is correct. It also seems to match more closely
what the other mplexer implementations do.
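
As a minimal sketch of the decision logic described above (not the
actual pdns code): eventsToProcess is a hypothetical helper, and the
errno value is passed in explicitly as err so the logic can be
exercised without a Solaris event port. It assumes, per the reading of
the man page given above, that numevents is valid when port_getn fails
with ETIME.

```cpp
#include <cerrno>
#include <stdexcept>

// Hypothetical helper, for illustration only: given the return value of
// port_getn(), the errno it left behind (passed in as err), and the
// numevents value it wrote back, decide how many events to dispatch.
unsigned int eventsToProcess(int ret, int err, unsigned int numevents)
{
  // Any failure other than an interrupted or timed-out call is a real error.
  if (ret < 0 && err != EINTR && err != ETIME)
    throw std::runtime_error("port_getn returned error");

  // Interrupted by a signal, or nothing was returned:
  // do nothing and go back around the event loop.
  if ((ret < 0 && err == EINTR) || numevents == 0)
    return 0;

  // Success, or ETIME with numevents > 0 (assumed valid on timeout):
  // these events still need to be dispatched.
  return numevents;
}
```

With this shape, a timeout that still delivered one event is dispatched
rather than dropped, which is the case the original ETIME check would
have skipped.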

I'm going to put the gettimeofday call back where it was and start
running tests again.

-- 
Alex Kiernan

