[Pdns-users] pdns_recursor stops getting queries on Solaris 10 sparc

Alex Kiernan alex.kiernan at gmail.com
Fri Sep 14 12:51:19 UTC 2007


On 14/09/2007, Alex Kiernan <alex.kiernan at gmail.com> wrote:
> On 14/09/2007, Alex Kiernan <alex.kiernan at gmail.com> wrote:
> > On 13/09/2007, bert hubert <bert.hubert at netherlabs.nl> wrote:
> > > On Thu, Sep 13, 2007 at 03:16:12PM +0100, Alex Kiernan wrote:
> > > > I've run another 8M queries through it in a test environment, and put
> > > > it onto a live box where it's been up for 5 hours, answered over a
> > > > million queries and I've not seen a problem, so I'm hoping that this
> > > > is the right fix. Certainly I've not managed to see it running live
> > > > for anything like this long previously.
> > >
> > > Alex,
> > >
> > > I've committed a slightly different patch which appears to work on our
> > > T2000, but can you verify?
> > >
> >
> > I'll give it a go - from a quick look it looks like it differs only in
> > the handling of ETIME, which given we're only retrieving a single
> > event, I don't think the case of (ret==-1, errno==ETIME, numevents==1)
> > can occur, but I'll be honest it's not a hole I'd like to leave.
>
> Gave up almost immediately... :(
>
> I have to admit to being a bit surprised - I'll dig some more.
>

Prepare to be surprised... I added instrumentation so the code looked like this:

  int ret= port_getn(d_portfd, d_pevents.get(), min(PORT_MAX_LIST, s_maxevents),
                     &numevents, &timeout);

  int e = errno;
  if (ret !=0) {
    L<<Logger::Error<<"1:ret="<<ret<<",errno="<<e<<",numevents="<<numevents<<endl;
  }
  errno = e;
  gettimeofday(now,0);
  e = errno;
  if (ret !=0) {
    L<<Logger::Error<<"2:ret="<<ret<<",errno="<<e<<",numevents="<<numevents<<endl;
  }
  errno = e;

  if(ret < 0) {

Then I set my tests running. Initially I see a few messages like:

Sep 14 12:21:15 1:ret=-1,errno=62,numevents=0
Sep 14 12:21:15 2:ret=-1,errno=62,numevents=0

(errno 62 is ETIME), then it all goes silent as the workload gets pushed
up. Then, when it all goes wrong, I see:

Sep 14 12:35:34 1:ret=-1,errno=62,numevents=2
Sep 14 12:35:34 2:ret=-1,errno=62,numevents=2

i.e. the timer expired with two events to process, but the currently
committed code doesn't handle those two events because it got an ETIME
(this is certainly a bizarre API - I can't think of another UNIX API
where ret == -1, errno == E... means partial success). It looks like
that's what the man page means by "desired" - you can still get up to
max delivered (and a timeout at the same time!).
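
To make the consequence concrete, here is a minimal sketch of the decision
the caller has to make after port_getn() returns. It is not the committed
PowerDNS code: eventsToProcess() is a hypothetical helper name, and treating
EINTR as "nothing to do" is my assumption. The only thing taken from the
logs above is that (ret==-1, errno==ETIME) can arrive with numevents > 0,
and that those events still need handling.

```cpp
#include <cerrno>

// Hypothetical helper, for illustration only (not from the PowerDNS source).
// Given the (ret, errno, numevents) triple observed after port_getn(),
// return how many delivered events should be processed, or -1 for a
// genuine error.
int eventsToProcess(int ret, int err, unsigned int numevents)
{
  if (ret == 0)
    return static_cast<int>(numevents);  // plain success

  if (err == ETIME)
    return static_cast<int>(numevents);  // timed out, but events may still
                                         // have been delivered: handle them

  if (err == EINTR)
    return 0;                            // assumption: interrupted, retry later

  return -1;                             // anything else is a real error
}
```

The caller would save errno immediately after port_getn() (as in the
instrumented code above, since gettimeofday() or logging may clobber it),
pass it in as err, and then walk that many entries of the event array
instead of bailing out whenever ret < 0.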

-- 
Alex Kiernan

