[Pdns-users] TCP Queries stop - can only fix with restart?

Tue Oct 18 04:55:20 UTC 2005

Greetings,

We have two servers with identical configurations, both running
PowerDNS as an authoritative name server, fetching its data from the
stock MySQL tables provided. Each nameserver box has its own MySQL
server running locally, which has data replicated to it from a master
MySQL server elsewhere in the mix.

Each box is also acting as a recursive server using PowerDNS's internal
recursion server. PowerDNS is listening on about 250 IPs on both boxes,
for both TCP and UDP queries.

Oct 17 22:51:36 ns1 pdns[5800]: UDP server bound to xxx.xxx.xxx.xxx:53
<250-some IPs>
Oct 17 22:51:35 ns1 pdns[5800]: UDP server bound to 127.0.0.1:53
Oct 17 22:51:36 ns1 pdns[5800]: TCP server bound to xxx.xxx.xxx.xxx:53
<250-some IPs>
Oct 17 22:51:36 ns1 pdns[5800]: TCP server bound to 127.0.0.1:53
Oct 17 22:51:36 ns1 pdns[5800]: Set effective group id to 407
Oct 17 22:51:36 ns1 pdns[5800]: Set effective user id to 1001
Oct 17 22:51:36 ns1 pdns[5800]: DNS Proxy launched, local port 33484, remote 127.0.0.1:5300
Oct 17 22:51:36 ns1 pdns[5800]: Creating backend connection for TCP
Oct 17 22:51:36 ns1 pdns[5800]: Master/slave communicator launching

It all starts fine, but every couple of days TCP queries (both
authoritative and recursive) seem to stop working, while UDP queries
keep working fine. The log shows the following error:

Oct 17 23:02:10 ns1 pdns[5800]: TCP nameserver had error, cycling backend:EOF trying to get length of answer from remote TCP server
Oct 17 23:02:21 ns1 pdns[5800]: TCP server is without backend connections, launching

At least I think that error has something to do with it.

Simply restarting PowerDNS seems to make the issue go away, but that
can't be the proper solution for this.
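
A probe along these lines might help pin down the moment TCP stops
answering while UDP still works. This is only a sketch: dns_ok and
probe_dns are hypothetical helper names, the server address and query
name are placeholders, and it assumes dig is installed and exits
non-zero when no reply arrives.

```sh
# Probe one nameserver over UDP and TCP and report which transport answers.
# Assumption: dig returns a non-zero exit status when the server does not reply.

dns_ok() {
    # $1 = transport flag (+tcp or +notcp), $2 = server, $3 = name to query
    dig "$1" +time=3 +tries=1 +short @"$2" "$3" >/dev/null 2>&1
}

probe_dns() {
    # $1 = server IP, $2 = name to query
    udp=FAIL
    tcp=FAIL
    dns_ok +notcp "$1" "$2" && udp=ok
    dns_ok +tcp "$1" "$2" && tcp=ok
    echo "server=$1 udp=$udp tcp=$tcp"
}

# Example (placeholder address and name):
#   probe_dns 127.0.0.1 example.com
```

Run periodically against each listening address, the output could be
logged and lined up with the pdns log entries above.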

When the server stopped answering TCP queries, I ran netstat -an. In
condensed form, this was the result:

- 1039 total TCP connections at the time
- 874 of them were in CLOSE_WAIT
- 165 of them were ESTABLISHED
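
A small filter like the following could produce that kind of per-state
summary directly. It is a minimal sketch: tcp_state_counts is a
hypothetical name, and it assumes Linux netstat output (without -p)
where the state is the sixth column.

```sh
# Summarise TCP socket states from `netstat -ant`-style output on stdin.
# Prints one "STATE count" line per state, sorted by state name.
tcp_state_counts() {
    awk '$1 ~ /^tcp/ { n[$6]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage:
#   netstat -ant | tcp_state_counts
```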

By contrast, this is all that netstat -an shows when I run it while the
server is working properly:

tcp 34     0 127.0.0.1:5300          127.0.0.1:35734         CLOSE_WAIT
tcp 0      0 xxx.xxx.xxx.xxx:53      xxx.xxx.xxx.xxx:19730   ESTABLISHED
tcp 0      0 xxx.xxx.xxx.xxx:53      xxx.xxx.xxx.xxx:1943    TIME_WAIT
tcp 0      0 xxx.xxx.xxx.xxx:53      xxx.xxx.xxx.xxx:1194    ESTABLISHED
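
CLOSE_WAIT means the remote end has closed the connection but the local
process has not yet closed its socket, so it may help to see which
local address and port those sockets belong to (port 53 for pdns, port
5300 for the recursor). A minimal sketch, assuming the Linux netstat
column layout shown above; close_wait_by_local is a hypothetical name.

```sh
# Count CLOSE_WAIT sockets per local address:port from `netstat -ant` output
# on stdin. Prints "count local-address" lines, busiest endpoint first.
close_wait_by_local() {
    awk '$1 ~ /^tcp/ && $6 == "CLOSE_WAIT" { n[$4]++ }
         END { for (a in n) print n[a], a }' | sort -rn
}

# Usage:
#   netstat -ant | close_wait_by_local | head
```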

Has anyone encountered anything like this before? Anyone have any ideas
on how to fix it? My boss is going nuts and so am I trying to figure 
this out! :)

Some Background Info:
=====================
OS: Gentoo 2005.1 (emerge --sync as of a few days ago)
KERNEL: 2.6.12-gentoo-r9
RAM: 1GB
CPU:  Intel(R) Xeon(TM) CPU 3.06GHz
Using NPTL, but not NPTLONLY

vmstat output when functioning properly:
========================================
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0   2436 139072 213296 271544    0    0     0     3    3     7  2  1 93  4
 0  0   2436 139072 213296 271612    0    0     0    44 1106   498  2  1 98  1
 0  0   2436 139072 213296 271612    0    0     0    52 1114   438  1  1 99  0
 1  0   2436 139048 213296 271612    0    0     0  1956 1223   993  3  2 94  2
 0  0   2436 139048 213296 271612    0    0     0  1712 1238   701  1  1 85 12
 0  0   2436 139048 213296 271612    0    0     0    64 1189   693  2  1 97  0
 0  1   2436 138676 213296 271612    0    0     0  3612 1309   842  6  3 82  8
 0  0   2436 138676 213296 271680    0    0     0    68 1211   546  1  1 93  5
 0  0   2436 138676 213296 271680    0    0     0    40 1180   500  1  1 97  0
 0  0   2436 138692 213296 271748    0    0     0    44 1237   686  1  1 98  0
 0  0   2436 138692 213296 271748    0    0     0   416 1177   546  1  1 97  1
 0  0   2436 138692 213296 271748    0    0     0    40 1201   459  1  2 98  0
 0  0   2436 138692 213296 271748    0    0     0    44 1225   862  2  2 97  0
 0  0   2436 138708 213296 271748    0    0     0    48 1239   864  2  1 97  0
 0  0   2436 138708 213296 271748    0    0     0    52 1168   737  1  1 97  0
 2  0   2436 138708 213296 271748    0    0     0    48 1162   711  1  2 97  0

iptables config
===============
#!/bin/sh

iptables=/sbin/iptables

# Flush all tables
#
$iptables -t nat -F
$iptables -t mangle -F
$iptables -t filter -F

# Delete all user defined tables
#
$iptables -X

# Set default chain policies
#
$iptables -P INPUT DROP
$iptables -P FORWARD DROP
$iptables -P OUTPUT ACCEPT

# Allow all to/from loopback interface
#
$iptables -A INPUT -i lo -j ACCEPT
$iptables -A OUTPUT -o lo -j ACCEPT

# Allow all to/from internal network interface
#
$iptables -A INPUT -i eth1 -j ACCEPT
$iptables -A OUTPUT -o eth1 -j ACCEPT

# Allow all established and/or related connections
#
$iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
$iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow in
#
#$iptables -A INPUT -i eth2 -p tcp --dport 22 -m state --state NEW -j ACCEPT
$iptables -A INPUT -i eth2 -p udp --dport 53 -m state --state NEW -j ACCEPT
$iptables -A INPUT -i eth2 -p tcp --dport 53 -m state --state NEW -j ACCEPT
$iptables -A INPUT -i eth2 -p tcp --dport 873 -m state --state NEW -j ACCEPT

# Allow out
#
$iptables -A OUTPUT -o eth2 -p tcp --dport 22 -m state --state NEW -j ACCEPT
$iptables -A OUTPUT -o eth2 -p udp --dport 53 -m state --state NEW -j ACCEPT
$iptables -A OUTPUT -o eth2 -p tcp --dport 873 -m state --state NEW -j ACCEPT

/etc/sysctl.conf
================
net.ipv4.tcp_keepalive_time = 120
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_no_metrics_save = 1
net.core.netdev_max_backlog = 2500

I also ran this:
ifconfig eth2 txqueuelen 1000

/etc/recursor.conf
==================
setuid=nobody
setgid=nobody
quiet=on
local-address=127.0.0.1
local-port=5300
max-tcp-clients=1024
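
The 1039 TCP connections counted during the outage are close to the
max-tcp-clients=1024 value above. Whether that is related is only a
guess, and the netstat summary does not say how many of those sockets
were on port 5300, but the comparison could be sketched as follows.
conf_get and tcp_clients_on_port are hypothetical helper names, and the
sketch assumes plain key=value lines with no spaces around the equals
sign, as in the config shown here.

```sh
# Print the value of a key=value setting from a pdns-style config file.
# $1 = setting name, $2 = config file; the last matching line wins.
conf_get() {
    awk -F= -v key="$1" '
        $1 == key { v = substr($0, length(key) + 2) }
        END { if (v != "") print v }' "$2"
}

# Count non-listening TCP sockets whose local port is $1,
# reading `netstat -ant` output on stdin.
tcp_clients_on_port() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $6 != "LISTEN" && $4 ~ (":" port "$") { c++ }
        END { print c + 0 }'
}

# Usage:
#   limit=$(conf_get max-tcp-clients /etc/recursor.conf)
#   used=$(netstat -ant | tcp_clients_on_port 5300)
#   echo "recursor TCP sockets: $used of $limit"
```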

/etc/pdns.conf
==============
cache-ttl=60
daemon=yes
distributor-threads=10
launch=gmysql
gmysql-host=localhost
gmysql-user=xxxx
gmysql-password=xxxxxxx
gmysql-dbname=xxxxxxx
local-address=<INTERNAL IP>,<250-some external IPs, comma-separated>
log-dns-details=yes
log-failed-updates=yes
logfile=pdns.log
logging-facility=0
loglevel=3
master=yes
query-cache-ttl=60
query-logging=yes
recursor=127.0.0.1:5300
setgid=407
setuid=1001
webserver=no

Thanks,
Matt Gibson