Friday, March 12, 2010

NATs Maxing Out

During our investigation of the LHCb failures we noticed that the conntrack entries on our two NAT hosts were being completely used up, i.e. all 32768 of them! Looking at /proc/net/ip_conntrack, we saw that most of the connections were UDP DNS lookups by Camont jobs. We also noticed that we had never changed the timeouts from their original values: 43200 seconds for established TCP connections and 3600 seconds for UDP. With every DNS lookup holding a table entry for an hour, this was probably the reason the entries were being used up. So we have tweaked the timeouts and increased the maximum.
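For anyone wanting to repeat that kind of check, here is a minimal sketch of how the table could be summarised by protocol and destination port. It is not the exact procedure we used; the function name `summarise_conntrack` and the sample lines (with documentation-range addresses) are made up for illustration, and it assumes the usual /proc/net/ip_conntrack layout where the first field is the protocol name and the first `dport=` token belongs to the original direction of the flow.

```python
from collections import Counter


def summarise_conntrack(lines):
    """Count conntrack entries by (protocol, original destination port).

    `lines` is any iterable of text lines in /proc/net/ip_conntrack format.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        proto = fields[0]  # e.g. "tcp" or "udp"
        # The first dport= token describes the original (outgoing) direction.
        dport = next(
            (f.split("=", 1)[1] for f in fields if f.startswith("dport=")),
            None,
        )
        counts[(proto, dport)] += 1
    return counts


# Hypothetical sample entries, for illustration only (not taken from our hosts).
SAMPLE = """\
udp      17 3541 src=10.0.0.21 dst=192.0.2.53 sport=40112 dport=53 src=192.0.2.53 dst=198.51.100.7 sport=53 dport=40112 use=1
udp      17 3390 src=10.0.0.22 dst=192.0.2.53 sport=51870 dport=53 src=192.0.2.53 dst=198.51.100.7 sport=53 dport=51870 use=1
tcp      6 43150 ESTABLISHED src=10.0.0.23 dst=203.0.113.9 sport=45001 dport=443 src=203.0.113.9 dst=198.51.100.7 sport=443 dport=45001 [ASSURED] use=1
""".splitlines()

if __name__ == "__main__":
    for (proto, dport), n in summarise_conntrack(SAMPLE).most_common():
        print(proto, dport, n)
```

On a NAT host the same function could be fed the open file instead of the sample, e.g. `summarise_conntrack(open("/proc/net/ip_conntrack"))`. A large count for `("udp", "53")` would point at DNS lookups, as it did for us.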
Our new NAT settings look like this (the original values were 43200, 32768 and 3600 respectively):
net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 21600
net.ipv4.netfilter.ip_conntrack_max = 65536
net.ipv4.netfilter.ip_conntrack_udp_timeout = 30
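To get a feel for why the UDP timeout matters so much, here is a rough back-of-the-envelope sketch. It rests on a deliberately simple assumption that is ours, not a measurement: every lookup creates one entry that sits in the table for the full timeout and is never refreshed, so the steady-state occupancy is just arrival rate times timeout. The function names are hypothetical.

```python
def steady_state_entries(new_flows_per_second, timeout_seconds):
    """Approximate table occupancy: arrival rate times entry lifetime."""
    return new_flows_per_second * timeout_seconds


def max_sustainable_rate(table_size, timeout_seconds):
    """Highest steady rate of new flows before the table fills up."""
    return table_size / timeout_seconds


if __name__ == "__main__":
    # Original settings: 32768 entries, 3600 s UDP timeout.
    print(round(max_sustainable_rate(32768, 3600), 1))  # about 9.1 flows/s
    # New settings: 65536 entries, 30 s UDP timeout.
    print(round(max_sustainable_rate(65536, 30), 1))    # about 2184.5 flows/s
```

Under that assumption, the old settings could be exhausted by fewer than ten new UDP flows per second across all jobs behind a NAT host, while the new ones leave room for a couple of thousand. Real traffic is messier (TCP entries, refreshed flows), so treat the numbers as an order-of-magnitude guide only.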

Now our NATs look much healthier. The only problem: it didn't help with LHCb production jobs not being able to upload their results back to CERN. Back to the drawing board.
