Wednesday, January 14, 2009

DNS goes wibble wobble...

Various funny things were happening today:
  • General sickness in the ATLAS pilot factory.
  • Quite a few BDII dropouts.
  • SAM test failures from the above.
  • Sluggish clients on our UIs.
  • Very slow logins from CERN.
These all pointed towards slow or failing DNS. I wrote a little test script that made 30 forward/reverse DNS queries: it took 20-50s on some servers and 0.5s on others.
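A timing check along these lines can be sketched in Python. This is not the original script, just a minimal illustration: the function name `time_lookups` and the placeholder host list are assumptions, and a real run would use the names of actual grid services.

```python
import socket
import time


def time_lookups(hostnames, forward=socket.gethostbyname,
                 reverse=socket.gethostbyaddr):
    """Do a forward then a reverse lookup for each name.

    Returns (elapsed_seconds, failures). The resolver functions are
    parameters so they can be swapped out for testing.
    """
    failures = 0
    start = time.perf_counter()
    for name in hostnames:
        try:
            addr = forward(name)   # name -> IPv4 address
            reverse(addr)          # address -> (hostname, aliases, addresses)
        except OSError:
            # socket.gaierror and socket.herror are both OSError subclasses
            failures += 1
    return time.perf_counter() - start, failures


if __name__ == "__main__":
    # Placeholder list (assumption): substitute 30 real hostnames here.
    hosts = ["localhost"] * 30
    elapsed, failed = time_lookups(hosts)
    print(f"{len(hosts)} forward/reverse lookups: "
          f"{elapsed:.2f}s, {failed} failed")
```

Running the same script on a node that uses the cache and on one that does not gives a quick side-by-side comparison of resolver latency.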

The slow ones had been configured to use a dnsmasq cache on our headnode which, for unknown reasons, was responding very slowly (even a restart did not help).

I reconfigured them to bypass the cache and suddenly all was rosy again across the cluster.

Curiously, we had added the cache to overcome problems with campus DNS in the first place.

At least with things configured via cfengine, this is a very easy change to make right across the cluster.

