Friday, October 05, 2007

knotty nat knowledge

Hmm, that's odd: why aren't the NAT boxes visible on ganglia?
Seemed a simple enough problem - they used to be there, but for some reason fell off the plots in late August.

Had the boxes restarted and failed to start gmond? Nope - good uptime. gmond running? Yep. Telnet to the gmond port? Yep. Hmm. But all that came back was:

<CLUSTER NAME="NAT Boxes" ... >
</CLUSTER>

and no HOST or METRIC lines between them. Most odd. After some discussion with Dr Millar, it turned out to be a probable issue with the Linux multicast setup - the kernel wasn't choosing the same interface to listen and send on. Luckily this was patched in a newer version of ganglia - the config file now supports the mcast_if parameter, which lets you set the multicast interface explicitly (in our case, to the internal interfaces).
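For anyone wanting to script the "telnet to the gmond port and look for HOST lines" check rather than eyeball it, here is a rough sketch in Python. It assumes gmond is listening on its default TCP port 8649 and dumps its XML to whoever connects; the function names and the hostname in the usage comment are made up for illustration, not anything from our setup.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the XML dump that gmond writes to a client that connects.

    Assumes gmond's default TCP port (8649); adjust if yours differs.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def hosts_per_cluster(xml_bytes):
    """Map each CLUSTER name to the number of HOST elements inside it."""
    root = ET.fromstring(xml_bytes)
    return {
        cluster.get("NAME"): len(cluster.findall("HOST"))
        for cluster in root.iter("CLUSTER")
    }


# Hypothetical usage (the hostname is a placeholder):
#   counts = hosts_per_cluster(fetch_gmond_xml("natbox01.example.org"))
#   for name, n in counts.items():
#       print(name, n)
```

A cluster that reports zero hosts, like our "NAT Boxes" one did, is the symptom described above: gmond is up and answering, but it isn't hearing any multicast metrics, not even its own.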

Sadly, of course, the out-of-the-box RPM doesn't install on SL4 x86_64 - it has unmet dependencies (as normal...) - so a quick compile on one of the worker nodes and some dirty hackery copying the binary over worked a treat. We now have NAT box stats again.
