Hmm, that's odd, why aren't the NAT boxes visible on ganglia?
Seemed a simple enough problem - they used to be there, but for some reason fell off the plots late August.
Had the boxes restarted and failed to start gmond? Nope - good uptime. gmond running? Yep. Telnet to the gmond port? Yep, it answered, but all the XML it returned for them was:
<CLUSTER NAME="NAT Boxes" ... >
</CLUSTER>
and no HOST or METRIC lines between them. Most odd. After some discussion with Dr Millar it turned out to be a probable issue with the Linux multicast setup - the kernel wasn't choosing the same interface to listen and send on. Luckily this was patched in a newer version of ganglia - the config file supports the mcast_if parameter, which sets the multicast interface explicitly (in our case to the internal ones).
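For anyone wondering what "choosing the interface" means at the socket level, here is a rough Python sketch of pinning both the sending and the listening side of a multicast socket to one interface. It is not ganglia's code, just an illustration of the idea behind mcast_if. The group address 239.2.11.71 and port 8649 are what I believe the gmond defaults to be, and the function names and the interface IP in the usage note are made up for the example.

```python
import socket

# Assumed gmond defaults; check your own gmond.conf before relying on them.
MCAST_GROUP = "239.2.11.71"
MCAST_PORT = 8649


def membership_request(group, iface_ip):
    """Pack an ip_mreq: 4 bytes of group address, then 4 bytes of interface address."""
    return socket.inet_aton(group) + socket.inet_aton(iface_ip)


def make_sender(iface_ip, ttl=1):
    """UDP socket whose multicast packets leave via the interface owning iface_ip."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    # Without this the kernel picks the outgoing interface from its routing table.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                    socket.inet_aton(iface_ip))
    return sock


def make_receiver(iface_ip, group=MCAST_GROUP, port=MCAST_PORT):
    """UDP socket that joins the group on the interface owning iface_ip."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Naming the interface here keeps the join on the same NIC the sender uses.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group, iface_ip))
    return sock
```

On a dual-homed box you would pass the internal address to both, e.g. make_sender("10.0.0.1") and make_receiver("10.0.0.1") (a hypothetical IP), so that sending and listening can't end up on different interfaces.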
Sadly, of course, the out-of-the-box RPM doesn't install on SL4 x86_64 - it has unmet dependencies (as normal...) - so a quick compile on one of the worker nodes and some dirty hackery copying the binary over worked a treat. We now have NAT box stats again.
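If you'd rather not eyeball telnet output to spot an empty CLUSTER block next time, something like the following Python sketch does the same check. It assumes gmond dumps its XML and closes the connection when you connect to its TCP port (8649 by default, as far as I know); the function names and the hostname in the usage note are invented for the example.

```python
import socket
import xml.etree.ElementTree as ET

GMOND_PORT = 8649  # assumed default gmond XML port


def fetch_gmond_xml(host, port=GMOND_PORT, timeout=5.0):
    """Read the whole XML dump gmond writes before it closes the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def hosts_per_cluster(xml_bytes):
    """Map each CLUSTER NAME to the number of HOST elements directly inside it."""
    root = ET.fromstring(xml_bytes)
    return {cluster.get("NAME"): len(cluster.findall("HOST"))
            for cluster in root.iter("CLUSTER")}


def empty_clusters(xml_bytes):
    """Names of clusters that report no hosts at all (the symptom seen here)."""
    return [name for name, count in hosts_per_cluster(xml_bytes).items()
            if count == 0]
```

Usage would be along the lines of empty_clusters(fetch_gmond_xml("natbox01")), with "natbox01" standing in for one of your own hosts; an empty list back means every cluster has at least one host reporting.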