It's been long overdue on the TODO list, but we finally got Nagios NRPE installed and configured on the worker nodes. We're now checking for locally logged-in users (only sysadmin staff should be logged in locally), high load, process counts, zombies and, most importantly, free disk space.
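For anyone wanting a starting point, the checks above map fairly closely onto the sample commands shipped in NRPE's stock nrpe.cfg. This is only a sketch, not our actual config: the plugin path, thresholds and the check_disk_root name and partition are assumptions and will differ per site.

```cfg
# nrpe.cfg fragment (illustrative; paths and thresholds are assumptions)

# Locally logged-in users
command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10

# Load averages (1, 5 and 15 minute)
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20

# Total process count
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200

# Zombie processes
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z

# Free disk space on / (hypothetical command name)
command[check_disk_root]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /
```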
A few pointers that may help others. 1) cfengine splays for 30 mins, so changes do not reach every node at once. This means that if you enable a check before the plugins have been pushed out to the node, it fills your mailbox. 2) If you normally use
then you'll find your testing runs on ALL worker nodes. Use host_name node001 (or equivalent) when testing new services.
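As a rough illustration of pinning a test service to a single node, a definition on the Nagios server might look like the following. The generic-service template, the check_nrpe command definition and the check_disk_root command name are assumptions here, so substitute whatever your own setup uses.

```cfg
# Test service restricted to one node (illustrative sketch)
define service {
    # "generic-service" is assumed to be an existing service template
    use                  generic-service
    # One node only while testing, rather than every worker node
    host_name            node001
    service_description  NRPE disk free (test)
    # Assumes a check_nrpe command taking the remote command name as $ARG1$
    check_command        check_nrpe!check_disk_root
}
```

Once the check behaves on node001, the host_name line can be swapped back to whatever you normally use to cover all the worker nodes.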
3) cfengine saves you pushing the same config out manually. It also has the nice side effect of automatically restarting nrpe (which is needed for the new config to take effect) when it notices that nrpe.cfg has changed.