hmm. Freudian? I originally typed 'cluster clue' as the title.
Regular readers will be aware that we run both ganglia and cfengine. However, even our wonderful rebuild system (YPF) doesn't quite close off all the holes in the fabric monitoring. Case in point: I reimaged a few machines and noticed that ganglia wasn't quite right. The rebuild had copied in the right gmond.conf for that group of machines but hadn't checked that it was listed in the main gmetad.conf as a data_source.
Cue a short Perl script (soon to be available on the scotgrid wiki) to do a sanity check, but it's this sort of non-joinedupness of all the bits that really annoys me about clusters and distributed systems.
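For anyone wanting the gist of such a check, here is a minimal sketch in Python. It is not the Perl script mentioned above, just an illustration. It assumes that "listed as a data_source" means the cluster name in each gmond.conf should appear as a data_source name in gmetad.conf, and that gmond.conf uses the block-style `cluster { name = "..." }` syntax. The function names and the sample cluster names are made up for the example.

```python
import re
import shlex


def data_sources(gmetad_text):
    """Return the set of data_source names declared in gmetad.conf text."""
    names = set()
    for line in gmetad_text.splitlines():
        line = line.strip()
        if not line.startswith("data_source"):
            continue  # skips blank lines, comments and other directives
        # shlex handles the quoted name and drops trailing '#' comments
        parts = shlex.split(line, comments=True)
        if len(parts) >= 2 and parts[0] == "data_source":
            names.add(parts[1])
    return names


def cluster_name(gmond_text):
    """Return the cluster name from gmond.conf text, or None if absent."""
    # Assumes the block-style syntax: cluster { name = "..." ... }
    match = re.search(r'\bcluster\s*\{[^}]*?\bname\s*=\s*"([^"]*)"', gmond_text)
    return match.group(1) if match else None


def missing_clusters(gmond_texts, gmetad_text):
    """Return sorted cluster names that gmetad.conf does not list."""
    known = data_sources(gmetad_text)
    found = {cluster_name(text) for text in gmond_texts}
    found.discard(None)  # ignore gmond.conf files with no cluster name
    return sorted(found - known)
```

Feed it the contents of each group's gmond.conf plus the main gmetad.conf, and anything it returns is a cluster that gmetad isn't polling.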
Are there any better tools? (Is Quattor the saviour for this type of problem?)