Wednesday, May 07, 2008

It's all gone 'orribly wrong

OK - my bad. I spotted that we'd failed a SAM test yesterday (got a mail from the automated alert) but didn't realise it doesn't send further mails if you keep failing....

Am sure Graeme will post more, but we'd filled up / on the CE (as /var wasn't on a separate partition - it is now, and a nice healthy 30G). Puzzlingly, Nagios hadn't bothered to alert that we'd gone warning at 8% free or critical at 4% free, and was reporting "OK - /0 free"

Case sensitivity in check_disk: -w is the warning threshold for disk space, -W is the warning threshold for inodes. Grr. Typo-tastic. I lowercased the offending config and let cfengine ripple it out. While it did so I noticed cfengine restarting ntpd on the 3 NAT boxes (which also act as the timeservers for the cluster) - somehow it was copying first a standard and then a local /etc/ntp.conf into place on each run, and restarting ntpd, as planned, every time it saw a new config file.
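To make the check_disk threshold mix-up concrete, here's a rough Python sketch of a percent-free check with the 8% warning and 4% critical levels mentioned above. It is not how check_disk itself is implemented; the function names are made up for illustration, and the "unset thresholds always give OK" behaviour is my assumption about what a mistyped flag effectively does to the disk space check.

```python
import shutil


def percent_free(path="/"):
    """Return the percentage of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def classify(pct_free, warn=None, crit=None):
    """Return a Nagios-style state name for a percent-free figure.

    warn and crit are percent-free thresholds. A threshold left as None is
    never tested (assumption: this mimics a space threshold that was never
    set because the option was typed in the wrong case).
    """
    if crit is not None and pct_free < crit:
        return "CRITICAL"
    if warn is not None and pct_free < warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    pct = percent_free("/")
    # Intended check: warning below 8% free, critical below 4% free.
    print(classify(pct, warn=8.0, crit=4.0), "- /", f"{pct:.0f}% free")
```

Calling classify(0.0) with no thresholds returns "OK", which is the flavour of result we were getting, whereas classify(0.0, warn=8.0, crit=4.0) returns "CRITICAL".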

My bad again - we use the class 'natboxes' for the group and I'd specified any.!(master|nat):: which never excluded them, since no host is in a class called 'nat'. Changing it to any.!(master|natboxes):: worked fine - no restarts since, and none of the workernodes are seeing any upstream timeservers on INIT or LOCAL
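For anyone wondering why the wrong class name bites, here's a small Python sketch that evaluates a subset of cfengine-style class expressions ('.' or '&' for AND, '|' for OR, '!' for NOT, brackets) against a set of classes defined on a host. It is my own toy evaluator, not cfengine's parser, and the class name 'workernodes' is just a hypothetical label for the example.

```python
import re

# A token is either a class name or one of the operator characters.
TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[.&|!()])")


def tokenize(expr):
    """Split a class expression into tokens, dropping a trailing '::'."""
    expr = expr.strip().rstrip(":").strip()
    tokens, pos = [], 0
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"unexpected character at {pos} in {expr!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(expr, defined):
    """Return True if expr holds for a host with the given defined classes."""
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_or():  # lowest precedence
        value = parse_and()
        while peek() == "|":
            take()
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() in (".", "&"):
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():  # highest precedence: '!', brackets, class names
        tok = peek()
        if tok == "!":
            take()
            return not parse_not()
        if tok == "(":
            take()
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing ')'")
            take()
            return value
        if tok is None or tok in (".", "&", "|", ")"):
            raise ValueError(f"expected a class name, got {tok!r}")
        take()
        # 'any' is always true; an unknown class is simply false.
        return tok == "any" or tok in defined

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result


natbox = {"natboxes"}
print(evaluate("any.!(master|nat)::", natbox))
print(evaluate("any.!(master|natboxes)::", natbox))
```

On a host in 'natboxes' the first expression comes out true (so the standard ntp.conf rule still applies there) and the second comes out false, which matches what we saw: both the standard and the local file were being copied until the class name was fixed.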

