Saturday, June 14, 2008

Publish and be damned...

It all went pear-shaped yesterday when information publishing fell over on the SE. It seems that when I quickly "fixed" the DPM information provider script to get the correct hostname, I forgot to chomp() the output from "hostname -f". The hostname variable therefore had a trailing newline, which corrupted the provider's output to the information system. The BDII logs started to throw errors like "First line of LDIF entry does not begin with 'dn:' at /opt/glite/libexec/glite-info-generic line 17".
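For anyone who wants to see how one stray newline can do this, here is a minimal sketch in Python rather than in the provider's own language. It is not the real DPM provider: the entry layout, the attribute values and the function names (get_hostname, build_entry, first_lines_ok) are all made up for illustration, and the check only loosely mimics the "does not begin with 'dn:'" complaint.

```python
import subprocess


def get_hostname() -> str:
    """Return the FQDN with the trailing newline removed.

    The .strip() plays the role of Perl's chomp() here.
    """
    result = subprocess.run(
        ["hostname", "-f"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def build_entry(hostname: str) -> str:
    """Build a toy LDIF entry (illustrative layout, not the real provider's)."""
    return (
        f"dn: GlueSEUniqueID={hostname},mds-vo-name=resource,o=grid\n"
        f"GlueSEUniqueID: {hostname}\n"
        "GlueSEName: example\n"
    )


def first_lines_ok(ldif: str) -> bool:
    """Return True if every blank-line-separated entry starts with 'dn:'."""
    entries = [e for e in ldif.split("\n\n") if e.strip()]
    return all(e.startswith("dn:") for e in entries)


if __name__ == "__main__":
    raw = "se.example.org\n"  # what "hostname -f" hands back, newline included
    print(first_lines_ok(build_entry(raw)))          # the unchomped case
    print(first_lines_ok(build_entry(raw.strip())))  # the chomped case
```

In the unchomped case the newline after the hostname leaves a blank line in the middle of the entry, so whatever follows looks like a new entry with no "dn:" line. Running a check like this over the provider output before publishing would be one cheap way to catch the mistake.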

Unfortunately the BDII then considered the whole of the SE information package corrupt (rather than just that provider's output) and our SE promptly disappeared from the information system, with the attendant RM test failures.

This situation then persisted for most of the day until Andrew noticed it "by eye". So we had another failure: Nagios didn't send an alarm properly when we started to fail. If it had, the problem would have been fixed within an hour, but instead we were failing for 8 hours.

From the dizzy heights of SAM perfection we fell to 98% for the month, 95% for the week. It wasn't quite hubris, but it was ironic that I was blogging about Glasgow's reliability at the very moment we were broken.

For the moment I have removed the info provider for tokens, and I will put it back, more carefully, on Tuesday.

