Sunday, October 21, 2007

Glasgow power outage

Ho hum, it's that time of year for HV switchgear checking. A rolling programme of work across campus meant that the building housing the Glasgow cluster was due for an outage at ungodly-o'clock in the morning. We arranged the outage in advance and booked scheduled downtime. All OK. Then, after G had taken some well-deserved hols, I discovered how dreadful the cic-portal is for sending an EGEE broadcast. I want to tell users of the site it's going down. Any chance of this in English? 'RC management'? What is 'RC'? It doesn't explain it. Then who should I notify? Again, no simple descriptions... Grr. Rant..

OK - the system went down cleanly easily enough (pdsh -a poweroff or similar) - bringing it back up? Hmm. First off, the LV switchboard needed resetting manually, so the UPS had a flat battery. Then one of the PDUs decided to restore all its associated sockets to 'On' without waiting to be told (so all the servers leapt into life before the disks were ready). Then the disk servers decided they needed to fsck (it'd been a year since the last one) - slooooow. Oh, and the disk labels on the system disks were screwed up (/1 and /tmp1 rather than / and /tmp, for example) - another manual workaround needed.

Finally we were ready to bring the worker nodes back - just on the 12:00 deadline. I left Graeme still hard at it, but there are a few things we'll need to pull out in a post mortem. I'm sure Graeme will blog some more.
