Monday, August 28, 2006

Glasgow goes down in a cloud of smoke...

Electrical engineers preparing for the new cluster's arrival tomorrow touched one of the bus bars under the floor this afternoon. It promptly exploded in a cloud of oxidised copper. It seems that the plastic housing around the bus bar had become hot and had started to carbonise, shorting the circuit.

They have hacksawed 20cm off the bus bar to cut it back clear of the affected area.

Unfortunately they also managed to damage one of the circuit switches in the main distribution board. This means the 200A supply to the room is down until tomorrow. In addition, tomorrow's repair to the old breaker (and it is old) is only temporary - it will have to be taken down properly and replaced at a later date. More downtime.

In the meantime I managed to move all of the remaining bits of scotgrid-gla onto the unaffected wall sockets and limp back up (lost all the jobs, of course). All the WNs came up without the DHCP server being alive, so they all needed rebooted to pick up their IP addresses. Also discovered that the NAT box (grid01) did not have IP forwarding enabled by default - yet another fiddly bit that's hard to test.
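For what it's worth, the IP forwarding state is at least easy to check after a reboot. Below is a minimal sketch, assuming the NAT box runs Linux and exposes the standard `/proc/sys/net/ipv4/ip_forward` file; the function names are made up for illustration and are not part of any existing tooling on grid01.

```python
from pathlib import Path

# Standard Linux location for the IPv4 forwarding flag (assumes a Linux NAT box).
IP_FORWARD_PATH = Path("/proc/sys/net/ipv4/ip_forward")


def forwarding_enabled(raw: str) -> bool:
    """Interpret the file contents: '1' means forwarding is on, '0' means off."""
    return raw.strip() == "1"


def check_ip_forwarding(path: Path = IP_FORWARD_PATH) -> bool:
    """Read the flag from the given path and report whether forwarding is on."""
    return forwarding_enabled(path.read_text())


if __name__ == "__main__":
    if IP_FORWARD_PATH.exists():
        state = "enabled" if check_ip_forwarding() else "DISABLED"
        print(f"IPv4 forwarding is {state}")
    else:
        print("No ip_forward file found; probably not a Linux host")
```

Run from a boot-time script or a monitoring check, something like this would flag a NAT box that came back up without forwarding. Making the setting persistent is usually done with `net.ipv4.ip_forward = 1` in `/etc/sysctl.conf`.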

Unfortunately, after coming home, I realised that ssh on the CE didn't come back up properly (the current cluster arrangements are somewhat ad hoc, at best). So we're down until I can fix that tomorrow morning.

1 comment:

Graeme Stewart said...

So the ssh issue turned out to be a bad GSSAPI option which had been left in sshd_config! After that we started to run fine.

I am impressed at LHCb - still in downtime, up for less than 15 minutes and they've already got 20 jobs running!