Monday, June 23, 2008

Durham SE Issues

Durham suffered a complete SE failure last week. A RAID card failure took down the old SE, gallows, and then corrupted LVM metadata took out the new disk server, se01.

The list of lost ATLAS files has been reported (https://savannah.cern.ch/bugs/?38037) and we're waiting for the catalog to be cleaned up to restart production here (well, when there are any jobs to run).

We took the opportunity to retire gallows, so se01 is now the sole SE at Durham. It should suffice for ATLAS production, where we only need a few TB of cache anyway.

In the meantime, there was a power outage in the Durham machine room over the weekend. David had to get the university to reset some breakers, but things seem to be running well now.
