Monday, October 22, 2007

Disk Servers and Power Outages

Your starter for 10TB: how long does it take to do an fsck on 10TB of disk? Answer: about 2 hours.

Which in theory was fine - in at 10am, out of downtime by 12. However, things didn't go quite according to plan.

First problem was that disk038-41, which had been set up most recently, had the weird disk label problem, where the labels had been created with a "1" appended to them. Of course, this wouldn't matter, except that we'd told cfengine to control fstab based on the older servers (whose labels have no such suffix), so the new systems could not find their partitions and sat awaiting root intervention. That put those servers back by about an hour.

Secondly, some of the servers were under the impression that they had not checked their disks for ~37 years (mke2fs at the start of the epoch?), so had to be coaxed into doing so by hand, which was another minor hold-up.
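The ~37 year figure is what you would expect if the "last checked" timestamp were zero, i.e. the Unix epoch. A quick sketch of that arithmetic, assuming the downtime was on Sunday 21 October 2007 (inferred from the post date, not stated outright):

```python
from datetime import datetime, timezone

# A "last checked" time of 0 means the start of the Unix epoch.
last_checked = datetime.fromtimestamp(0, tz=timezone.utc)

# Assumed date of the downtime: the Sunday before this post.
downtime = datetime(2007, 10, 21, tzinfo=timezone.utc)

days_since_check = (downtime - last_checked).days
years_since_check = days_since_check / 365.25

print(f"{days_since_check} days, about {years_since_check:.1f} years")
```

That comes out at a little under 38 years, which fits the servers' complaint.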

Third, I had decided to convert all the file systems to ext3, to protect them in the case of power outages. It turns out that making a journal for a 2TB filesystem actually takes about 5 minutes - so 25 minutes for each set of 5.
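For planning purposes, the conversion can be sketched as below. This assumes the journal is added with `tune2fs -j` (the usual way to turn ext2 into ext3, though the post does not say which command was used) and that the filesystems are done one after another. The device names are hypothetical; the commands are only built, not run.

```python
MINUTES_PER_JOURNAL = 5  # observed above for a 2TB filesystem

# Hypothetical device names for one server's set of 5 filesystems.
devices = ["/dev/sdb1", "/dev/sdc1", "/dev/sdd1", "/dev/sde1", "/dev/sdf1"]


def journal_command(device):
    """Build (but do not run) the command that adds an ext3 journal."""
    return ["tune2fs", "-j", device]


commands = [journal_command(d) for d in devices]
total_minutes = MINUTES_PER_JOURNAL * len(devices)

for cmd in commands:
    print(" ".join(cmd))
print(f"Estimated time if run sequentially: {total_minutes} minutes")
```

Running the conversions in parallel might well be quicker, but that was not tried here.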

And, last but not least, the machines needed new kernels, so had to go through a final reboot cycle before they were ready.

The upshot was that we were 15 minutes late with the batch system (disk037 was already running ext3, fortunately), but an hour and 15 minutes late with the SRM. I almost did an EGEE broadcast, but in the end, who's listening Sunday lunchtime? It would just have been more mailbox noise for Monday, and irrelevant by that time anyway.

As the SRMs fill up, of course, disk checks will take longer, so next time I would probably allow 4 hours for fscking, if a check looks likely.

A few other details:

* The fstab for the disk servers now names partitions explicitly, so no dependence on disk labels.
* The BDII going down hit Durham and Edinburgh with RM failures. Ouch! We should have seen that one coming.
* All the large (now) ext3 partitions have been set to check every 10 mounts or 180 days. The older servers actually had ~360 days of uptime, so if this is an annual event then doing an fsck once a year should be ok.
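As a rough illustration of the check policy in the last point: `tune2fs` takes `-c` for the maximum mount count and `-i` for the interval between checks. The sketch below builds that command for a hypothetical device and models, in a simplified way, when a check would be forced at boot; the real decision is made by e2fsck and has more cases than this.

```python
MAX_MOUNTS = 10
INTERVAL_DAYS = 180


def check_policy_command(device):
    """Build (but do not run) the tune2fs call setting the check policy."""
    return ["tune2fs", "-c", str(MAX_MOUNTS), "-i", f"{INTERVAL_DAYS}d", device]


def check_due(mounts_since_check, days_since_check):
    """Simplified model: a check is forced once either limit is reached."""
    return mounts_since_check >= MAX_MOUNTS or days_since_check >= INTERVAL_DAYS


# /dev/sdb1 is a hypothetical device name.
print(" ".join(check_policy_command("/dev/sdb1")))

# A server rebooted after ~360 days of uptime is past the 180 day interval.
print(check_due(mounts_since_check=1, days_since_check=360))
```

So with roughly a year between reboots, the 180 day interval rather than the mount count is what triggers the fsck.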
