Monday, October 08, 2007

RB Corrupts its Database

Our RB (svr023) seriously died today.

I was alerted to a problem by a pheno user who was having trouble submitting jobs. When I checked the RB I found that the root partition was full, choked by massive /var/log/wtmp files and by an extremely large RB database in /var/lib/mysql.
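A check along the following lines would have flagged the full root partition before users noticed. This is only a minimal sketch using Python's standard `shutil.disk_usage`; the function names, the 90% threshold and the list of paths are assumptions for illustration, not anything that was running on svr023.

```python
import shutil


def usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a zero size; treat them as empty.
        return 0.0
    return usage.used / usage.total


def partitions_over(paths, threshold=0.9):
    """Return the paths whose filesystem is fuller than threshold."""
    return [p for p in paths if usage_fraction(p) > threshold]


if __name__ == "__main__":
    # Hypothetical watch list; adjust to the partitions that matter.
    for path in partitions_over(["/", "/var"], threshold=0.9):
        print(f"WARNING: filesystem holding {path} is over 90% full")
```

Run from cron, something like this could mail a warning while there is still room to act.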

When I reinstalled the RB two weeks ago I made sure that /var/edgwl was in a large disk area, but having more than 4GB filled up in less than two weeks by the other denizens of /var was completely unexpected.

The emergency procedure was then to move /var/log and /var/lib/mysql over to the /disk partition, leaving soft links at the old locations pointing to the new ones.
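The move-and-link step can be sketched as below. This is an illustration under assumptions rather than the commands actually run on svr023: the helper name `relocate_with_symlink` is made up, and it presumes that the services writing to the directory (mysqld in particular) have been stopped first. Note also that when `shutil.move` has to copy across filesystems, file ownership is not preserved, so it may need restoring afterwards.

```python
import os
import shutil


def relocate_with_symlink(src, dest_parent):
    """Move directory src under dest_parent and leave a symlink at src.

    Returns the new location of the directory.
    """
    src = os.path.abspath(src)
    dest_parent = os.path.abspath(dest_parent)
    if os.path.islink(src):
        raise ValueError(f"{src} is already a symlink")
    if not os.path.isdir(src):
        raise ValueError(f"{src} is not a directory")

    target = os.path.join(dest_parent, os.path.basename(src))
    if os.path.exists(target):
        raise FileExistsError(f"{target} already exists")

    # shutil.move renames when possible and otherwise copies, then deletes.
    shutil.move(src, target)
    # The old path now points at the new location.
    os.symlink(target, src)
    return target
```

For the case above the call would be, for example, `relocate_with_symlink("/var/lib/mysql", "/disk")`, run as root with the database shut down.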

This seemed to be going ok, but job submission was still failing. When I checked the error log for mysql I got the message:

071008 16:25:32 [ERROR] /usr/sbin/mysqld: Can't open file: 'short_fields.MYI' (errno: 145)

The short_fields table definitely existed. I even checked it against the last version in /var. Logging in to mysql confirmed that this table had become corrupted (errno 145 is MyISAM's "table was marked as crashed and should be repaired" code).

I toyed with the idea of trying to save the database. However, that would probably have left us with an internally inconsistent RB database, as well as orphaned files in the /var/edgwl bookkeeping area.

Reluctantly I decided that the only sensible recourse was to reinstall svr023, and take the hit of the lost jobs.

This has now been done, and normal service has been resumed.

In the course of the re-install the /var partition has been grown to more than 100GB, which should protect us against large log files for quite some time.

Frankly, this was all very annoying, and I'm quite upset that we lost users' jobs.

Sorry folks. It shouldn't happen again.
