Thursday, August 13, 2009

Database backups, and lock time

Running a service creates data. Running a service for a long time creates lots of data.

In this case, it's the WMS and LB servers - we're sitting on about 18GB on each LB. That's not a problem in itself - the tables are well indexed against the usual queries (out of the box, no fiddling required), so the old data isn't really noticed.

Until you take a backup.

Then, in order to get a consistent backup, the database is locked for however long it takes to dump all that data. Which is about 45 minutes.

That's too long - it means we have a window when the service isn't available, and it's getting noticed. So, how can we take a backup without locking the database for so long?

There are various options for that, but the best-looking (read: simplest) one is to enable binary logging in MySQL. Because the tables used are all InnoDB, which is transactional, the backup can mark a position in the log and then use that to _not_ back up operations that came after it - which results in a consistent backup. (If you're using any MyISAM tables, which are not transactional, you can't do this. Hence the use of LVM snapshotting or other exotic techniques.)

This is really simple: in the my.cnf for each service, put 'log-bin' (without the quotes) in the [mysqld] section, and restart.
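For concreteness, the relevant fragment of my.cnf looks something like this (with no value given, MySQL derives the log basename from the hostname; you can also give it an explicit name):

```ini
[mysqld]
# Enable the binary log. With no value, the basename defaults to
# <hostname>-bin, giving files like <hostname>-bin.000001 plus an
# index file <hostname>-bin.index listing them.
log-bin
```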

Binary logging is now enabled.

Next, to take a lock free [0] dump, add the --single-transaction flag to mysqldump.

The time taken to actually dump the data to disk won't change, but the database won't be locked for that time.
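As a sketch, the invocation looks something like this (the database name and output path are made up; --master-data=2 is optional, but it records the binary log coordinates as a comment in the dump, which is what ties the dump to a position in the log):

```shell
# Lock-free dump of an InnoDB-only database (names here are examples).
# --single-transaction: dump inside one consistent-read transaction.
# --master-data=2: write the binlog position as a comment in the dump.
mysqldump --single-transaction --master-data=2 \
    lbserver > /backup/lbserver-$(date +%F).sql
```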

I did this for one of our LB servers, and then, while the dump was running, submitted a job through the WMS. The job was assigned to the LB I was dumping, proving the database could still be written to, and it has now completed - while the dump still hasn't finished.

I've modified our usual backup script so that if it detects the presence of /var/lib/mysql/${hostname}-bin.index, which is the index for the binary log, it automatically adds --single-transaction. That way, we still have a single backup script, but it takes the backup in the best way available.
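The detection logic is roughly this (a sketch - the function name and the demonstration at the bottom are mine; the index path and the flag are as described above):

```shell
#!/bin/bash
# Sketch of the backup script's binlog detection logic.

dump_command() {
    # Add --single-transaction only when the binary log index file
    # exists, i.e. binary logging is enabled on this server.
    local binlog_index="$1"
    local opts="--opt"
    if [ -f "$binlog_index" ]; then
        opts="$opts --single-transaction"
    fi
    echo "mysqldump $opts --all-databases"
}

# Demonstrate both branches: an existing (temporary) index file,
# and a missing one.
tmp_index=$(mktemp)
dump_command "$tmp_index"
dump_command "/nonexistent/host-bin.index"
rm -f "$tmp_index"
```

The real script would run the resulting command and redirect it to the backup file rather than just echoing it.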

There are a couple of downsides to binary logging. It means the DB has to write more data to disk, so it's about 1% slower; as the services are not running at 99% of the CPU, that's OK for us. It also means that each new piece of data is stored twice - once in the DB, and once in the log - so storage grows twice as fast. Faster, in fact, if there are deletes: a delete shrinks the database but still adds an entry to the log. I'm looking at an 18GB database, so this won't be a problem. Also, old logs can be purged, so I don't feel this is any more of a risk than the database itself expanding over the partition size.
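Purging can be done by hand from the mysql client (or automatically via the expire_logs_days option in my.cnf); the one-week interval here is just an example:

```shell
# Drop binary logs older than a week (needs the SUPER privilege).
mysql -e "PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 7 DAY);"
```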

One thing I'll be looking at is using the binary logs to take incremental backups. That still won't lock the database, but will also be much smaller and faster to take. It's a bit more complicated to arrange, though, so it goes into the pile of 'ideas that look nice, but we don't think we need them yet'.
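If I do get to it, the rough shape would be something like this (an untested sketch - the backup path is an assumption):

```shell
# Incremental backup sketch: rotate to a fresh binary log, then copy
# the closed ones off the machine.
mysql -e "FLUSH LOGS;"
rsync -a /var/lib/mysql/*-bin.[0-9]* /backup/binlogs/

# Restore = restore the last full dump, then replay the logs on top:
#   mysqlbinlog /backup/binlogs/*-bin.[0-9]* | mysql
```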

As an aside, I think this has to go down as one of the more anticlimactic updates - it was simple, quick, and just worked. Unless disk space is very tight, I can't see why one wouldn't enable it.

[0] Technically, it takes a lock, waits for all pending transactions to complete, marks the log position, and then releases it. If you have slow operations in flight, the lock is held for the duration of those operations.
