Friday, September 28, 2007

DPM Dies

Our DPM died last night (sad!). It seemed that / got full and this then caused DPM and MySQL to get into a punch-up where all the CPU on the machine was consumed.

Investigating (with help from Paul - thanks!), the culprit seems mostly to be an InnoDB "auto-extending data file" called ibdata1. This has now reached 2.1GB in size.

There is some advice about how to configure InnoDB to control these sizes, but as the default MySQL install on SL3 has no default my.cnf file, we'll have to create a sensible one before we can customise anything.

However, after some further investigation, it's now clear that 2.1GB is in fact the size of our DPM database (the gzipped database dumps are now 800MB!). This is with 14TB of data. Scaling up to 100TB, the DB will be > 10GB. Having looked at the tables, the obvious candidates to trim are dpm_put_filereq, dpm_get_filereq and dpm_req. These seem to contain historical data, but without timestamps they're pretty useless. These tables contain about 233MB, 460MB and 295MB respectively, which is about half the total DPM database size. [1]
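As a rough check of that arithmetic, here is a minimal Python sketch. It assumes the database grows linearly with the amount of data stored, which is a guess rather than something we've measured, and the function name is just illustrative.

```python
# Back-of-the-envelope check of the numbers quoted above.
# ASSUMPTION: database size scales linearly with the data held in DPM.

db_size_gb = 2.1     # current size of the DPM database (ibdata1)
stored_tb = 14.0     # data currently held in DPM
target_tb = 100.0    # the scale we are heading towards


def projected_db_gb(db_gb, stored, target):
    """Linearly extrapolate the database size to a larger data volume."""
    return db_gb * target / stored


# Approximate sizes (MB) of the request-history tables.
request_tables_mb = {
    "dpm_put_filereq": 233,
    "dpm_get_filereq": 460,
    "dpm_req": 295,
}

total_mb = sum(request_tables_mb.values())
# Treat 1GB as 1000MB; good enough at this level of precision.
fraction = total_mb / (db_size_gb * 1000)

projected = projected_db_gb(db_size_gb, stored_tb, target_tb)
print(f"Projected DB size at {target_tb:.0f}TB: {projected:.1f}GB")
print(f"Request tables: {total_mb}MB, {fraction:.0%} of the database")
```

On these figures the linear estimate comes out at about 15GB for 100TB, and the three request tables add up to just under 1GB.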

Recovery strategy has been to move /var/lib/mysql and /var/log off root, to a larger partition (in our case /disk). Soft links point out from the original locations.

I shall put in a ticket to the developers about trimming these tables when the data has aged into uselessness. [2]

The warning for other sites is that /var/lib/mysql really needs to live in a relatively large disk partition.

We'll address this problem properly when we upgrade to the gLite 3.1 version of DPM. In the meantime we urgently need to alarm on disk space usage on all the servers.
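For the disk space alarm, something along these lines would be a start. This is only a sketch using Python's standard shutil.disk_usage; the 90% threshold, the list of paths and the function names are all assumptions for illustration, and in practice the check would be hooked into whatever monitoring the site already runs.

```python
import shutil


def disk_usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report zero size; treat them as empty.
        return 0.0
    return usage.used / usage.total


def partitions_over_threshold(paths, threshold=0.9):
    """Return (path, fraction) pairs for filesystems at or above threshold."""
    over = []
    for path in paths:
        frac = disk_usage_fraction(path)
        if frac >= threshold:
            over.append((path, frac))
    return over


if __name__ == "__main__":
    # Hypothetical list of paths to watch; adjust per server.
    watched = ["/"]
    for path, frac in partitions_over_threshold(watched, threshold=0.9):
        print(f"WARNING: {path} is {frac:.0%} full")
```

Note that used/total can differ slightly from the percentage df reports, since df accounts for blocks reserved for root.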

[1] Try:
mysql> user dpm_db; show table status like "dpm_%";

(Thanks Paul.)



Greig A Cowan said...

A similar thing was previously possible with dCache. It also used to keep a history of past SRM transactions, but the tables just became too big. Their subsequent removal was what broke the MonAMI plugin for dCache.

Paul Millar said...

A trivial typo: the "user" in the SQL should have been "use". Also, using "\G" results in more digestible output: