Ever since ATLAS analysis was enabled at Tier 2 sites (and the relevant sheaves of AOD files arrived at our DPM), the Glasgow DPM has been looking increasingly strained.
This first became obvious during the HammerCloud analysis tests in December, but over January it became increasingly clear that the access patterns of normal analysis jobs, en masse, are quite enough to make the storage unreliable for other users.
In particular, we had one period where chunks of ATLAS production work died because the DPM was so overloaded.
Looking at the DPM during these periods, the load appeared to be a combination of I/O waits and, more significantly, the dpm and srmv2.2 daemons maxing out the CPU.
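For the curious, this kind of diagnosis needs nothing fancier than the standard tools; a quick sketch of the sort of thing we were looking at (the daemon names are from our setup):

```shell
# Overall CPU breakdown from /proc/stat -- field 6 is the cumulative
# iowait time, so this prints iowait as a share of all CPU time so far
awk '/^cpu /{total=$2+$3+$4+$5+$6+$7+$8; printf "iowait: %.1f%%\n", 100*$6/total}' /proc/stat

# Which processes are eating the CPU (dpm and srmv2.2, in our case)
ps -eo pcpu,comm --sort=-pcpu | head -n 5
```

A high iowait share points at the disks; the dpm and srmv2.2 daemons sitting at the top of the `ps` output points at the CPU, which is what we were seeing.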
Last Friday, we tried "optimising" the DPM MySQL backend by taking the DPM offline and then exporting, dropping, and reimporting the dpm_db and cns_db databases. The InnoDB engine has a known issue where its tablespace becomes fragmented over time, increasing the size of the physical DB file and reducing performance; reimporting from a logical backup usually removes this fragmentation in the restored DB.
Unfortunately, this reimporting process took far longer than we anticipated---on the order of 5 hours!---and, in the end, resulted in a distinctly unimpressive 10% size reduction in the physical DB.
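For anyone tempted to try the same thing, the dump-and-reload was the standard mysqldump round trip; a sketch, assuming a local MySQL server with the usual DPM database names (service names and credentials will vary by site):

```shell
# Stop the DPM-side daemons first so nothing writes to the DB mid-dump
service dpm stop
service srmv2.2 stop
service dpnsdaemon stop

# Logical backup of both DPM databases
mysqldump --opt dpm_db > dpm_db.sql
mysqldump --opt cns_db > cns_db.sql

# Drop, recreate, and reimport -- this rebuilds the tables
# without the accumulated fragmentation
mysql -e 'DROP DATABASE dpm_db; CREATE DATABASE dpm_db;'
mysql dpm_db < dpm_db.sql
mysql -e 'DROP DATABASE cns_db; CREATE DATABASE cns_db;'
mysql cns_db < cns_db.sql

# Bring everything back up
service dpnsdaemon start
service srmv2.2 start
service dpm start
```

One caveat worth noting: with InnoDB's default shared tablespace, the ibdata file itself never shrinks in place; only per-table tablespaces (innodb_file_per_table), or removing and rebuilding the tablespace files entirely, return space to the filesystem.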
After bringing things back up again, however, it became clear that the performance hadn't changed much, and that most likely we simply needed to give the DPM processes more room to breathe.
Our DPM is considerably underspecced compared to our new worker nodes (which are lovely 8-core machines at higher clock rates) but, of course, has the benefit of RAIDed storage to give our DB a bit more reliability. So we decided to take the big step of splitting the DPM across two nodes: the old DPM machine becoming a dedicated MySQL backend server, and the "new" DPM being a repurposed worker node hosting all the DPM services.
Thanks to cfengine, and the arcane workings of YPF, it isn't too hard to turn a node into any other kind of node that we want. The tricky bit, in this case, is swapping the hostnames, so that the "new" DPM still gets to be svr018, while the old DPM moves to svr015 (which now also hosts our DPM monitoring).
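For reference, the hostname change itself is only a couple of files on a Scientific-Linux-era machine; a minimal sketch, with a placeholder domain (DNS, the cfengine classes, and the host certificates all have to be changed to match, which is where the real work is):

```shell
# On the repurposed worker node (node310 becoming svr018).
# example.org is a placeholder -- substitute the real site domain.
sed -i 's/^HOSTNAME=.*/HOSTNAME=svr018.example.org/' /etc/sysconfig/network
hostname svr018.example.org

# Make sure the node can still resolve itself while DNS catches up
# (192.0.2.18 is a documentation address -- use the node's real IP)
echo '192.0.2.18  svr018.example.org svr018' >> /etc/hosts
```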
The new svr018 used to be node310 - the last node in our pool of new worker nodes - which I'd taken offline and allowed to drain over the weekend in anticipation of the move.
In the end, thanks to some synchronized administration by Mike and me, the move on Monday went relatively smoothly, with only an hour of downtime and barely a failed job in sight, despite the cluster being full of ATLAS production at the time.
It looks like this also improved our HammerCloud performance, about which more in a later post.