Thursday, February 12, 2009

HammerCloud 135: A Load Shared is a Load Halved, to a Point

We split our DPM across two hosts just in time for the most recent HammerCloud test on UK sites:
http://gangarobot.cern.ch/st/test_135/

So, we already have some metrics to compare the old arrangement with the new.
For reference, Graeme blogged about the last big HammerCloud UK test here, where we were getting an event rate of around 10 Hz, at the cost of the DPM head node running at an unsustainable load.
Since then, a couple of HammerCloud tests have come by, generally coincident with ATLAS production and other stresses on the DPM, and it has just utterly failed to cope.

After our surgery, we did a lot better:

[Plot: HammerCloud test 135 event rates]
with an event rate of about 14 Hz: almost a 50% improvement (strictly, 14 Hz over the previous 10 Hz is a 40% gain).

And the load on the DPM head node was much more acceptable, given the increased power of the hardware:

[Graph: DPM head node load]
However, we're still not close to maxing out the pool nodes:

[Graph: pool node load]
probably because we've hit the next performance bottleneck, on the new svr015 "MySQL server" machine:

[Graph: svr015 CPU utilisation]
The orangish stuff is CPU time spent in the I/O wait state: the processor sitting idle while the disk seeks around the database.
We're currently looking at ways of tuning MySQL, or the disk, to improve things, since it looks like there's at least another 30 to 40% of performance to be squeezed out.
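
If you want to put a number on that I/O wait without squinting at graphs, here's a minimal sketch in Python. It just samples the counters in /proc/stat twice (the iowait counter is the fifth field on the cpu line; the 5-second interval is arbitrary):

#!/usr/bin/env python
# Minimal sketch: sample /proc/stat twice and report what fraction of
# CPU time went to I/O wait in between. The cpu line's counters are,
# in order: user nice system idle iowait irq softirq ...
import time

def cpu_times():
    f = open("/proc/stat")
    fields = f.readline().split()[1:]  # drop the leading "cpu" token
    f.close()
    return [int(x) for x in fields]

before = cpu_times()
time.sleep(5)
after = cpu_times()

deltas = [b - a for a, b in zip(before, after)]
iowait = deltas[4]  # iowait counter (present since kernel 2.5.41)
print("iowait: %.1f%% of CPU time" % (100.0 * iowait / sum(deltas)))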

Some ideas we've had include splitting the dpm_db and cns_db across different filesystems (since they have very different access patterns under this kind of use; a rough sketch of that follows below), tweaking MySQL settings (although they look generally fine...), or even getting Faster Disks. Roll on solid state drives, we say!
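
For the curious, here's roughly what the filesystem split could look like. This is a sketch, not a recipe: the paths are hypothetical, it assumes MySQL's directory-per-database layout under the datadir, and any InnoDB data held in the shared ibdata files won't move with it unless innodb_file_per_table is set. mysqld must be stopped first.

#!/usr/bin/env python
# Sketch: relocate cns_db onto a second filesystem and symlink it back
# into the MySQL datadir. Hypothetical paths throughout.
import os
import shutil

DATADIR = "/var/lib/mysql"        # assumed MySQL datadir
NEW_HOME = "/mnt/disk2/mysql"     # hypothetical second filesystem

src = os.path.join(DATADIR, "cns_db")
dst = os.path.join(NEW_HOME, "cns_db")

if not os.path.isdir(NEW_HOME):
    os.makedirs(NEW_HOME)
shutil.move(src, dst)   # copies across filesystems, then removes src
os.symlink(dst, src)    # mysqld follows the symlink transparently
# Note: the copy won't preserve ownership across filesystems, so
# chown the new directory back to the mysql user before restarting.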
