Friday, December 19, 2008

Confessions of a Data Management Systems Manager

After my cunningly timed arrival at Glasgow, barely two weeks before the start of Christmas Break (actually, I suspect they call it "Winter Break" now, although "Io Saturnalia!" would be both more fitting and more amusing), I've tried to hit the ground moving at a vaguely speedy pace on Storage / Data managementy things.

So, as the new Andrew Elwell, here's what I've managed to do so far:

Partly as a means of getting myself better acquainted with the arcane mysteries of the DPM, I wrote this useful little tool which produces a pretty-printed output of all the storage used on a DPM, by VO and users within the VO.
Greig and I are planning to stick it in the next release of his DPM Admin Tools package but anyone who wants a beta release can have it if they ask.
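For anyone curious what "by VO and users within the VO" amounts to, here is a minimal sketch of the aggregation step. It is not the actual tool: the record format (VO name, user name, size in bytes) and the function names `summarise_usage`, `human` and `format_report` are all made up for illustration, and it assumes the per-file records have already been pulled out of the DPM somehow.

```python
from collections import defaultdict


def summarise_usage(records):
    """Sum hypothetical (vo, user, size_bytes) records per VO and per user."""
    per_vo = defaultdict(int)
    per_user = defaultdict(lambda: defaultdict(int))
    for vo, user, size in records:
        per_vo[vo] += size
        per_user[vo][user] += size
    return per_vo, per_user


def human(nbytes):
    """Format a byte count using binary units, capped at TiB."""
    value = float(nbytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if value < 1024 or unit == "TiB":
            return f"{value:.1f} {unit}"
        value /= 1024


def format_report(records):
    """Return a text report: each VO (largest first), then its users."""
    per_vo, per_user = summarise_usage(records)
    lines = []
    for vo in sorted(per_vo, key=per_vo.get, reverse=True):
        lines.append(f"{vo:<20}{human(per_vo[vo]):>12}")
        users = per_user[vo]
        for user in sorted(users, key=users.get, reverse=True):
            lines.append(f"  {user:<18}{human(users[user]):>12}")
    return "\n".join(lines)
```

Calling `print(format_report(records))` on a list of such tuples gives one line per VO followed by indented lines for its users, biggest consumers first.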

DPM performance & xrootd
After the series of ATLAS Analysis Challenges made it increasingly clear that DPM can't produce an effective event rate of greater than about 12 Hz on any of the sites in the challenge, we decided this was worth some investigation. (Interestingly, Tokyo's cluster seems to be capable of getting up to 24 Hz, with DPM.)
At this rate, the DPM head node maxes out CPU, but the network rates from the head node and the pools are very low.

It appears from the DPM logs at Glasgow that the majority of the DPM's time is spent doing X509 authentication on each get request. Each authentication takes around 1.5 seconds, and we need two per request (one on the DPM and one on the disk pool), so authentication accounts for most of the time involved in transfers of small files like the AODs (about 30 MB each).
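To put rough numbers on that, here is a back-of-the-envelope sketch. The authentication time, the two authentications per request and the file size come from the figures above (reading the file size as about 30 megabytes); the link speed is purely an assumption for illustration, and the calculation ignores disk, protocol and queueing overheads.

```python
AUTH_SECONDS = 1.5       # per X509 authentication (figure from the logs)
AUTHS_PER_REQUEST = 2    # one on the DPM head node, one on the disk pool
FILE_MEGABYTES = 30      # approximate AOD size, read as megabytes


def auth_fraction(link_mbit_per_s):
    """Return (auth seconds, transfer seconds, auth share of the total).

    link_mbit_per_s is an assumed usable network rate, not a measurement.
    """
    auth = AUTH_SECONDS * AUTHS_PER_REQUEST
    transfer = FILE_MEGABYTES * 8 / link_mbit_per_s
    return auth, transfer, auth / (auth + transfer)


if __name__ == "__main__":
    auth, transfer, share = auth_fraction(1000)  # assumed 1 Gbit/s link
    print(f"auth {auth:.2f} s, transfer {transfer:.2f} s, auth share {share:.0%}")
```

With an assumed 1 Gbit/s link, the 3 seconds of authentication compare with about 0.24 seconds of actual data movement, which fits with the network rates being very low while the head node's CPU is maxed out.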

We thought, therefore, that we'd try disabling X509 auth on the Glasgow DPM and getting another Challenge sent to us. This involves some fairly dangerous settings in shift.conf on all the DPM nodes, which we made, and it seemed to work, with a noticeable speed increase, for rfcp on a node.
For some reason, though, the ganga jobs in the Analysis challenge did this:

which is clearly not expected.
We're still not sure why running DPM in "no X509", trusted mode breaks ganga submitted jobs in this way - it didn't break any of ATLAS Production, and rfcp and lcg-cp both worked when we tested them. In any case, we undid these changes sharpish...

The next avenue for testing is transfer protocols other than rfio. Luckily, we have a "spare" DPM, svr025, which I've added xroot support to (thanks to some help from Greig), and will be using to test the benefits and efficiencies of the various DPM plugins vs rfio. Next year, we'll see how I've gotten on...
