Thursday, April 08, 2010

Take my outputs, damn you...

We recently ran up a very large backlog of production output files waiting to go from Glasgow back to the Tier-1 (a reminder: panda doesn't consider a job finished until its outputs are safely stored at the T1). This is clearly seen in the red line on the panglia plot above, which reaches very high values. Because we had recently cut the timeout for jobs in the transferring state in the UK cloud to 2 days, to improve the responsiveness of the production system, we started to leak failed jobs (light green line) as panda gave up on the transfers and decided to rerun the jobs.

Fortunately we got a big boost in the number of FTS slots from Glasgow to RAL, increasing from 10 to 25 active transfers (see the bottom FTS monitoring plot). Even so, it clearly takes 24 hours for the whole backlog to drain.

One of the problems here is that the output files from simulation are small (a tiny log file and a 20-50MB HITS file), so the per-file overheads of FTS + SRM are considerable and the actual bandwidth achieved is quite low. One possibility we are considering in ATLAS is introducing a pre-merge of outputs at the T2, which would allow us to send much bigger files back to the T1 (although a final "super-merge" will probably still be necessary). For this we are waiting for the generic Athena merge transform, and then we will need to test integrating it into the mainline production workflow.
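
To put rough numbers on why small files hurt, here is a back-of-the-envelope sketch in Python. It assumes a very simple model that is mine, not anything measured: each file pays a fixed setup cost (standing in for the FTS + SRM overhead) before streaming at a fixed per-stream rate, and the channel runs a fixed number of transfers in parallel. The function names, the overhead and the rate are all hypothetical placeholders, so plug in your own values.

```python
# Toy model of small-file transfers. Every number passed in is an
# assumption supplied by the caller, not a measurement from Glasgow or RAL.

def effective_throughput_mb_s(file_size_mb, overhead_s, stream_rate_mb_s):
    """Throughput of one transfer slot when each file pays a fixed setup cost.

    overhead_s stands in for the per-file FTS + SRM negotiation time
    (hypothetical value); stream_rate_mb_s is the rate once data flows.
    """
    transfer_s = file_size_mb / stream_rate_mb_s
    return file_size_mb / (overhead_s + transfer_s)


def drain_hours(n_files, file_size_mb, overhead_s, stream_rate_mb_s, slots):
    """Hours to clear a backlog of equal-sized files over `slots` parallel transfers."""
    per_file_s = overhead_s + file_size_mb / stream_rate_mb_s
    return n_files * per_file_s / slots / 3600.0


if __name__ == "__main__":
    # Assumed: 10 s overhead per file, 10 MB/s per stream.
    for size_mb in (50, 500):
        rate = effective_throughput_mb_s(size_mb, overhead_s=10, stream_rate_mb_s=10)
        print(f"{size_mb} MB files: {rate:.2f} MB/s per slot")
    for slots in (10, 25):
        hours = drain_hours(9000, 50, overhead_s=10, stream_rate_mb_s=10, slots=slots)
        print(f"{slots} slots: {hours:.2f} h to drain 9000 files")
```

In this model the fixed overhead dominates for files of a few tens of MB, which is why merging outputs before transfer should help more than adding slots: more slots divide the drain time, while bigger files remove most of the per-file cost.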

Until then we just have to take the operational load of tweaking the FTS settings when necessary.

1 comment:

Chris said...

Queen Mary experienced the same problem a couple of months ago - at least in part due to Imperial's jobs using QMUL's SE.

I suspect the "real" solution is pipelining support in gridFTP.