Wednesday, October 24, 2007
News from ATLAS Computing
There is some significant news for sites from ATLAS software week. The decision was taken yesterday to move all ATLAS MC production to a pilot job system, with the pilots based on the Panda system developed by US ATLAS. The new system will get a new name; pallete and pallas are the front runners. (I like pallas myself.)
In addition, DDM will standardise on EGEE components such as LFC.
As this is a pilot job system, the anticipated model is that ATLAS production will keep a steady stream of pilots on ATLAS T2 sites, which pull in real job payloads from the central queue.
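To make the pull model concrete, here is a minimal Python sketch. It is not Panda code: `run_pilot`, the in-memory queue and the toy payloads are all made up for illustration, and a real pilot would talk to the central queue over the network.

```python
from collections import deque

def run_pilot(central_queue):
    """Pull payloads from the central queue until none are left.

    Hypothetical sketch: the 'queue' is a local deque of callables.
    """
    results = []
    while central_queue:
        # The pilot asks for work; nothing is pushed to the site.
        payload = central_queue.popleft()
        results.append(payload())
    return results

# Two toy payloads standing in for real production jobs.
queue = deque([lambda: "simulation done", lambda: "digitisation done"])
print(run_pilot(queue))
```

The point is only that the batch slot is claimed first and the real work is decided afterwards, at the moment the pilot asks for it.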
This is very like the LHCb MC production model, so as the transition is made to this system sites should start to see much better usage of their resources by ATLAS - just as LHCb are able to scavenge resources from all over the grid.
Canada have recently shifted to Panda production, and greatly increased their ATLAS workrate as a result. There have been some trials of the system in the UK and France, which were also very encouraging.
Of course, there is a very large difference from LHCb, because ATLAS don't just do simulation at T2s, but also digitisation and reconstruction. These steps require input files. The Panda-based system deals with this by ensuring that the relevant input dataset is staged on your local SE by the ATLAS distributed data management system (DDM). The output dataset is also created on your local SE, and DDM ships it up to the Tier-1 after the jobs have run.
This means that sites will really need a working storage system for ATLAS work from now on. (In the previous EGEE production model, any of the SEs in your cloud could be used, which masked a lot of site problems but caused us huge data management headaches.)
In the end using pilots had two compelling advantages. Firstly, the sanity of the environment can be checked before a job is actually run, which means that panda gets 90% job efficiency (the other EGEE executors struggled to reach 70%). Secondly, and this is the clincher, it means that we can prioritise tasks within ATLAS, which is impossible to do otherwise.
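Both advantages can be shown in a few lines of Python. This is a sketch under assumptions, not how Panda actually does it: the disk-space check, the function names and the "lower number means more urgent" convention are all invented for the example.

```python
import heapq
import shutil

def environment_is_sane(min_free_bytes=1_000_000, path="."):
    """Hypothetical sanity check: is there enough scratch disk?

    A real pilot would test much more (software release, SE access, ...).
    """
    return shutil.disk_usage(path).free >= min_free_bytes

def pull_next_job(central_queue, sane):
    """Return the most urgent job name, or None.

    No job is taken from the queue if the node failed its sanity check,
    so a broken worker node cannot burn a real job.
    """
    if not sane or not central_queue:
        return None
    _priority, _seq, job = heapq.heappop(central_queue)
    return job

# Central queue ordered by priority (assumed: lower number = more urgent).
# The sequence number keeps submission order among equal priorities.
central_queue = []
jobs = [(5, "bulk simulation"), (1, "urgent reconstruction"), (5, "digitisation")]
for seq, (priority, name) in enumerate(jobs):
    heapq.heappush(central_queue, (priority, seq, name))

print(pull_next_job(central_queue, sane=environment_is_sane()))
```

Because the ordering lives in the central queue rather than in each site's batch system, ATLAS can reshuffle priorities right up to the moment a pilot asks for work.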
At the moment the push will be to get ATLAS production moved to this new system - probably on a cloud-by-cloud basis. This should not cause the sites headaches as production is a centralised activity (most sites still have a single atlasprd account anyway). However, the pilots can also run user analysis jobs - and this will require glexec functionality. Alessandra and I stressed to Kors that this must be supported in the non-suid mode of glexec.
In the UK ATLAS community we now need to get our DQ2 VOBox working properly - at the moment dataset subscriptions in the UK are much too slow.
Postscript: I met Joel at lunchtime, who wanted me to namecheck LHCb's DIRAC system, as panda is based on DIRAC - well, I didn't know that, but I suppose I'll learn a lot more about the internals of these things in the next few months.