Tuesday, July 03, 2007

ATLAS Software Week

Last week I was at ATLAS Software Week at CERN.

It was a useful meeting (as ever, meeting people and chatting was the most important part!). Some issues I picked up for ATLAS sites were:
  1. Although 13.0.10 has been released, quite a few things are known to be broken (event generation, for instance). This means we are stuck with keeping a lot of "old" ATLAS software releases on our sites. At Glasgow we have 86GB of ATLAS software - more than 60% of the total for all VOs.
  2. Preparations for Computing System Commissioning and the Final Dress Rehearsal are underway. The start date seems to have slipped (it was meant to start this week). I must find out what the site involvement schedule actually is.
  3. The DQ2 data management system was upgraded to 0.3 last week. There were a few teething troubles, but the next release should handle many common problems much better.
  4. There's pressure not to run too many simulations as part of each job sent to a site, so as to keep the wallclock time down (< 24 hours), but this reduces the file sizes. Small files are a big problem - they are inefficient to transfer and gunge up any tape system, so they should really be merged before any migration to tape. (This is a problem for CASTOR though, which even puts T0D1 stuff onto tape?)
  5. Event sizes keep going up. The Computing TDR had ESD at 0.5MB, but currently it is 1.6MB (1.8MB for MC). A realistic target will probably be 1.3MB.
  6. Memory footprints are rising too. 2GB is necessary for simulation, and probably for a subset of reconstruction jobs too.
  7. To deal with merging and pile-up jobs, worker nodes should now be specced with at least 20GB of disk space per core. At the moment, however, jobs will try to limit their ambitions to 10GB. This requirement also seems to grow monotonically, so make sure it's accounted for in forthcoming purchases.
  8. Queues for ATLAS production should be around 24 to 36 hours of CPU and wall time (N.B. this is on modern CPUs). NIKHEF are currently at 24/36 and I'm going to cut Glasgow back to 36 hours.
  9. If you see stuck ATLAS jobs, try to investigate the problem and report it to atlas-comp-oper@cern.ch. This will help cut off the nasty tail in the ATLAS efficiency curve.
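
For sites wanting to check existing or planned worker nodes against the memory and disk figures in points 6 and 7, here is a minimal Python sketch. The `WorkerNode` class and `check_node` function are hypothetical names made up for illustration, not part of any ATLAS or grid tool, and the sketch assumes one job slot per core unless told otherwise.

```python
from dataclasses import dataclass

# Figures quoted in the list above.
MIN_MEMORY_GB_PER_JOB = 2.0   # simulation, and probably some reconstruction
MIN_DISK_GB_PER_CORE = 20.0   # merging and pile-up jobs


@dataclass
class WorkerNode:
    """Hypothetical description of a worker node (not a real grid schema)."""
    name: str
    cores: int
    memory_gb: float
    scratch_disk_gb: float


def check_node(node, job_slots=None):
    """Return a list of shortfalls; an empty list means the node looks fine.

    Assumption: one job slot per core unless job_slots is given.
    """
    slots = node.cores if job_slots is None else job_slots
    problems = []

    mem_per_slot = node.memory_gb / slots
    if mem_per_slot < MIN_MEMORY_GB_PER_JOB:
        problems.append(
            f"memory: {mem_per_slot:.1f}GB per slot, "
            f"want {MIN_MEMORY_GB_PER_JOB:.1f}GB"
        )

    disk_per_slot = node.scratch_disk_gb / slots
    if disk_per_slot < MIN_DISK_GB_PER_CORE:
        problems.append(
            f"disk: {disk_per_slot:.1f}GB per slot, "
            f"want {MIN_DISK_GB_PER_CORE:.1f}GB"
        )

    return problems


if __name__ == "__main__":
    # Made-up example node, purely for illustration.
    node = WorkerNode("node001", cores=4, memory_gb=4, scratch_disk_gb=60)
    for problem in check_node(node):
        print(f"{node.name}: {problem}")
```

Since both requirements seem to keep rising, the two constants are worth revisiting before each purchase rather than treating them as fixed.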

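The merging mentioned in point 4 can be pictured as grouping small output files into batches before they go to tape. The sketch below is only an illustration of that idea, not how ATLAS production or DQ2 actually does it: `plan_merges` is a hypothetical function, and the 1000MB target is an arbitrary assumption rather than an ATLAS figure.

```python
def plan_merges(files, target_mb=1000.0):
    """Group (name, size_mb) pairs, in order, into merge batches.

    A batch is closed as soon as adding the next file would push it over
    target_mb. A single file larger than the target gets its own batch.
    target_mb is an assumed value, not an official figure.
    """
    batches = []
    current = []
    current_size = 0.0

    for name, size_mb in files:
        if current and current_size + size_mb > target_mb:
            batches.append(current)
            current = []
            current_size = 0.0
        current.append(name)
        current_size += size_mb

    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Made-up file names and sizes, purely for illustration.
    small_files = [("f1", 400), ("f2", 400), ("f3", 400), ("f4", 150)]
    for i, batch in enumerate(plan_merges(small_files)):
        print(i, batch)
```

Keeping the files in their original order is a deliberate simplification; a real merge step would presumably also group by dataset.
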