Tuesday, July 03, 2007

Health and Efficiency

As part of investigating the problems of stalled jobs, I have plotted Wall vs. CPU time for ATLAS and LHCb on our cluster.

LHCb jobs are generally quite efficient (as evidenced by their 93% efficiency on the EGEE accounting pages). What's interesting is the cluster of jobs at 11 and 22 hours of CPU time, with a smear in wall clock time from perfect efficiency down to ~50% (data management strikes again?).
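
For reference, efficiency here is taken to mean CPU time divided by wall clock time. A minimal sketch of that calculation follows; the function name and the job numbers are made up for illustration and are not taken from our accounting data.

```python
def cpu_efficiency(cpu_hours, wall_hours):
    """Return CPU time as a fraction of wall clock time."""
    if wall_hours <= 0:
        raise ValueError("wall clock time must be positive")
    return cpu_hours / wall_hours


# Hypothetical jobs sitting in the 11-hour CPU cluster: (cpu_hours, wall_hours)
jobs = [(11.0, 11.0), (11.0, 14.0), (11.0, 22.0)]

efficiencies = [cpu_efficiency(cpu, wall) for cpu, wall in jobs]

for (cpu, wall), eff in zip(jobs, efficiencies):
    print(f"cpu={cpu:5.1f}h  wall={wall:5.1f}h  efficiency={eff:.0%}")
```

A job that uses 11 hours of CPU but occupies its slot for 22 hours of wall clock comes out at 50%, which is the lower edge of the smear.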

ATLAS jobs have a far more variable profile: many more short jobs of high efficiency, plus a broader, flatter distribution out to lower efficiencies. There's also a very distinct line of problematic jobs (the spike on the tail).

It seems that with our new fast CPUs our queue time limits are much too long (inherited from the old cluster, if I remember). LHCb and ATLAS both seem happy for the queue limits to be reduced from 96/100 hours to 36/36 hours.

