Thursday, July 05, 2007

LHCb Stuck Jobs

By coincidence, just as the stalled jobs document was being drafted, we got 23 stalled LHCb jobs last Friday. These jobs had consumed about a minute of CPU time and then just stopped.

I reported them to lhcb-production@cern.ch and the response from LHCb was very swift and helpful. We did quite a bit of debugging on them - although in the end we had to confess that exactly why these ones had stalled was something of a mystery. At first LHCb thought that NFS might have gone wobbly at our end, so the jobs got stuck reading the VO software. From what I could see this was unlikely, and when NIKHEF, RAL and IN2P3 reported similar problems we were off the hook.

Some useful tools for stuck jobs:
  • lsof - see what file handles are open
  • strace - what's the job doing right now
  • gdb - attach a debugger to the code
In fact, a lot of simple diagnostics also help: what's in the job's running directory, what STDOUT/STDERR has been produced so far, etc.
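
For reference, here is a minimal sketch of how those three tools are typically pointed at a running process by PID. The function name `diagnostic_commands` and the PID 12345 are made up for illustration, and the choice of a one-shot gdb backtrace (`-batch -ex bt`) is my assumption of a sensible default rather than what we actually ran. It only builds and prints the command lines; it does not attach to anything.

```python
import shlex


def diagnostic_commands(pid):
    """Build command lines for inspecting a running process (illustrative helper)."""
    pid = str(int(pid))  # reject anything that is not a plain integer PID
    return {
        # list the file handles the process has open
        "lsof": ["lsof", "-p", pid],
        # attach and show system calls as they happen (stays attached until interrupted)
        "strace": ["strace", "-p", pid],
        # attach a debugger, print one backtrace, then detach and exit
        "gdb": ["gdb", "-p", pid, "-batch", "-ex", "bt"],
    }


if __name__ == "__main__":
    # 12345 is a placeholder PID, not one of the real stalled jobs
    for name, cmd in diagnostic_commands(12345).items():
        print(f"{name}: {shlex.join(cmd)}")
```

On a worker node these generally need to be run as root or as the pool account that owns the job, since attaching to another user's process is not normally allowed.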

When these jobs are killed it's helpful to poke the stalled process itself - that way information gets back to the VO. A qdel, by contrast, loses all the outputs and gets the job resubmitted elsewhere, which is far less helpful.

In the end, whatever the bug is, it's down at the 10^-6 level!

Thanks to LHCb for being so responsive.

I must also take my hat off to Paul and his MonAMI torque plugin. His live efficiency plots for the batch system queues made spotting this very easy. In the past this sort of thing would have been noticed on a very hit-or-miss basis.
