Tuesday, May 29, 2007

The Road To Hell Is Paved With Data Management...


After running rather full for more than a week, we had lots of jobs which seemed to be hanging. On investigation, most of these were due to stuck data management commands - mostly lcg-cp. While the lcg-cp failed to exit, the job sat idle as it slowly approached its maximum wallclock time, at which point it would be killed.

As we had a queue of jobs which wanted to run it seemed absurd to leave these slackers until the batch system sent them to their fiery doom. Better to kill them off early and get some well behaved jobs in.

For a small minority of ilc jobs I killed the lcg-cp by hand, and a few of them managed to restart properly. However, when I discovered a biomed user with 110 hung jobs, it was going to be far too tedious to script checkjob, ssh, grep and kill to try to jump start them - so I wielded my trusty qdel.

Unfortunately, qdeling 110 jobs at once produced a rather fierce load spike on the CE. The GRIS plugin then timed out and we spasmed into 4444 waiting jobs and the usual 68 years of ERT. Whoops! Next time I'll delete them more slowly.
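Deleting "more slowly" could look something like the sketch below: hand qdel a few job IDs at a time and pause between batches so the CE gets a chance to recover. This is only an illustration, not what was run on the day. The function names, the batch size of 5 and the 30 second pause are all assumptions and have not been tuned against a real CE.

```python
import subprocess
import time
from typing import Callable, List, Sequence


def chunked(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def run_qdel(job_ids: List[str]) -> None:
    """Call qdel once for a batch of job IDs (qdel accepts several IDs)."""
    # check=False: a job that has already exited should not stop the sweep.
    subprocess.run(["qdel", *job_ids], check=False)


def throttled_qdel(
    job_ids: Sequence[str],
    batch_size: int = 5,           # assumed value, not tuned
    pause_seconds: float = 30.0,   # assumed value, not tuned
    delete: Callable[[List[str]], None] = run_qdel,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Delete jobs in small batches, pausing between batches.

    Returns the number of batches issued. `delete` and `sleep` are
    injectable so the pacing logic can be dry-run without a batch system.
    """
    batches = chunked(job_ids, batch_size)
    for index, batch in enumerate(batches):
        delete(batch)
        # No need to wait after the final batch.
        if index < len(batches) - 1:
            sleep(pause_seconds)
    return len(batches)
```

A dry run is possible by passing `delete=print`, which shows the batches that would be sent to qdel without touching the batch system. With the defaults above, 110 jobs would go out as 22 batches over roughly ten and a half minutes.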

I contacted the users via the CIC portal interface that Alessandra had suggested. For this purpose it seems to work rather well.

Finally, after all of that, we got 3 jobs that were sitting in "W" state. With no ideas, nothing useful in the PBS logs and many other things to do, I had to qdel them as well.

It takes a long time to clear out this stuff - a couple of hours at least. Fortunately it doesn't seem to happen too often.
