Friday, August 11, 2006

After suspecting for some time that the CMS jobs we were taking were not actually completing successfully, but were instead just running out of CPU time, I finally got around to having a look at the PBS accounting logs. These bore out my fears: the exit status of the jobs was always 271, with CPU usage some tens of seconds over the 48-hour limit (which YAIM had set when it set up the queues).
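A check like this can be scripted rather than done by eye. The following is only a minimal sketch, assuming the usual PBS/Torque accounting layout of `date time;record type;job id;key=value ...` with job-end records marked `E`, and assuming `resources_used.cput` is written as HH:MM:SS. The sample line, its job id and the helper names are made up for illustration; they are not taken from the real logs.

```python
CPU_LIMIT_SECONDS = 48 * 3600  # the 48-hour queue limit described above


def hms_to_seconds(hms):
    """Convert an HH:MM:SS string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_end_record(line):
    """Return a dict for a job-end ('E') record, or None for any other line."""
    fields = line.rstrip("\n").split(";", 3)
    if len(fields) != 4 or fields[1] != "E":
        return None
    # The message part is assumed to be space-separated key=value pairs.
    attrs = dict(item.split("=", 1) for item in fields[3].split() if "=" in item)
    attrs["job_id"] = fields[2]
    return attrs


def killed_at_cpu_limit(record, limit=CPU_LIMIT_SECONDS):
    """True if the job ended with status 271 at or beyond the CPU limit."""
    cput = record.get("resources_used.cput")
    return (
        record.get("Exit_status") == "271"
        and cput is not None
        and hms_to_seconds(cput) >= limit
    )


# Hypothetical accounting line, invented for this example.
SAMPLE = (
    "08/10/2006 03:14:15;E;1234.example-server;user=cms001 queue=cms "
    "Exit_status=271 resources_used.cput=48:00:37 "
    "resources_used.walltime=50:12:03"
)

record = parse_end_record(SAMPLE)
print(record["job_id"], killed_at_cpu_limit(record))
```

Running the two functions over every line of an accounting file and counting the matches would show how many jobs were lost this way.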

I wonder why we never got a ticket? We have processed hundreds of these jobs and I suppose that all of them have failed and been wasted.

I have now doubled the maximum CPU time to 96 hours, and bumped the wall time to 144 hours. After this weekend we'll see if that was enough.
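For reference, a change like this is normally made through `qmgr`. The sketch below only builds and prints the command lines; it does not run them. It assumes the limits are the queue attributes `resources_max.cput` and `resources_max.walltime`, and the queue name `cms` is a placeholder, since the real queue name is not given here.

```python
def hours_to_hms(hours):
    """Format a whole number of hours as the HH:MM:SS string PBS expects."""
    return f"{hours:02d}:00:00"


def qmgr_commands(queue, cput_hours, walltime_hours):
    """Build (but do not run) qmgr commands that set a queue's time limits."""
    return [
        f'qmgr -c "set queue {queue} resources_max.cput = {hours_to_hms(cput_hours)}"',
        f'qmgr -c "set queue {queue} resources_max.walltime = {hours_to_hms(walltime_hours)}"',
    ]


# "cms" is a hypothetical queue name used only for this example.
for command in qmgr_commands("cms", 96, 144):
    print(command)
```

Printing the commands rather than executing them leaves room to check them against `qmgr -c "print queue cms"` before applying anything.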

PS. The documentation for Torque's qmgr is terrible - I almost had to guess the resource name for CPU time. Yuk!
