Monday, April 30, 2007

Intervention At Edinburgh

I had to intervene at Edinburgh two weeks ago (on the 14th, just before I went to London for the T2 review). They had been failing JS since the Friday night. Logging on, I could see a stack of ops jobs queued, but nothing running on several WNs.

I tried starting the oldest ops job using runjob -cx, but that didn't work, giving the error:

04/14/2007 16:54:30;0080;PBS_Server;Req;req_reject;Reject reply code=15057(Cannot execute at specified host because of checkpoint or stagein files), aux=0, type=RunJob, from root@ce.epcc.ed.ac.uk

Not at all clear to me what was going on. I tried running different ops jobs and they all started and ran properly, so in the end I deleted that job from the queue and that seemed to ungunge things.

Torque seems to produce rather unhelpful information in this sort of case, unless I'm just looking in the wrong places.

2 comments:

SteveT said...

The following may be a solution, but it is not tested. I don't think it will do any harm anyway.

Create a submit filter to add -c n to job submission, i.e. create a file /usr/local/sbin/submit_filter.sh containing:

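#!/bin/sh
# Prepend a directive that disables checkpointing, then pass the job script through unchanged.
echo "#PBS -c n"
while IFS= read -r J
do
printf '%s\n' "$J"
done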

# chmod a+x /usr/local/sbin/submit_filter.sh

and then enable it by adding

SUBMITFILTER /usr/local/sbin/submit_filter.sh

to

/var/spool/pbs/torque.cfg

Create the file if need be. I expect a pbs_server restart will be needed.
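Since the filter is untested, it may be worth a dry run before relying on it. The sketch below is only an illustration: it wraps the same logic in a shell function (submit_filter, a stand-in name) so it can run anywhere, and feeds it a made-up three-line job script. It assumes the filter simply reads the job script on stdin and writes the modified script to stdout.

```shell
#!/bin/sh
# Stand-in for the filter script above, written as a function so the
# check runs without touching /usr/local/sbin.
submit_filter() {
    echo "#PBS -c n"
    while IFS= read -r line
    do
        printf '%s\n' "$line"
    done
}

# Hypothetical minimal job script, used only for this dry run.
sample_job='#!/bin/sh
#PBS -N dryrun
hostname'

# Pipe the sample through the filter and show what would be submitted.
filtered=$(printf '%s\n' "$sample_job" | submit_filter)
printf '%s\n' "$filtered"
```

On the CE itself the equivalent check would be to pipe a real job script through /usr/local/sbin/submit_filter.sh and confirm that the extra directive appears at the top and the rest is unchanged.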

Afterwards, check that newly submitted jobs show the following in the output of qstat -f <number>:

Checkpoint = n

and not

Checkpoint = u

as I expect they all now have?

This is only going to help new jobs so time will tell.
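The check described above could be scripted along these lines. This is a rough sketch, not something tried on the site: checkpoint_of is a hypothetical helper name, and the sample text is an invented excerpt in the "Attribute = value" layout that qstat -f prints, so the job ID and other fields are placeholders.

```shell
#!/bin/sh
# Print the value of the Checkpoint attribute from qstat -f style text on stdin.
checkpoint_of() {
    awk -F' = ' '$1 ~ /^[[:space:]]*Checkpoint$/ { print $2; exit }'
}

# Hypothetical excerpt of qstat -f output; only the Checkpoint line matters here.
sample_output='Job Id: 1234.ce.example.org
    Job_Name = STDIN
    Checkpoint = n
    Rerunable = True'

value=$(printf '%s\n' "$sample_output" | checkpoint_of)
if [ "$value" = "n" ]; then
    echo "checkpointing disabled"
else
    echo "unexpected Checkpoint value: $value"
fi
```

Against a live queue the same function would be used as qstat -f <number> | checkpoint_of, with n expected for jobs submitted after the filter was enabled.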

SteveT said...

Do you feel this has ever been fixed for you?
Do you have a magic solution?