I had to intervene at Edinburgh 2 weeks ago (14th, just before I went to London for the T2 review). They had been failing JS since the Friday night. Logging on I could see a stack of ops jobs, but nothing running on several WNs.
I tried starting the oldest ops job using runjob -cx, but that didn't work, giving the error:
04/14/2007 16:54:30;0080;PBS_Server;Req;req_reject;Reject reply code=15057(Cannot execute at specified host because of checkpoint or stagein files), aux=0, type=RunJob, from email@example.com
Not at all clear to me what was going on. I tried running different ops jobs and they all started and ran properly, so in the end I deleted that job from the queue and that seemed to ungunge things.
torque seems to produce rather unhelpful information in these sort of cases, unless I'm just looking in the wrong places.