Tuesday, July 10, 2007

Multiple VO Woes

Steve Lloyd and I sat down after lunch today to try to get to the bottom of why his dteam-submitted jobs always fail. Strangely, this seems to be an RB-specific problem: IC always works, Glasgow always fails, and RAL seems to come and go.

Using the Glasgow RB, we submitted a job to Edinburgh so that we could trace it through the batch system. The job arrived at Edinburgh and ran through the batch system. However, the RB continued to report it as:

Current Status: Scheduled
Status Reason: Job successfully submitted to Globus

Clearly this was not the case.

We had a good look through the logs on the RB, but there's no particular sign of things going wrong there - although it must be said that the logs are both dense and impenetrable.

When it became clear that there was no easy solution, I decided to try to reproduce the problem myself. Recall that I joined gridpp a while ago to help our local users and never had any trouble. Now, however, I can't get a single job to run as a gridpp member, even on the Glasgow cluster. And things are in fact even worse than for Steve, because my gatekeeper process dies almost instantly, so the job never even goes into the batch system:

grep 2007-07-10.14:49:10.0000028268.0000113028 /var/log/messages
Jul 10 14:49:16 svr016 gridinfo[9672]: JMA 2007/07/10 14:49:16 GATEKEEPER_JM_ID 2007-07-10.14:49:10.0000028268.0000113028 for /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=graeme stewart on 130.209.239.23
Jul 10 14:49:16 svr016 gridinfo[9672]: JMA 2007/07/10 14:49:16 GATEKEEPER_JM_ID 2007-07-10.14:49:10.0000028268.0000113028 mapped to gridpp001 (17601, 10016)
Jul 10 14:49:16 svr016 gridinfo[9672]: JMA 2007/07/10 14:49:16 GATEKEEPER_JM_ID 2007-07-10.14:49:10.0000028268.0000113028 has GRAM_SCRIPT_JOB_ID 1184075356:lcgpbs:internal_434559272:9672.1184075355 manager type lcgpbs
Jul 10 14:49:16 svr016 gridinfo[9672]: JMA 2007/07/10 14:49:16 GATEKEEPER_JM_ID 2007-07-10.14:49:10.0000028268.0000113028 JM exiting
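
For anyone wanting to repeat this sort of trace, here is a rough Python sketch that pulls the same lines out of a syslog file and boils them down to the mapped account, the job manager type and whether the job manager exited. It is only an illustration: the line pattern is inferred from the four lines above and may not match other sites' logs, and the names `trace_job` and `summarize` are invented for this example. One small advantage over the plain grep is that the ID is compared as a literal string, so the dots in it are not treated as regex wildcards (`grep -F` would do the same).

```python
import re

# Pattern for the job manager audit lines shown above. This is an assumption
# based only on those four sample lines; other setups may log differently.
JMA_LINE = re.compile(
    r"JMA \S+ \S+ GATEKEEPER_JM_ID (?P<jm_id>\S+) (?P<event>.*)$"
)


def trace_job(lines, jm_id):
    """Return the event text of every JMA line belonging to jm_id, in order."""
    events = []
    for line in lines:
        match = JMA_LINE.search(line)
        # Literal comparison: the dots in the ID are not regex wildcards here.
        if match and match.group("jm_id") == jm_id:
            events.append(match.group("event").strip())
    return events


def summarize(events):
    """Extract the mapped account, manager type and exit flag from events."""
    summary = {"account": None, "manager": None, "exited": False}
    for event in events:
        mapped = re.match(r"mapped to (\S+)", event)
        manager = re.search(r"manager type (\S+)", event)
        if mapped:
            summary["account"] = mapped.group(1)
        if manager:
            summary["manager"] = manager.group(1)
        if event == "JM exiting":
            summary["exited"] = True
    return summary
```

Feeding it the lines of `/var/log/messages` together with the GATEKEEPER_JM_ID from the grep gives a one-glance summary. A job whose events end in "JM exiting" within the same second it was mapped, as mine does, is one that never made it to the batch system.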

I'll now try and poke around inside the gatekeeper logs and see if I can come up with any indication why things are going wrong.

And what the hell's this got to do with the RB anyway? It's deeply puzzling and frustrating in equal measure.
