Thursday, June 07, 2007

Glasgow CE Flaky

It looks like we're having CE problems at Glasgow. We failed a SAM test at 1am, with the error "Globus error 79: connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ...". This is very reminiscent of the errors seen by our engineers who use GRAM submission.

We also seem to have gone long enough without running atlassgm tests that ATLAS has blacklisted us in the FCR - though here the tests are simply missing, so I don't know what went wrong.

We passed a test at 10:30am, so hopefully we'll be back in soon.

It's urgent that I get to the bottom of this.

I have checked the gatekeeper logs, and the jobs are being mapped properly to atlassgm at regular intervals. I have checked the WNs and there's nothing evil there - ssh working, disks not full, nfs mounts ok. The exit status of all the jobs from the batch system is 0. The failing jobs were not consistently given to one WN, which would otherwise have been an obvious explanation. We even passed tests on node040 yesterday, then failed there in the early hours of this morning.
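
For what it's worth, the "is it one bad WN?" check can be scripted rather than eyeballed. The following is only a rough sketch: the (node, passed) record format, the function names and the sample data are all made up for illustration, and pulling real records out of the gatekeeper or batch logs is left out entirely.

```python
from collections import Counter


def failures_by_node(results):
    """Count failed test jobs per worker node.

    `results` is an iterable of (node_name, passed) pairs. This record
    format is hypothetical, not the layout of any real log file.
    """
    counts = Counter()
    for node, passed in results:
        if not passed:
            counts[node] += 1
    return counts


def single_node_suspect(results):
    """Return the node name if every failure landed on one node, else None."""
    counts = failures_by_node(results)
    if len(counts) == 1:
        return next(iter(counts))
    return None


# Made-up sample records, for illustration only.
sample = [
    ("node040", True),
    ("node040", False),
    ("node017", False),
    ("node023", True),
]

print(failures_by_node(sample))
print(single_node_suspect(sample))
```

With the sample above the failures are spread over two nodes, so single_node_suspect returns None, which is the situation described here: no single WN to blame.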


