Tuesday, February 20, 2007
Mass job exit on UKI-SCOTGRID-GLASGOW
The Glasgow site had a very odd, and severe, drop in the number of running jobs at 05:47 yesterday. At first I thought a networking outage had chopped the jobs, but on further examination it seems not. The hundreds of jobs which exited were all owned by a single (non-production) ATLAS user and seemed to exit "naturally" - they hadn't exceeded any queue limits.
The jobs were also running with a high CPU efficiency (e.g., resources_used.cput=07:19:17 against resources_used.walltime=07:50:56, a cpu/wallclock ratio of about 93%) - but they all exited at once.
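For anyone wanting to check the same ratio on their own jobs, here is a minimal Python sketch. It assumes the record is available as whitespace-separated key=value pairs with times in HH:MM:SS, as in the example above; the helper names hms_to_seconds and cpu_efficiency are made up for illustration and are not part of any batch system tool.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(record: str) -> float:
    """Return cput / walltime for a whitespace-separated key=value record."""
    # Assumption: each field looks like key=value with no embedded spaces.
    fields = dict(item.split("=", 1) for item in record.split() if "=" in item)
    cput = hms_to_seconds(fields["resources_used.cput"])
    walltime = hms_to_seconds(fields["resources_used.walltime"])
    return cput / walltime


# The values quoted in the post for one of the exited jobs.
record = "resources_used.cput=07:19:17 resources_used.walltime=07:50:56"
print(f"{cpu_efficiency(record):.1%}")  # prints 93.3%
```

A job that was stuck waiting on I/O or a dead network connection would normally show a much lower ratio, which is why the figure argues against a simple outage.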
Either the user found a problem with the jobs and cancelled them all (I wonder how one would see that from the gatekeeper logs; there doesn't seem to be any obvious way), or the jobs had a suicide pact.
Unfortunately, this must have put enough load on the CE that we dropped out of the information system for a time. Even running a separate site BDII doesn't seem to be a cure for all information system ills.