Tuesday, April 03, 2007

GRIS Wobbles in Glasgow CE

Between 12 and 1pm yesterday the site BDII stopped reporting on the queue statuses. We suffered the classic problem of reporting 4444 queued jobs and 0 job slots available.
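A check for this symptom is easy to script. The following is only a rough sketch, assuming the GRIS output has been dumped as plain LDIF text and that the queue state sits in the GLUE attributes GlueCEStateWaitingJobs and GlueCEStateFreeJobSlots. The attribute choice, the function names and the sample hostnames are assumptions for illustration, not what our monitoring actually runs. It flags any CE entry showing 4444 waiting jobs together with 0 free slots.

```python
# Sketch: spot CE entries that show the "4444 queued / 0 free" symptom.
# Attribute names and sample data are assumptions for illustration only.

PLACEHOLDER_WAITING = 4444  # the value quoted in this post


def parse_ldif(text):
    """Split simple LDIF text into a list of {attribute: value} dicts.

    Simplifications: folded (continued) lines are not handled, and a
    repeated attribute keeps only its last value.
    """
    entries = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current entry.
            if current:
                entries.append(current)
                current = {}
            continue
        if line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        entries.append(current)
    return entries


def suspicious_queues(text):
    """Return the GlueCEUniqueID of every entry showing the symptom."""
    flagged = []
    for entry in parse_ldif(text):
        waiting = entry.get("GlueCEStateWaitingJobs")
        free = entry.get("GlueCEStateFreeJobSlots")
        if waiting is None or free is None:
            continue  # not a CE state entry
        if not (waiting.isdigit() and free.isdigit()):
            continue  # ignore anything that is not a plain number
        if int(waiting) == PLACEHOLDER_WAITING and int(free) == 0:
            flagged.append(entry.get("GlueCEUniqueID", "unknown"))
    return flagged


# Made-up sample data: the first queue shows the symptom, the second is healthy.
SAMPLE = """\
dn: GlueCEUniqueID=ce.example.org:2119/jobmanager-pbs-q1,mds-vo-name=local,o=grid
GlueCEUniqueID: ce.example.org:2119/jobmanager-pbs-q1
GlueCEStateWaitingJobs: 4444
GlueCEStateFreeJobSlots: 0

dn: GlueCEUniqueID=ce.example.org:2119/jobmanager-pbs-q2,mds-vo-name=local,o=grid
GlueCEUniqueID: ce.example.org:2119/jobmanager-pbs-q2
GlueCEStateWaitingJobs: 3
GlueCEStateFreeJobSlots: 12
"""

if __name__ == "__main__":
    for queue in suspicious_queues(SAMPLE):
        print("suspicious:", queue)
```

In practice the text would come from an ldapsearch against the CE's GRIS rather than from a hard-coded sample. A genuinely busy queue could in principle match too, so a hit is a prompt to go and look, not proof of a fault.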

You can see from the sBDII plots that the amount of network traffic clearly dips. In fact, the gstat graphs show a clear dip in the number of entries. So it was the CE's GRIS that was misbehaving at this point.

I checked the CE for load, job floods, etc. There was nothing abnormal - we got 20 jobs in 1 minute, but we should be able to cope with that ok. There are no logs for the running slapd daemon, and I spotted nothing odd in /var/log/messages.

We also aborted on a few of Steve Lloyd's tests. However, very weirdly, the RB logs show an attempt to match a queue on host gla.scotgrid.ac.uk. Where did the name of the CE go? Is this related to the CE GRIS getting in a pickle?

So, one of these anomalous blips in the crappy information system again.

