

Between 12 and 1pm yesterday the site BDII stopped reporting on the queue statuses. We suffered the classic problem of reporting 4444 queued jobs and 0 job slots available.
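For spotting this symptom automatically, here is a minimal sketch (not something we actually run) that scans LDIF text dumped from a BDII or GRIS and flags any entry showing the placeholder state. It assumes the GLUE 1.x attribute names GlueCEStateWaitingJobs and GlueCEStateFreeJobSlots and the 4444 value quoted above. The function name and the sample hostnames are made up for illustration.

```python
# Placeholder value as quoted in this post; adjust if your site publishes another.
PLACEHOLDER_WAITING = 4444


def find_placeholder_queues(ldif_text, placeholder=PLACEHOLDER_WAITING):
    """Return the dn of every LDIF entry that reports the placeholder queue state.

    Simplification: folded (continued) LDIF lines are not handled.
    """
    suspects = []
    # LDIF entries are separated by blank lines.
    for entry in ldif_text.strip().split("\n\n"):
        attrs = {}
        for line in entry.splitlines():
            if ":" in line and not line.startswith("#"):
                # Split on the first colon only, so values containing ':' survive.
                key, _, value = line.partition(":")
                attrs[key.strip()] = value.strip()
        waiting = attrs.get("GlueCEStateWaitingJobs")
        free = attrs.get("GlueCEStateFreeJobSlots")
        if waiting == str(placeholder) and free == "0":
            suspects.append(attrs.get("dn", "<no dn>"))
    return suspects


if __name__ == "__main__":
    # Hypothetical sample data, not output from our site.
    sample = """\
dn: GlueCEUniqueID=ce.example.org:2119/jobmanager-pbs-q1,mds-vo-name=local,o=grid
GlueCEStateWaitingJobs: 4444
GlueCEStateFreeJobSlots: 0

dn: GlueCEUniqueID=ce.example.org:2119/jobmanager-pbs-q2,mds-vo-name=local,o=grid
GlueCEStateWaitingJobs: 3
GlueCEStateFreeJobSlots: 12
"""
    for dn in find_placeholder_queues(sample):
        print("placeholder state published by", dn)
```

Feeding it the output of an ldapsearch against the site BDII from cron would at least tell us when the blip starts, even if it says nothing about why.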
You can see from the sBDII plots that the amount of network traffic clearly dips. In fact, the gstat graphs show a clear dip in the number of entries. So it was the CE's GRIS that was misbehaving at this point.
I checked the CE for load, job floods, etc. There was nothing abnormal: we got 20 jobs in 1 minute, but we should be able to cope with that OK. There are no logs for the running slapd daemon, and I spotted nothing odd in /var/log/messages.
We also aborted on a few of Steve Lloyd's tests. However, very weirdly the RB logs are showing an attempt to match a queue on host gla.scotgrid.ac.uk. Where did the name of the CE go? Is this related to the CE GRIS getting in a pickle?
So, one of these anomalous blips in the crappy information system again.