Last time I was blogging, I mentioned a problem with our CREAM CE, and too many jobs in the BLAH registry.
Contrary to my initial theory, the all_done interval problem turned out not to be the culprit; instead it was down to the BLAH registry.
CREAM splits the whole deal with being a Compute Element into two main parts: the interaction with the wider world, which is handled with some Java code using Tomcat; and the direct interaction with the batch system, called BLAH, and written in C and shell script.
The Java code, which I'll refer to as CREAM, as distinct from the BLAH parts, keeps its state in a MySQL database. BLAH, on the other hand, uses a hand-rolled indexed file, with C functions for accessing and writing data.
The BLAH registry is updated by the command blah_job_registry_add after the qsub is complete, to record the mapping between the CREAM job ID and the batch system job ID. This is the step where we ran into problems. The version of CREAM we were running was set to purge jobs after about two months - and in two months we were putting just over half a million jobs through it.
With that many jobs in the registry, it was taking a noticeable time to add any job. Further, the locking effectively serialises access to the registry (i.e. table locking, in RDBMS parlance). Couple that with the Atlas pilot factory's favourite habit of dumping jobs in batches of 10 to 20 at a time, and you can see how some jobs ended up taking longer than the timeout to register.
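To make the arithmetic concrete, here's a rough sketch of what a registry-wide lock does to a batch of submissions. The numbers are made up for illustration (I'm assuming 2 seconds per add and a 30 second timeout; neither figure is a measurement from our CE), and the function names are just mine, not anything in BLAH.

```python
def completion_times(batch_size, seconds_per_add):
    """Return when each job in a batch finishes registering, in seconds.

    Assumes every registration holds one registry-wide lock for
    `seconds_per_add`, so jobs are processed strictly one at a time.
    """
    return [seconds_per_add * (i + 1) for i in range(batch_size)]


def jobs_timed_out(batch_size, seconds_per_add, timeout):
    """Count jobs whose registration finishes after `timeout` seconds."""
    times = completion_times(batch_size, seconds_per_add)
    return sum(1 for t in times if t > timeout)


# Hypothetical figures: a batch of 20 jobs, 2 s per add, 30 s timeout.
times = completion_times(20, 2.0)
print(times[-1])                      # 40.0: the last job waits for all the others
print(jobs_timed_out(20, 2.0, 30.0))  # 5: the tail of the batch misses the timeout
```

The point is that the wait grows linearly with the batch size, so the jobs at the back of a big batch are the ones that get hurt, even if a single add on its own looks tolerable.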
Just before we'd encountered this, there was a new version of CREAM released (glite-CREAM-3.2.8) that cut the default time before purging to about one month, and put the indices in an mmapped file; both should mitigate this problem. We limped along with some workarounds for a bit, before doing that update earlier this week. The update from 3.2.7 to 3.2.8 went very quickly, by the way; it took us about 5 minutes, although we did have to manually tidy up /etc/sudoers.
As it stands now, with about a quarter of a million jobs in the registry, it's taking a couple of seconds to register a job, with occasional pauses when there are many jobs pending. Thus far it's prevented a recurrence of a large number of blocked jobs, but I'll be keeping an eye on it.
The other CEs were having hardware issues, and we didn't want to have all the CEs down at once...