Thursday, November 25, 2010

Stale CREAM and Maui partitioning

Nothing terribly exciting, but we've done a bit of an update to our Maui configuration, and a CREAM problem has been cleared a-whey.

Previously, we've had our different eras of compute nodes annotated with the SPEED attribute, so that the scheduler understands that some are faster than others. For the vast majority of grid jobs this is an utterly irrelevant distinction - so long as the job gets the time it expects (Maui scales the requests of jobs that land on the slower nodes so they get proportionally longer).
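For illustration, the per-node annotations look something like this (node names and values other than node060's are made up; 1.0 is the reference speed):
    NODECFG[node001] SPEED=1.00
    NODECFG[node060] SPEED=0.94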

However, jobs that use more than one process (i.e. MPI jobs) are a different case - if they get scheduled across different classes of nodes, you get sub-optimal resource usage. So it's useful to keep some distinction between them. Previously we'd been using reservations to restrict where jobs can go - but there's an (ill-defined) upper bound on the number of overlapping reservations on a single job slot at once; too many breaks things.
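For context, restricting nodes with reservations in Maui generally means standing reservations of roughly this shape (a sketch with illustrative names and hosts, not our actual config):
    SRCFG[mpikit] HOSTLIST=node060,node061 CLASSLIST=mpi PERIOD=INFINITY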

So we had a look at partitions in Maui, which are really the proper way to handle this. The downside is that you're limited to three partitions: there's a compiled-in limit of four, one of which is taken by the special [ALL] partition, and one of the remaining three has to be DEFAULT. Fortunately, we have three eras of kit - and as long as we're happy calling one of them 'DEFAULT', it all works. And Maui understands never to schedule a single job across more than one partition at a time.

So we ended up with a lot of lines like:
    NODECFG[node060] SPEED=0.94 PARTITION=cvold
But in order to make them usable by all jobs, we had to adjust the default partition list to include them all:
    SYSCFG PLIST=DEFAULT:cvold
which gives all users equal access to all the partitions.
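If a particular queue should stick to a single era - the MPI case above - the same PLIST attribute should also work per credential rather than system-wide; something like the following (class name illustrative):
    CLASSCFG[mpi] PLIST=cvold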

Not terribly exciting Maui tweaks, but sometimes that's the way of it.


The CREAM problem manifested itself as a set of jobs that had gone sour. There's a known problem in the current versions of CREAM where, if the job is queueing on PBS for over an hour, CREAM (or rather, the BLAH parser) thinks it's dead and kills the job, with reason=999. The workaround is to lengthen the alldone_interval setting in blah.config so BLAH waits longer before giving up.

What's not made explicit is that there must be no spaces around the equals sign when you set it! i.e. you must have

    alldone_interval=7200

because if you put alldone_interval = 7200, CREAM doesn't understand it - blah.config is read as a shell script, so the spaced version is treated as a command rather than an assignment. So we fixed that, and it was all hunky-dory for a while. Then we started getting lots of blocked jobs in Torque again, all from CREAM.

Cue more digging.

Eventually found this in the CREAM logs (after a slight reformatting):

JOB CREAM942835851 STATUS CHANGED: PENDING => ABORTED
[failureReason=BLAH error: no jobId in submission script's output
(stdout:) (stderr:/opt/glite/etc/blah.config: line 81: alldone_interval: command not found-
execute_cmd: 200 seconds timeout expired, killing child process. )

So, two things here. Firstly, the spaced version of alldone_interval had crept back in alongside the correct one - in our case via cfengine directives, probably down to the Double CREAM plan. More interesting was that having the invalid line in the BLAH config slows BLAH down (BUpdaterPBS was pegging a CPU at 100%), sufficiently that it hits another timeout - 200 seconds to respond at all. CREAM then kills the job but doesn't actually tell Torque; the sandbox is blown away, so the job can't finish (nowhere to put its output) or, if it hasn't started, can't start.

Removing the second (wrong) version of alldone_interval fixed that - CPU use in the parser dropped to trivial levels, and all appears to be happy again. This one's not really CREAM's fault, but it's always good to have an idea of what misconfigured services end up doing, otherwise it's hard to fix those 'not enough coffee' incidents. Hence, this one's for Google...
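For anyone else chasing this, a quick way to spot the spaced version creeping back in (the path is the one from the error above):

    grep -n 'alldone_interval' /opt/glite/etc/blah.config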

UPDATE: Oops! Spoke too soon. That's the problem defined, but clearly not the solution - as it's happened again. Gonna leave this here as a reminder to self to give it a bit longer before considering something fixed...
