Thursday, August 09, 2007

Pheno goes bang, take two!

A problem started at about 02:45 this morning. The large number of pheno jobs that had accumulated in the queued state started to fail when run. Once failed, a job would go into the waiting state, triggering Maui to decide which job to run next.

With the current usage and fairshares, Maui's decision is to run the (apparently) broken pheno jobs. This keeps the server load high and starves the cluster of long-running jobs (there have been 1-minute average load spikes of over 600!).

Look familiar? Here's an entry with very similar symptoms.

I'm still trying to get to the bottom of what's actually happening, but I've started deleting the jobs, as they clearly cannot run and are having a detrimental effect on the cluster.
