Tuesday, November 14, 2006

Subject: batch system completely screwed
Date: Mon, 13 Nov 2006 16:19:31 +0000

David, Tony

I cannot seem to get the batch system stable after this weekend's mess (see http://scotgrid.blogspot.com/2006/11/cluster-had-its-first-weekend-down.html).

torque and maui will run for some tens of minutes, then lock up and need to be restarted. Then they run for a while longer and die again. I have suspicions that some of the worker nodes are bouncing jobs, but with the general unreliability of the system at the moment this is hard to demonstrate one way or the other.

Clearly there's something deeply screwed up here. The rate of dying nodes with MCE errors probably isn't helping anyone.

I don't seem to have any option but to put the site into downtime and completely gut the batch system.

However, I will also take the opportunity to use cfengine to redo the CE and incorporate local type accounts into the batch system as well.

Various deeply offensive words should be said about the unhelpfulness of torque and maui in this situation.

On the bright side, cfengine is now doing a splendid job on the worker nodes and this should be extended.

*sigh*

Graeme
