Tuesday, September 25, 2007

Batch System Goes on Holiday?

When I started to fiddle with the UI and RB on Saturday night, I discovered that the site was failing SAM tests with the usual, marvellously descriptive error "Unspecified gridmanager error".

Further investigation showed that the torque and maui servers were not running. When I restarted them, the site recovered immediately. The very curious thing, though, was that torque logfile entries were still being written, so some part of torque was running, but not enough to accept new jobs.

We need a Nagios alarm on this. Paul tells me that there is a torque.available metric in the MonAMI sensor, so we should be able to monitor this passively - see the above graph, which shows the dropout on Saturday afternoon.
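For illustration, here is a minimal sketch of what an active check for this could look like, written as a Nagios-style plugin in Python. It is not the MonAMI-based passive check mentioned above. The daemon names in DAEMONS (pbs_server and maui), the use of pgrep, and the message wording are all assumptions to adjust for the local setup. Note also that a plain process check might not have caught this particular failure, since some part of torque was evidently still alive and writing logs.

```python
"""Sketch of a Nagios-style check for the batch system daemons."""
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed daemon process names; adjust for the local installation.
DAEMONS = ("pbs_server", "maui")


def is_running(name):
    """Return True if a process with exactly this name exists (pgrep -x)."""
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def evaluate(status):
    """Map a {daemon: is_running} dict to an (exit_code, message) pair."""
    if not status:
        return UNKNOWN, "BATCH UNKNOWN - no daemons checked"
    down = sorted(name for name, up in status.items() if not up)
    if down:
        return CRITICAL, "BATCH CRITICAL - not running: " + ", ".join(down)
    return OK, "BATCH OK - running: " + ", ".join(sorted(status))


def main():
    """Probe each daemon, print the one-line status, return the exit code."""
    code, message = evaluate({name: is_running(name) for name in DAEMONS})
    print(message)
    return code

# To use this as a plugin, end the script with:
#   import sys; sys.exit(main())
```

Keeping evaluate() separate from the probing means the alert logic can be tried out without a batch server to hand, and a stricter probe (one that actually asks the server whether it will accept jobs) could be swapped in for is_running() later.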

1 comment:

Andrew Elwell said...

sorted. Nagios now has a check called 'batch status' which alerts on either torque or maui going AWOL