When I started to fiddle with the UI and RB on Saturday night, I discovered that the site was failing SAM tests with the usual, marvellously descriptive error "Unspecified gridmanager error".
Further investigation showed that the torque and maui servers were not running. When I restarted them, the site recovered immediately. The curious thing, though, was that torque logfile entries were still being written, so some part of torque was running, but not enough to accept new jobs.
We need a Nagios alarm on this. Paul tells me that there is a torque.available metric in the MonAMI sensor, so we should be able to monitor this passively - see the above graph, which shows the dropout on Saturday afternoon.
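
For illustration only, here is a minimal sketch of what such a check could look like as a Nagios plugin written in Python. It is not the MonAMI-based passive check described above: it assumes an active probe that runs the client commands `qstat -B` (torque) and `showq` (maui) and treats a non-zero exit or a timeout as "not available". The function names `batch_status` and `probe`, the output wording and the 10 second timeout are all made up for the example. The only fixed convention is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Probing with a client command rather than looking for a process matters here, because part of torque was evidently still alive while jobs were being refused.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def batch_status(available):
    """Map per-service availability to a Nagios (exit code, message) pair.

    `available` maps a service name to True (responding),
    False (not responding) or None (could not be determined).
    """
    down = sorted(name for name, up in available.items() if up is False)
    unknown = sorted(name for name, up in available.items() if up is None)
    if down:
        return CRITICAL, "BATCH CRITICAL - not responding: " + ", ".join(down)
    if unknown:
        return UNKNOWN, "BATCH UNKNOWN - no data for: " + ", ".join(unknown)
    return OK, "BATCH OK - " + ", ".join(sorted(available)) + " responding"


def probe(command, timeout=10):
    """Run a client command; True on exit 0, False on failure or timeout,
    None if the command is not installed on this host."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except FileNotFoundError:
        return None
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def main():
    # Assumed probes: adjust to whatever the site actually uses.
    code, message = batch_status({
        "torque": probe(["qstat", "-B"]),
        "maui": probe(["showq"]),
    })
    print(message)
    return code
```

To use it as a plugin, the script would end with `import sys; sys.exit(main())` so that Nagios sees the exit code. Feeding `batch_status` from the MonAMI torque.available metric instead of `probe` would give the passive variant.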
1 comment:
Sorted. Nagios now has a check called 'batch status' which alerts on either torque or maui going AWOL.