We then started to fail SAM tests and dropped out of the ATLAS BDII, Steve's tests couldn't resource-match us, and so on. Bad day.
It took quite a few hours to sort out the mess, and a further few hours to stabilise the site.
Our GGUS ticket names and shames.
It's a very different ball game on the grid: we have 6000+ users who can submit jobs, and it's not hard to kill a worker node. Torque generally handles this badly and all hell breaks loose.
Action plan:
- Nagios alarms on WN disk space
- Group quotas on job scratch area
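
For the first item, something along these lines might do as a starting point. It is only a rough sketch, not what we actually run: the 20%/10% thresholds, the default /tmp path and the function names are all made up for illustration. The only fixed part is the standard Nagios plugin convention of returning 0 for OK, 1 for WARNING and 2 for CRITICAL.

```python
import shutil

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(free_fraction, warn=0.20, crit=0.10):
    """Map a free-space fraction to a Nagios status.

    The thresholds are illustrative assumptions, not tuned values.
    """
    if free_fraction < crit:
        return CRITICAL
    if free_fraction < warn:
        return WARNING
    return OK


def check_path(path, warn=0.20, crit=0.10):
    """Return (status, message) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    status = classify(free_fraction, warn, crit)
    return status, f"DISK {LABELS[status]} - {path} {free_fraction:.0%} free"


def main(argv):
    """Check the path given as the first argument (default: /tmp)."""
    path = argv[1] if len(argv) > 1 else "/tmp"
    status, message = check_path(path)
    print(message)
    return status

# As a plugin script you would finish with: sys.exit(main(sys.argv))
```

Nagios reads the first line of output and the exit code, so the wrapper only has to print the message and exit with the status. Pointing it at the job scratch area as well as /tmp would be the obvious next step.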
1 comment:
watching and cleaning /tmp with cfengine is good :) of course if it gets too bad, cfengine can drop the node from pbs :)