Tuesday, August 14, 2007

Bad, bad, biomed....

A very flaky day - we had a biomed user throw jobs into the system which trashed worker nodes by filling up /tmp. This left lots of nodes in a weird state where they seemed to run out of memory (ssh and ntp Nagios alarms firing). Jobs couldn't start properly on these nodes, so they became black holes: one of our local users lost 47 of 50 jobs, another lost 124 of 150.

We then started failing SAM tests and dropped out of the ATLAS BDII, so Steve's tests couldn't resource match us, and so on. Bad day.

It took quite a few hours to sort out the mess, and a further few hours to stabilise the site.

Our GGUS ticket names and shames.

It's a very different ball game on the grid - we have 6000+ users who can submit jobs, and it's not hard to kill a worker node. Torque generally handles this badly and all hell breaks loose.

Action plan:
  1. Nagios alarms on WN disk space (a rough sketch follows below)
  2. Group quotas on the job scratch area
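For item 1, the stock check_disk plugin pointed at /tmp would do, but a minimal custom check is easy enough to sketch. This is just an illustration, not what we've deployed: the path and thresholds are made up, and the only thing it relies on is the standard Nagios plugin exit convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

    #!/usr/bin/env python
    # Minimal sketch of a Nagios-style disk space check for a WN
    # scratch area. Exit codes follow the usual Nagios plugin
    # convention: 0 = OK, 1 = WARNING, 2 = CRITICAL.
    # Path and thresholds are illustrative, not our real config.
    import os
    import sys

    PATH = "/tmp"
    WARN_MB = 1024   # warn below 1 GB free (made-up threshold)
    CRIT_MB = 256    # critical below 256 MB free (made-up threshold)

    def free_mb(path):
        st = os.statvfs(path)
        # f_bavail: blocks available to unprivileged users, i.e. jobs
        return st.f_bavail * st.f_frsize // (1024 * 1024)

    free = free_mb(PATH)
    if free < CRIT_MB:
        print("TMP CRITICAL: %d MB free on %s" % (free, PATH))
        sys.exit(2)
    elif free < WARN_MB:
        print("TMP WARNING: %d MB free on %s" % (free, PATH))
        sys.exit(1)
    else:
        print("TMP OK: %d MB free on %s" % (free, PATH))
        sys.exit(0)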

1 comment:

Colin Morey said...

watching and cleaning /tmp with cfengine is good :) of course if it gets too bad, cfengine can drop the node from pbs :)
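For what it's worth, here's roughly what Colin's suggestion looks like, sketched in Python rather than cfengine. The ages and thresholds are invented for illustration; the only real command assumed is pbsnodes -o, which marks a node offline in Torque so no further jobs get scheduled onto it.

    #!/usr/bin/env python
    # Sketch of the cfengine idea from the comment above: sweep stale
    # files out of /tmp, and if the partition is still critically
    # full, offline the node in Torque. Ages/thresholds are made up.
    import os
    import subprocess
    import time

    TMP = "/tmp"
    MAX_AGE_DAYS = 7     # delete files untouched for a week
    MIN_FREE_MB = 256    # offline the node below this much free space

    now = time.time()
    for name in os.listdir(TMP):
        path = os.path.join(TMP, name)
        try:
            age = now - os.path.getmtime(path)
            if os.path.isfile(path) and age > MAX_AGE_DAYS * 86400:
                os.remove(path)
        except OSError:
            pass  # file vanished or isn't ours; skip it

    st = os.statvfs(TMP)
    free_mb = st.f_bavail * st.f_frsize // (1024 * 1024)
    if free_mb < MIN_FREE_MB:
        # 'pbsnodes -o <host>' marks this node offline in Torque
        subprocess.call(["pbsnodes", "-o", os.uname()[1]])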