We went through a little rash of SAM test failures last night. The cause turned out to be an LHCb user whose jobs filled up the scratch area on the worker nodes and turned them into black holes.
Obligatory GGUS ticket was raised.
We do alarm against disk space filling up on the worker nodes, but it was still 4 hours before action was taken and the nodes were set offline and cleaned. In that time an awful lot of jobs were destroyed. Makes me think we might want to automate the offlining of nodes which run out of disk space, pending investigation.
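As a rough idea of what that automation might look like, here is a minimal Python sketch of a check that could run from cron on each worker node. Everything specific in it is an assumption rather than our actual setup: the scratch path, the 5% free-space threshold, and the use of Torque's `pbsnodes -o` to mark the node offline. The function names are made up for illustration, and it defaults to a dry run so nothing is offlined by accident.

```python
import shutil
import socket
import subprocess

SCRATCH_PATH = "/tmp"        # assumed scratch area; adjust for the site
MIN_FREE_FRACTION = 0.05     # assumed threshold: offline below 5% free


def free_fraction(path):
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def should_offline(free, min_free=MIN_FREE_FRACTION):
    """True when the free fraction has dropped below the threshold."""
    return free < min_free


def offline_command(node):
    """Command to mark a node offline (assumes a Torque/PBS batch system)."""
    return ["pbsnodes", "-o", node]


def check_and_offline(path=SCRATCH_PATH, node=None, dry_run=True):
    """Offline the node if scratch is nearly full.

    Returns the command that was (or would be) run, or None if the
    node has enough free space.
    """
    node = node or socket.gethostname()
    if not should_offline(free_fraction(path)):
        return None
    cmd = offline_command(node)
    if not dry_run:
        # Needs to run somewhere with permission to manage the batch system.
        subprocess.run(cmd, check=True)
    return cmd
```

Calling `check_and_offline(dry_run=False)` would actually offline the node. It deliberately does not clean anything up or bring the node back: the node stays offline until someone has investigated.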