Wednesday, December 13, 2006

We started failing SAM tests yesterday with the rather unhelpful "Unspecified job manager error". This usually means there was a problem starting the job at all. As usual, debugging this is a black art - nothing useful at all in globus-gatekeeper.log or the torque logs. Finally found emails from the batch system to the pool account users stating that there was no space in /home so their job could not be started. Poked around and it became clear that the problem was /home on the WNs.

Investigating, I found that /home had been filled up by ATLAS jobs which had seemingly unpacked >3GB of software into their torque working directories - this filled up /home and left the node dead to the batch system. Eventually we'd built up such a stack of these that SAM jobs found their way here and started to fail. It took quite some time to identify all the relevant WNs and clear out the mess.
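A full /home is the sort of thing a simple per-node check can flag before jobs start dying. What follows is only a rough sketch, not what we actually run: the function names and the 5% free-space threshold are my own assumptions, and the only real library call is Python's shutil.disk_usage.

```python
import shutil


def low_on_space(free_bytes, total_bytes, min_free_fraction=0.05):
    """Return True when the free share of a filesystem is below the threshold.

    The 5% default is an arbitrary assumption, not a site policy.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes < min_free_fraction


def check_path(path="/home", min_free_fraction=0.05):
    """Apply low_on_space to the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return low_on_space(usage.free, usage.total, min_free_fraction)


if __name__ == "__main__":
    # Report on /home for whichever node this is run on.
    print("/home low on space:", check_path("/home"))
```

Run on each WN (from cron, say), this would pick out the nodes with a full /home without having to log in and poke around each one by hand.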

Raised a ticket on ATLAS - got a quick response from Fredric and then from Rod. It seems that EDG_WL_SCRATCH was wrong. Arggg, so partly my fault after all! I note that in the environment there were 4 environment variables pointing to /tmp (TMPDIR, EDG_TMP, GLITE_LOCATION_TMP, LCG_TMP) but EDG_WL_SCRATCH was pointing to the non-existent /local/glite (initially I had intended to put the large "scratch" partition on the WNs here, but later changed my mind to /tmp).
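A mismatch like this is easy to test for. Here is a minimal sketch of such a check, assuming all we want to know is that each of the five variables above is set and points at an existing, writable directory. The variable names are the ones from the job environment; the function check_scratch_vars is hypothetical and not part of any grid middleware.

```python
import os

# Variable names as seen in the job environment on the WNs.
SCRATCH_VARS = ["TMPDIR", "EDG_TMP", "GLITE_LOCATION_TMP", "LCG_TMP", "EDG_WL_SCRATCH"]


def check_scratch_vars(env, names=SCRATCH_VARS):
    """Return {name: problem} for each variable that is unset or unusable.

    Hypothetical helper for illustration; `env` is any mapping such as os.environ.
    """
    problems = {}
    for name in names:
        path = env.get(name)
        if not path:
            problems[name] = "unset"
        elif not os.path.isdir(path):
            problems[name] = "not a directory: " + path
        elif not os.access(path, os.W_OK):
            problems[name] = "not writable: " + path
    return problems


if __name__ == "__main__":
    # Print one line per broken variable; no output means everything checked out.
    for name, problem in sorted(check_scratch_vars(os.environ).items()):
        print(name, problem)
```

With the environment described above, this would have reported EDG_WL_SCRATCH as not a directory while the four /tmp variables passed.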

However, how can ATLAS jobs run with any sort of efficiency if they try and unpack GBs of software?

Things look a bit better for recent jobs, but there is a stack of ~400 ATLAS jobs which seem stuck in the cluster - 20s CPU time but up to days of wall time. I will email Fredric and see if I can delete them to unblock things.
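Picking the stuck jobs out of the queue comes down to comparing CPU time with wall time. A rough sketch of that filter follows, assuming both times are available as HH:MM:SS strings; the function names and the thresholds (at most 60s of CPU, at least a day of wall time) are my own illustrative choices, not anything agreed with ATLAS.

```python
def hms_to_seconds(text):
    """Convert an 'HH:MM:SS' string to seconds; hours may exceed 24."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def looks_stuck(cput, walltime, max_cpu_s=60, min_wall_s=24 * 3600):
    """True for a job with almost no CPU time after a long wall time.

    The thresholds are assumptions chosen for illustration.
    """
    return hms_to_seconds(cput) <= max_cpu_s and hms_to_seconds(walltime) >= min_wall_s


if __name__ == "__main__":
    # A job with 20s of CPU after two days of wall time is flagged.
    print(looks_stuck("00:00:20", "49:12:03"))
```

Anything this flags would still want a look by hand before deleting, since a job legitimately waiting on I/O looks much the same.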
