Thursday, July 12, 2007
Pheno goes Bang!
Crisis on the cluster this morning. After a long night of job submission by a phenogrid user (more than 1000 jobs), the cluster went into a spasm in which the pheno jobs started to hit wait state en masse. What I think happened is that as torque saw each pheno job fail to start and go into wait, it immediately picked the next pheno job, tried to start that, failed, moved on to the next, and so on. This resulted in a load storm within torque (loads >100), which was then unable to answer even normal client queries - so maui locked up and the GIP plugin started to time out.
When I realised what was happening (and the pheno jobs were still coming in) I added the user's DN to the LCAS ban_users.db file. I then carried out some debugging tests, restarting maui, clearing out maui stats files, etc. In the end I saw no option but to qdel the user's waiting jobs, to attempt to take the pressure off torque.
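For the record, the clean-up step can be sketched in a few lines of Python. This is a rough sketch rather than what I actually typed: it assumes the default six-column `qstat` listing (Job id, Name, User, Time Use, S, Queue) with no spaces in job names, and the function names and the `pheno001` account are made up for illustration. It defaults to a dry run so nothing is deleted by accident.

```python
import subprocess


def waiting_job_ids(qstat_output, user):
    """Return the IDs of jobs owned by `user` that are in state W (waiting).

    Assumes the default qstat column layout:
    Job id, Name, User, Time Use, S, Queue.
    """
    job_ids = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Job rows have exactly six fields; the header row splits into more,
        # and the dashed separator row is skipped explicitly.
        if len(fields) != 6 or fields[0].startswith("-"):
            continue
        job_id, _name, owner, _time_used, state, _queue = fields
        if owner == user and state == "W":
            job_ids.append(job_id)
    return job_ids


def delete_jobs(job_ids, dry_run=True):
    """Run qdel on each job ID, or just print the commands when dry_run is set."""
    for job_id in job_ids:
        if dry_run:
            print("would run: qdel", job_id)
        else:
            subprocess.run(["qdel", job_id], check=False)


if __name__ == "__main__":
    # Hypothetical local account name for the mapped grid user.
    listing = subprocess.run(
        ["qstat"], capture_output=True, text=True, check=True
    ).stdout
    delete_jobs(waiting_job_ids(listing, "pheno001"), dry_run=True)
```

Deleting one job at a time is deliberately gentle: with torque already struggling, a steady trickle of `qdel` calls seems safer than one huge command line.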
Once the jobs were flushed out of the system, torque quite quickly started to recover. Maui started to respond again and the GIP plugin could get sensible answers.
Why were the jobs going into waiting state? The error the user seemed to be getting back was "Globus error 158: the job manager could not lock the state lock file." This seems to be an error which crops up when a job is being cancelled. There was a strange mix of jobs from this user - some with VOMS extensions, some with a vanilla proxy. Was this a problem with proxy renewal, with the gatekeeper trying to cancel jobs which it no longer had the right to cancel? The problem kicked in at almost exactly the time that the user's original submission proxy expired and the RB would have renewed it from the RAL MyProxy server. The wrong proxy might well also have affected the ability of the jobs to start - hence sparking the wait crisis.
Once I was satisfied that the cluster was stable again, I took the user out of the banned list. Their jobs are now flowing back into the cluster, interestingly all with the vanilla proxy.
I will keep a close eye on the cluster to check that this doesn't happen again.
Postscript: VOMS proxy renewal is broken: http://savannah.cern.ch/bugs/?func=detailitem&item_id=15208