Friday, August 17, 2007

Pheno Attacks!

Well, to round off a dreadful week of users slapping the whole system about, the CE load spiked and we hit a system CPU storm again this afternoon. Fortunately, this time I caught it within an hour and killed off the offending processes, and the system recovered OK.

Very oddly, it was caused by a pheno user's gatekeeper processes stalling: gobbling CPU but failing to submit any jobs to the queue. I had to add their DN to the banned list and kill off these processes pending further investigation.

Action plan:
  • Nagios alarm on cpu_system > 10%.
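
For the record, here is a rough sketch of what such a check might look like as a Nagios plugin. It is only an illustration, not what is deployed: it samples the aggregate "cpu" line of /proc/stat twice (Linux only) and reports the share of time spent in system mode. The 10% warning level comes from the action item above; the 20% critical level, the 5 second sampling interval, the CPU_SYSTEM label and the function names are all assumptions made for the example.

```python
#!/usr/bin/env python
"""Sketch of a Nagios-style check for system-mode CPU usage (Linux only)."""
import sys
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes


def parse_cpu_line(line):
    """Return the counters from the aggregate 'cpu' line of /proc/stat as ints."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("not an aggregate cpu line: %r" % line)
    return [int(x) for x in fields[1:]]


def system_percent(before, after):
    """Percentage of elapsed CPU time spent in system mode between two samples."""
    # Use only user..steal (first 8 counters); guest time is already in user.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[2] / total  # index 2 = system (user, nice, system, ...)


def status_for(pct, warn, crit):
    """Map a percentage to a Nagios (exit code, label) pair."""
    if pct > crit:
        return CRITICAL, "CRITICAL"
    if pct > warn:
        return WARNING, "WARNING"
    return OK, "OK"


def read_sample(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate cpu line."""
    with open(path) as handle:
        return parse_cpu_line(handle.readline())


def main(warn=10.0, crit=20.0, interval=5.0):
    # warn=10 matches the action item; crit and interval are assumed values.
    try:
        before = read_sample()
        time.sleep(interval)
        after = read_sample()
    except (IOError, OSError, ValueError) as exc:
        print("CPU_SYSTEM UNKNOWN - %s" % exc)
        return UNKNOWN
    pct = system_percent(before, after)
    code, label = status_for(pct, warn, crit)
    # Text after the '|' is Nagios performance data: value;warn;crit
    print("CPU_SYSTEM %s - system CPU %.1f%% | cpu_system=%.1f%%;%g;%g"
          % (label, pct, pct, pct and warn or warn, crit))
    return code


if __name__ == "__main__":
    sys.exit(main())
```

Sampling twice and taking the difference matters here, because the counters in /proc/stat are cumulative since boot; a single reading would only give the long-run average and would hide a sudden storm like this one.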

