Friday, September 28, 2007
Biomed Stalled Jobs
Since we came back up after the upgrade to SL4, I have noticed a very large number of stalled biomed jobs on the cluster.
These were all jobs that had stalled while running python ./get_task.py autodock AVIANFLUDC2_T02IAN3J1170 (or something very like it).
As the cluster hadn't been full, and I was very busy, I let this situation go for most of the week. However, today I emailed the user (found through the CIC portal user lookup). I got a very quick response: there is a known problem with an overloaded AMGA server, which was causing these stalls. I was given permission to kill the jobs, which I did.
Although it's a good thing (tm) to get in touch with users, as our stalled jobs guide says, it is time-consuming, and I wish there were some form of automation we could apply.
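As a starting point for that kind of automation, here is a minimal sketch of how stalled jobs might be flagged and grouped by owner, so that each user gets one email rather than one per job. Everything in it is an assumption for illustration: the Job record, the function names, and above all the definition of "stalled" (a long-running job that has used almost no CPU time). The thresholds are made up, and filling in the records from the batch system is left out entirely.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; fields would come from the batch system."""
    job_id: str
    owner: str
    cpu_seconds: float   # CPU time consumed so far
    wall_seconds: float  # wall-clock time since the job started


def find_stalled(jobs, min_wall_seconds=6 * 3600, max_cpu_fraction=0.01):
    """Return jobs that have run a long time while using almost no CPU.

    Both thresholds are illustrative assumptions, not site policy.
    """
    stalled = []
    for job in jobs:
        # Skip jobs that are too young to judge (also avoids dividing by zero).
        if job.wall_seconds <= 0 or job.wall_seconds < min_wall_seconds:
            continue
        if job.cpu_seconds / job.wall_seconds <= max_cpu_fraction:
            stalled.append(job)
    return stalled


def group_by_owner(jobs):
    """Map each owner to the list of their job IDs, for one email per user."""
    by_owner = defaultdict(list)
    for job in jobs:
        by_owner[job.owner].append(job.job_id)
    return dict(by_owner)
```

This only produces a list of candidates. Contacting the user and getting permission before killing anything would still be a manual step.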