Friday, June 13, 2008

Who's crying now?

Our poor WMS has been dead since the end of last week, when a pheno user submitted about 20k jobs to it. Worse, they hadn't used a proxy renewal service, so the VOMS extension on their proxy expired and the jobs they had submitted suffered shallow failures, prompting further resubmission attempts and further load.

We have contacted the user, but they have had a great deal of trouble even cancelling the jobs. We've left things for 2 days now, but the situation is really not improving. I can't see much hope for the machine in its current state - we'll probably have to blow it away and start again next week.

For the moment we've asked our users to revert to using ye olde RB, which we were just about to switch off but hadn't actually decommissioned.

There's a serious question now about what level of WMS service we want to provide. It's a complex service and rather difficult to debug when it goes wrong. An upgrade to the SL4 version should certainly be done, but do we want to have 2 WMS hosts and possibly even a separate LB service?

