ScotGrid: Leaks caused by frozen ICE

We had a rather quiet time here over winter - a slight hiccough with a disk server, but otherwise all rather stable. The big freeze didn't result in much.

Except for the ice freezing up, and causing leaky pipes.

That's ICE - the WMS plugin that submits to CREAM. It turns out that it can break the pipes, and leak bits of past jobs. This resulted in an error message like:

Warning - Unable to submit the job to the service: https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server
System load is too high:
(Not all processes could be identified, non-owned process info will not be shown, you would have to be root to see it all.)
Threshold for ICE Input JobDir jobs: 1500 => Detected value for ICE Input JobDir jobs /var/glite/ice/jobdir : 1514

from both WMSen. In principle, this is reasonable: it's saying that the WMS is loaded up, so no more jobs for the moment. A decent way of ensuring running jobs are not harmed by new submissions, in the event the system explodes.

Except the system load was the lowest I've seen on them, under 1.0. I dug into the underlying Condor instance, which had only a few jobs in it, and the hunt commenced for the 1500 phantom jobs.

As the error message suggests, /var/glite/ice/jobdir/old had 1514 files in it, each one representing a past job. However, most of these were stale - over a month old. Given that the WMS is supposed to purge jobs after a month (if the user doesn't do it earlier), that shouldn't have been the case.
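That check (how many files are in the directory, and how many of them are past the purge age) can be sketched in a few lines of Python. This is an illustration only, not the commands used at the time: the function name count_stale and the 30-day cutoff standing in for "a month" are assumptions, and file modification time is assumed to be a fair proxy for the age of the job.

```python
import os
import time


def count_stale(jobdir, max_age_days=30, now=None):
    """Return (total, stale) counts of regular files directly inside jobdir.

    A file counts as stale when its modification time is more than
    max_age_days in the past. Subdirectories and symlinks are ignored.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    total = 0
    stale = 0
    with os.scandir(jobdir) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            total += 1
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                stale += 1
    return total, stale
```

Calling count_stale("/var/glite/ice/jobdir/old") on the WMS would return the two numbers to compare against the 1500 threshold from the error message.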

Derek down at RAL confirmed this - it's apparently a known bug, but I can't quite see it on the gLite Known Issues page. It looks like most of the UK's WMSes fell to this at the same time. I think that's due to the increased number of CREAM CEs (so the rate of use of ICE is climbing), and the failover on the clients if one WMS is down - resulting in a nice, even distribution of failure.

In the end, the fix was simple - I moved the files older than a month out of /var/glite/ice/jobdir/old. Deletion ought to be safe, but they're tiny, so there's no harm in keeping them. I'll need to automate that until such time as the bug is fixed - but I also need to watch in case the usage increases further, and 1500 isn't enough to last out a month of use. In that case, I think I'd probably temporarily increase the limit on the WMS (I believe it's in a configuration file), knowing that most of them are stale phantoms.
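As a starting point for that automation, here is a minimal Python sketch that moves files older than a cutoff out of one directory and into another. It is an assumption about how one might do it, not a tested production script: the function name archive_stale, the 30-day default, and any archive location are hypothetical, and age is judged by file modification time.

```python
import os
import shutil
import time


def archive_stale(src, dest, max_age_days=30, now=None):
    """Move regular files older than max_age_days from src into dest.

    Returns the sorted list of file names that were moved. Files are
    moved rather than deleted, so they can be restored if needed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    os.makedirs(dest, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(src)):
        path = os.path.join(src, name)
        # Skip subdirectories and symlinks; only plain files are candidates.
        if not os.path.isfile(path) or os.path.islink(path):
            continue
        if os.path.getmtime(path) < cutoff:
            shutil.move(path, os.path.join(dest, name))
            moved.append(name)
    return moved
```

Run daily from cron against /var/glite/ice/jobdir/old with an archive directory of your choosing, this would keep the count down to roughly a month's worth of genuine jobs.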

The only discussion I can find related to that error message ended up pointing the finger at the glite-wms-ice-safe process. ICE has two processes, and it appears that ice-safe is the part responsible for cleaning up. However, as far as I can tell, both processes are running on each of our WMSes, so this appears to be a different case from the previous one. It might be that the ice-safe process died at some point, and when it was restarted it didn't remove the old jobs? I don't know - if I find out I'll update here.

The purpose of this post is to get the error message from the WMS into Google, on the same page as something that talks about the issue and its resolution. In case it freezes up on us again.

Tuesday, January 12, 2010
