Monday, January 28, 2008

Resource Broker Blues

Our resource broker was down for the weekend as the network service stalled. Root cause turned out to be a bit of over-aggressive cleaning from cfengine. I had wanted to do a better job of cleaning up the /tmp area in the cluster - each worker node had hundreds of condor_g working directories lying around, with nothing in them. cfengine's "tidy" leaves directories alone by default and only cleans files. So I enabled the "rmdirs=sub" option - works beautifully, gets rid of all the cruft in /tmp. So pleased was I that I disengaged my brain and set this option on for /home as well - good idea to clean up those old gass cache areas, isn't it? Well, almost - unfortunately /home has subdirs which are the node pool account home areas, and unused pool accounts fall into the "clean me up" category. All the untouched pool areas then vanished.
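With hindsight, a dry run over /home would have shown the problem before anything was deleted. Below is a minimal Python sketch of that idea. It is not cfengine's actual logic, only a simplified model of the behaviour described above: files older than a cutoff are treated as removed, then any subdirectory left with nothing in it is treated as removed too, while the top-level directory is always kept. The function name `dirs_tidy_would_remove` and its parameters are made up for illustration, and modification time is used as a stand-in for whatever timestamp the real tidy rule compares.

```python
import os
import time


def dirs_tidy_would_remove(root, max_age_days, now=None):
    """Dry run: list subdirectories of root that would end up deleted.

    Simplified model (an assumption, not cfengine's code): a file is
    'tidied' if its mtime is older than max_age_days, and a directory
    is removed once nothing inside it survives.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    top = os.path.abspath(root)
    doomed = set()

    # Walk bottom-up so children are judged before their parents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        if os.path.abspath(dirpath) == top:
            continue  # the top-level directory itself is always kept

        files_survive = any(
            os.lstat(os.path.join(dirpath, name)).st_mtime >= cutoff
            for name in filenames
        )
        subdirs_survive = any(
            os.path.join(dirpath, name) not in doomed for name in dirnames
        )
        if not files_survive and not subdirs_survive:
            doomed.add(dirpath)

    return sorted(doomed)
```

Running something like `dirs_tidy_would_remove("/home", 30)` and reading the list before switching the real option on would have flagged every idle pool account home area as a candidate for removal. Nothing is deleted by this sketch; it only reports.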

As a result, a number of people started getting "unspecified grid manager errors" on globus-job-runs, and the edguser home area on the RB was wiped out, which sent the network server into crisis.

It didn't take long to work out what had happened, but fixing it took a while as the resource broker seemed to be quite huffy afterwards.

The only plus side was that I enabled the mice, scotgrid and nanocmos VOs on the RB.

