Monday, February 25, 2008
Funeral March for the Lost CE
So, here's the post mortem on the CE hard crash on Wednesday last. About 1700 the load on the CE started to ramp up and it quickly rose to almost 100. I could see this happening just as I was about to go home (typical!) so I started to indulge in a frantic bout of process killing to reduce load and bring the CE back under control. However, despite my best efforts, the CE crashed hard at 1800 (gap in the ganglia plot).
When the machine rebooted, the gatekeeper restarted and again the load began to rise. I then went through a frantic couple of hours trying to do everything I could to reduce the load and try and get the CE back on an even keel - this was made very hard by the fact that with load averages quickly rising to 60+ the machine was extremely sluggish.
I shut down R-GMA and turned off the mail server, to no avail. I killed off queued jobs in the batch system, even got as far as disabling VOs, and banning users whose jobs I had cancelled. I even got so desperate as to firewall the gatekeeper from all but the ScotGrid RB! But although I could slow down the load increase by doing this, by 10pm it became clear that something dreadful had happened to the gatekeeper. Every gatekeeper process which was forked stalled, consuming CPU and managing to do absolutely nothing. As there was no response, the RB then contacted the CE again, forking off another gatekeeper, and the march to death continued. If I reduced the number of users able to contact the CE this slowed down the rate of resource exhaustion, but could not stop it. Clearly something utterly evil had happened to the gatekeeper state.
At this point I became convinced that nothing could be done to save the remaining queued or running jobs and that the site was going down. I started to think instead about moving our March downtime forwards, to do the SL4 upgrades, and to prise the CE and the batch system apart. And of course, that is just what we did at the end of last week.