Fool that I am, I opened my laptop after getting back from Gairloch on Saturday night. As I now have Paul's MonAMI torque plots on my Google homepage, I could see that the number of running jobs was down to almost zero. This was unexpected. A quick look through the SAM pages and monitoring plots showed that the pheno job storm on Thursday had killed the CE off, big time. Being thoroughly disinclined to engage in extensive debugging late on a Saturday night, and knowing the machine needed a new kernel anyway, I rebooted the CE. Whatever the residual problem was, this cleared it. Within minutes LHCb jobs were coming in and starting properly; indeed, within about 6 hours they had managed to fill the entire cluster again.
The remaining problem was then timeouts on the CE-RM test. This was puzzling, but not critical for the site, so I left things as they were, pending further investigation. Then I recalled that this can happen when a period of intense DPM stress (Greig and Billy have been preparing the CHEP papers) causes threads in DPM to lock up or block, leaving few threads available to service requests. I restarted the DPM daemon and bingo! All was well again. The next time this happens we should look in MySQL for pending requests (which get cleared when DPM restarts); however, at 10pm on a Sunday night, getting things working quickly is all I care about.
My car broke down yesterday too, so I'm off to Halfords to buy it some new brake pads, but at least that happened after I came back from holiday. If only the site had the good grace to do the same. I think I get to be grumpy about this for at least two days.