Our cooling setup in one of the rooms is a bit quirky; based on a chilled water system (long story, but it was originally built for cooling a laser before we ended up with it). There's been a few blips with the water supply, so duely an engineer was dispatched to have a poke at it.
The 'poke' in this cases involved switching it off, until he could delve into the midsts of the machine, resulting in the rather exciting peak in temperatures (these measured using the on board thermal sensors in the worker nodes).
We were supposed to get a warning from the building systems when the chiller went offline, and again when the water supply temperature rose too high. (The air temperature lags behind the water temp, so it's a good early warning). As neither of those happened, our first warning was the air temperature in the room, followed by the nodes internal sensor alarms.
First course of action was to offline the nodes, and then find the cause of the problem. Once found, there was a short ... Explanation ... of why that was a Bad Time to switch off the chiller. We'll schedule some downtime to get it done later; at some point when we're not loaded with production jobs.
Still, little incidents like this are a good test for the procedures. Everything went pretty smoothly, from offlining nodes to stop them picking up new jobs, through to the defence in depth of multiple layers of monitoring systems.
Thankfully, we didn't need to do anything drastic (like hard powering off a rack); so we now know how long we have from a total failure of cooling until the effects kick in. Time to sit down and do some sums, to make sure we could handle a cooling failure at full load that occurs at 3am...
Update: 19/08/2010 by Mike
Never mind "sums", I took the physicist's approach a couple of years ago and got some real data:
Triangles (offset slightly along x-axis for clarity) are the temperatures of worker nodes as reckoned by IPMI; stars are input air temperatures to the three downflow units in room 141 and the squares are flow/return water temperatures. I simulated a total loss of cooling by switching the chilled water pump off; all worker nodes were operating at their maximum nominal load. It took ~20 minutes for the worker node temperatures to reach 40 degrees, at which point I bottled it and restored cooling. So, for good reason, we now run a script that monitors node temperatures, and has the ability to power them off once a temperature threshold is breached. Oh, and that has been tested in anger.