Thursday, August 19, 2010

Why, yes ... we were using that...

So ... remind me never to do a 'nothing much happening' post again. It looks like tempting fate results in Interesting Times.

Our cooling setup in one of the rooms is a bit quirky: it's based on a chilled water system (long story, but it was originally built for cooling a laser before we ended up with it). There have been a few blips with the water supply, so duly an engineer was dispatched to have a poke at it.
The 'poke' in this case involved switching it off until he could delve into the midst of the machine, resulting in the rather exciting peak in temperatures (measured using the on-board thermal sensors in the worker nodes).

We were supposed to get a warning from the building systems when the chiller went offline, and again when the water supply temperature rose too high. (The air temperature lags behind the water temperature, so it's a good early warning.) As neither of those happened, our first warning was the air temperature in the room, followed by the nodes' internal sensor alarms.

First course of action was to offline the nodes, and then find the cause of the problem. Once found, there was a short ... Explanation ... of why that was a Bad Time to switch off the chiller. We'll schedule some downtime to get the work done later, at some point when we're not loaded with production jobs.
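The exact mechanics of offlining depend on the batch system; as a rough illustration, here's a minimal sketch assuming a Torque-style setup, where `pbsnodes -o` marks a node offline (running jobs continue, but no new ones are scheduled). The node names and the wrapper function are hypothetical, not our actual tooling.

```python
import subprocess

def drain_nodes(nodes, dry_run=False):
    """Mark worker nodes offline so they stop picking up new jobs.

    Assumes a Torque batch system, where `pbsnodes -o <node>` offlines
    a node. With dry_run=True, just return the commands that would run.
    """
    commands = [["pbsnodes", "-o", node] for node in nodes]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands

# Hypothetical node names; dry run only:
cmds = drain_nodes(["wn001", "wn002"], dry_run=True)
```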

Still, little incidents like this are a good test of the procedures. Everything went pretty smoothly, from offlining nodes to stop them picking up new jobs, through to the defence in depth of multiple layers of monitoring systems.

Thankfully, we didn't need to do anything drastic (like hard powering off a rack), so we now know how long we have from a total failure of cooling until the effects kick in. Time to sit down and do some sums, to make sure we could handle a cooling failure at full load that occurs at 3am...
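For flavour, here's the sort of back-of-envelope sum involved; every number below is an illustrative assumption, not a measurement of our room. The punchline is that the air alone buys you well under a minute; the much longer grace periods seen in practice come from the thermal mass of the kit, the walls and the residual chilled water in the loop, which is exactly why real data trumps sums here.

```python
# Back-of-envelope: how fast does room air heat up if cooling fails?
# All figures are illustrative assumptions, not measured values.
room_volume_m3 = 100.0  # assumed room size
air_density = 1.2       # kg/m^3, air at roughly room temperature
cp_air = 1005.0         # J/(kg.K), specific heat capacity of air
it_load_w = 50_000.0    # assumed full-load heat output of the racks
delta_t = 15.0          # K of rise before nodes hit alarm thresholds

air_mass = room_volume_m3 * air_density            # kg of air in the room
seconds = air_mass * cp_air * delta_t / it_load_w  # t = m * c_p * dT / P
print(round(seconds, 1))  # ~36 seconds: air alone is no buffer at all
```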

Update: 19/08/2010 by Mike

Never mind "sums", I took the physicist's approach a couple of years ago and got some real data:

Triangles (offset slightly along x-axis for clarity) are the temperatures of worker nodes as reckoned by IPMI; stars are input air temperatures to the three downflow units in room 141 and the squares are flow/return water temperatures. I simulated a total loss of cooling by switching the chilled water pump off; all worker nodes were operating at their maximum nominal load. It took ~20 minutes for the worker node temperatures to reach 40 degrees, at which point I bottled it and restored cooling. So, for good reason, we now run a script that monitors node temperatures, and has the ability to power them off once a temperature threshold is breached. Oh, and that has been tested in anger.
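The decision logic in such a script is simple enough to sketch; in the real thing the temperatures would be read from each node's IPMI sensors (e.g. via `ipmitool`) and the power-off issued remotely, whereas here the readings are passed in directly, and the threshold and node names are assumptions for illustration.

```python
POWER_OFF_THRESHOLD_C = 40.0  # assumed threshold; site policy sets the real one

def nodes_to_power_off(node_temps):
    """Given {node: temperature in C}, return the nodes at or over threshold.

    In the real script the readings would come from each node's IPMI
    sensors; here they are supplied directly for illustration.
    """
    return sorted(n for n, t in node_temps.items() if t >= POWER_OFF_THRESHOLD_C)

# Hypothetical readings:
readings = {"wn001": 34.0, "wn002": 41.5, "wn003": 40.0}
print(nodes_to_power_off(readings))  # -> ['wn002', 'wn003']
```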

Business as unusual

There's been a lot of little things happening up here; individually, none of them quite big enough to blog about.

And after a while, it's worth doing a catch-up post about them. This is that post.

David started a couple of weeks ago, and Mark starts on Monday, just in time for the GridPP meeting. It seems to be a tradition that every time we get new hardware, the staff rotate; Dug and I started just around the last hardware upgrade.

The hardware this time is mostly a petabyte of storage to be added, so David's been working on ways of testing the disks before we sign off on them.

GridPP is next week: the usual round of site reports and future planning. With data from the LHC now a routine matter, it's time to start thinking about future needs. I'll be talking about non-(particle)-physicists on the Grid, as a nod towards the longer-term EGI picture.

We noticed some load balancing issues on our SL5 disk pool nodes; Sam's been poking at that, and it looks like there's a mix of issues, from filesystem type (ext4 is performing better than xfs here) to the clustering of files onto nodes.

And that's most of the interesting stuff from up here. Hopefully we'll have more to post about over the next few months.