Wednesday, December 21, 2011

Batch system juggling

We've been a bit quiet up here recently. This is normally a sign of either nothing interesting happening, or entirely too many interesting things happening. Opinions on that may divide, but I think it's closer to the latter...

One of the recent bits of fun involved our batch server. This story actually starts a long time ago, about this time last year. At that point, we started to get intermittent memory errors from the Torque server - corrected by ECC, but that's generally a sign that the RAM is about to fail. Given that the batch server is a single point of failure for a site, that's not a good thing.

So I spent some time preparing a spare box, ready to move the batch system over in case it failed over the winter break. After all that prep, it didn't fail, and the errors stopped. On the expectation that the current hardware was nearing the end of its life, we ordered a new box early this year, and it has been sitting in a machine room for a while.

Unfortunately we didn't get time to have it running as a tested batch system before our power supply started to ... well, insert a colourful metaphor here for the 8 months during which we were affected by a lack of power.

Power returned to a stable supply in September, and so we set about catching up on things. One of the things we got around to was software versions. Whilst we didn't intend to update the Torque version, and managed to avoid it for a bit, the gLite developers eventually managed to sneak the update past us as part of an ordinary gLite update. Strictly, this didn't affect the batch server, just all the CEs, making them incompatible with the previous version of Torque.

Whilst a clever manoeuvre, reminiscent of Odysseus' Pony, it did leave us with a conundrum: either revert the gLite update, or run forward with it. Neither was an option of good character, but running forward did have some actual documentation; hence it was full speed ahead.

Which worked out well enough. The Torque 2.5.7 packages were set to use Munge, so getting that installed and tested as a first step helped things go smoothly. To preserve compatibility in file locations, we used /etc/sysconfig/pbs_mom to put the pbs working directories in the same place as previously - meaning we didn't have to reconfigure any other tools.
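
For what it's worth, the Munge side of that is only a few commands, and the sysconfig override is a one-liner. A rough sketch only: the node name is made up, and the exact variable the pbs_mom init script reads from /etc/sysconfig/pbs_mom is an assumption here rather than a statement about those packages.


# Munge: install it, generate a key on the batch server, and push the
# same key to every host that needs to authenticate against it.
yum install -y munge
create-munge-key                        # writes /etc/munge/munge.key
scp /etc/munge/munge.key node101:/etc/munge/
ssh node101 "service munge start"

# Keep the pbs working directories where they always were, so nothing
# else needs reconfiguring. (PBS_HOME is assumed to be what the init
# script honours; pbs_mom's -d flag is the underlying mechanism.)
echo 'PBS_HOME=/var/spool/pbs' >> /etc/sysconfig/pbs_mom
service pbs_mom restart
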

What didn't go so smoothly was the memory leak in the server.

Which gave it a runtime of around 36 hours between crashes. Actually, not even crashes - we found that the pbs_server process hit either


12/05/2011 10:19:12;0080;PBS_Server;Req;req_reject;Reject reply code=15012(PBS_Server System error: No child processes MSG=could not unmunge credentials), aux=0, type=AlternateUserAuthentication, from tomcat@svr021.gla.scotgrid.ac.uk

or

10/29/2011 18:11:24;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Cannot allocate memory (12) in send_job, fork failed


and then sat around moaning. Had it crashed hard, then the auto-restart would have caught it. Ho, hum, one for the Fast Fail philosophy there.
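
What would have caught it is a watchdog that checks whether the server still answers queries, rather than just whether the process exists. A minimal sketch of the idea, run from cron - the logging and restart method here are illustrative, not a record of what we had in place:


#!/bin/sh
# Crude pbs_server watchdog: the process can be 'up' but rejecting or
# failing all requests, so test whether it actually answers a status
# query rather than checking the process table.
if ! qstat -B > /dev/null 2>&1
then
    logger -t pbs_watchdog "pbs_server not answering qstat -B, restarting"
    service pbs_server restart
fi
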


By this point, my proof reader is pointing out that I started off talking about hardware, and am now talking about software. The punchline is that the new server we never got a chance to use has a lot more RAM than the old one. Therefore we wanted to move the batch server from the old hardware to the new, to give it a lot more RAM headroom. That won't fix the memory leak, but it will mitigate the problem a bit.

Conventionally, this would involve draining the cluster, repointing the CEs and then starting everything up again. Had we done that, this blog post would be over now.

Instead, we did a rolling update. This let us move things over without having to do a full drain. The biggest problem with a full drain is that, while most of the jobs finish well within the time limit, there are always some that take the full duration. This leaves us with an empty cluster, doing nothing for 24 hours or so, waiting on a couple of jobs to finish.

So, instead, by moving things in small batches, we can keep most of the nodes working, and thus get more work out of things. Step zero is to disable cfengine, otherwise it tends to try and 'fix' things part way through.
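
Exactly how you switch cfengine off depends on how it's run at a site; if it runs as the stock cfexecd daemon, something along these lines does it (the service name is an assumption about the setup, not a record of ours):


# Stop cfengine 'correcting' the config part way through the move, and
# make sure a reboot doesn't quietly bring it back.
service cfexecd stop
chkconfig cfexecd off
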

Step one is to drain a CE, which we did over a weekend, and a small number of nodes, which we put offline on the Sunday morning.
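
Offlining nodes is a single pbsnodes call; for example (node names made up):


# Mark nodes offline: running jobs finish, but nothing new gets
# scheduled on them. pbsnodes -c reverses it if a node is needed back.
pbsnodes -o node101 node102 node103
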

Come Monday, I set up and tested basic operations with the new batch server, and then moved the freed-up nodes across to it. Once those were tested (which shook out a couple of issues with the versioning of some libs), we pointed the CE at the new batch server and ran a test job through it. (It turns out that Atlas are fast enough to sneak some pilots through a 2 minute window intended for a test job. However, there were only a few, so they actually functioned as effective tests, without compromising the site if they failed.)
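
The test jobs themselves were nothing elaborate; something of this shape is enough to prove the path onto the new server works (the queue and job names here are illustrative):


# Submit a trivial job through the newly re-pointed setup and check
# where it lands.
echo "hostname; sleep 60" | qsub -N migtest -q dteam
qstat -an        # -n shows which node each job was given
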

After that, it was time to offline another CE, and then some more nodes, and to start moving nodes over as they emptied. In the end I scripted this:


#!/bin/sh
# Move an (empty) worker node from the old batch server to the new one.
# Takes the node name as its only argument.

NODE=$1
FROM=svr666
TO=svr999

if [ -z "${NODE}" ]
then
    echo "Usage: $0 nodename"
    exit 1
fi

# Refuse to move a node that still has jobs running on it.
RUNNING=$(qstat -n -1 | grep -c "${NODE}")

if [ "${RUNNING}" != "0" ]
then
    echo "${NODE}: Still ${RUNNING} jobs going, skipping"
    exit 2
fi

# Pull the core count out of the old server's definition of the node.
CORES=$(qmgr -c "print node ${NODE}" | grep "np = " | cut -d= -f2 | tr -d ' ')

echo "${NODE}: Moving to ${TO} with ${CORES} cores"

# Define the node on the new server first...
ssh ${TO} "~/addNode.sh ${NODE} ${CORES}"

# ...then switch the mom config over and restart it...
ssh ${NODE} "service pbs_mom stop"
scp config.mom.svr666 ${NODE}:/var/spool/pbs/mom_priv/config
ssh ${NODE} "service pbs_mom start"

# ...and finally remove the node from the old server.
ssh ${FROM} "~/deleteNode.sh ${NODE}"


In theory one can run qmgr remotely, rather than ssh-ing to the batch servers and running a script. In practice, with the different versions of Torque, I couldn't get that to work. Note the automation of the mom config switch as well; and that this script checks that the node is empty.
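
For completeness, addNode.sh and deleteNode.sh need be nothing more than thin wrappers around qmgr - a sketch of what they could look like, rather than necessarily a copy of ours:


#!/bin/sh
# addNode.sh - define a worker node on this batch server.
# Arguments: node name, core count.
qmgr -c "create node $1"
qmgr -c "set node $1 np = $2"


and deleteNode.sh is just the reverse:


#!/bin/sh
# deleteNode.sh - remove a worker node from this batch server.
qmgr -c "delete node $1"
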

This reduced the gradual move of nodes to a process of croning the script, and offlining nodes occasionally.
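
The croning part is equally dull: something like the entry below, where the script name, node list and log file are illustrative rather than our actual paths.


# Every half hour, try to move each offlined node in the list; the
# script above skips any node that still has jobs running on it.
*/30 * * * * for n in $(cat /root/nodes-to-move); do /root/movenode.sh $n; done >> /root/movenode.log 2>&1
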

The net result was that we were operating at around 80% capacity for 48 hours, and it was all rather uneventful - in a good way. The final step was to update cfengine config and re-enable it.

One of the plus points of the above script is that it should be simple to adapt to two distinct batch systems, which means that if we end up moving away from Torque, we should be able to do that without downtime too.
