Monday, March 12, 2007

Glasgow Update to gLite 3.0r16

This was the first gLite update with significant component changes:

* DPM was upgraded to 1.6.3, with a schema change and a new SRM v2.2 daemon.
* Torque and Maui were upgraded to v2.1.6 and v3.2.6, respectively, replacing the ancient LCG versions.

The DPM upgrade I tackled first. This went fine, and I blogged about it on the storage blog. Be careful to take a database dump before you try it, in case things go wrong. There's also a strong warning against running automatic updates on gLite server nodes - this is not supported, and some people are reporting DPM database corruption on LCG-ROLLOUT.
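
For anyone who wants to script the pre-upgrade dump, here is a minimal Python sketch. It only builds the mysqldump command line. The database names dpm_db and cns_db, the output directory and the file naming are assumptions for illustration, so check them against your own DPM head node. Credentials are assumed to come from a MySQL option file such as ~/.my.cnf.

```python
import datetime
from pathlib import Path


def dpm_dump_command(out_dir, databases=("dpm_db", "cns_db"), now=None):
    """Build a mysqldump command that writes a timestamped dump file.

    The default database names are an assumption about a typical DPM
    install; pass your own if they differ.
    """
    now = now or datetime.datetime.now()
    out_file = Path(out_dir) / f"dpm-pre-upgrade-{now:%Y%m%d-%H%M%S}.sql"
    return [
        "mysqldump",
        "--single-transaction",  # consistent snapshot if the tables are InnoDB
        "--databases", *databases,
        f"--result-file={out_file}",
    ]
```

The returned list can be handed to subprocess.run(cmd, check=True), so a failed dump stops the upgrade script before any packages are touched.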

The torque/maui upgrade I was a little nervous about, as I don't feel I greatly understand these components. However, Steve T had said that minor upgrades are ok (we'd been using the Steve T build for the cluster since the start, so we were already on torque v2), so I took the plunge. First I updated a single worker and restarted pbs_mom, to make sure the 2.1.6 mom didn't have trouble talking to the 2.1.5 server - and it didn't. So then I updated all the WNs, before turning my attention to the server.
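
The "one worker first, then the rest" approach can be sketched in Python as below. This is only an illustration of the pattern, not what I actually ran: upgrade and healthy are hypothetical callables you would supply yourself (for example, an ssh wrapper around the yum update and pbs_mom restart, and a check that the node still reports to the server).

```python
def staged_upgrade(nodes, upgrade, healthy):
    """Upgrade the first node as a canary, then the rest if it checks out.

    `upgrade` and `healthy` are caller-supplied (hypothetical) functions
    that each take a node name.
    """
    if not nodes:
        return []
    canary, rest = nodes[0], nodes[1:]
    upgrade(canary)
    if not healthy(canary):
        # Stop here so the remaining workers stay on the old version.
        raise RuntimeError(f"canary {canary} failed its check")
    for node in rest:
        upgrade(node)
    return [canary, *rest]
```

If the canary fails, the exception leaves every other worker on the old version, which is the whole point of testing on one node first.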

Here, I did the usual yum -y update first. Then I restarted pbs_server and maui. pbs_server didn't restart cleanly, claiming something was bound to the port. I had a look, but by the time I did there was nothing - I think it was the server being sluggish to exit. So pbs_server then (re)started fine, and I did an extra maui restart to be on the safe side.
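
If the old server really is just slow to let go of its port, a restart script can wait for the port to clear before starting the new one. Here is a minimal Python sketch of that idea. It tests whether anything is still accepting TCP connections on the port, which is an assumption about what the "bound to the port" error meant in my case. The port number is a parameter: 15001 is the usual pbs_server default, but check your own configuration.

```python
import socket
import time


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        return s.connect_ex((host, port)) == 0


def wait_for_port_free(port, host="127.0.0.1", timeout=30.0, interval=0.5):
    """Poll until nothing is listening on the port; False if we time out."""
    deadline = time.monotonic() + timeout
    while port_in_use(port, host):
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

A restart wrapper would stop pbs_server, call wait_for_port_free(15001), and only then start the new daemon.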

A basic check of the batch system (pbsnodes, qstat, diagnose) looked ok.
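
The pbsnodes part of that check is easy to automate. The sketch below counts nodes per state from pbsnodes -a style text. It assumes the usual layout, with an indented "state = ..." line under each node name, and treats a combined state such as "down,offline" as a single label.

```python
from collections import Counter


def node_states(pbsnodes_output):
    """Count nodes per state from `pbsnodes -a`-style text."""
    counts = Counter()
    for line in pbsnodes_output.splitlines():
        key, sep, value = line.strip().partition(" = ")
        if sep and key == "state":
            counts[value.strip()] += 1
    return counts
```

Feed it the output of subprocess.run(["pbsnodes", "-a"], capture_output=True, text=True).stdout and compare the counts before and after the upgrade.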

I have commented Steve's repository out of my yum.conf - we'll now use the "official" gLite build on Steve's advice.

N.B. I still intend to manage the batch system using cfengine, not YAIM - it's a lot more flexible for us to do this, e.g., making the new routing queue the default one.
