Saturday, March 22, 2008

Durham - SL4 Install Success!



Earlier this week Durham took the plunge and upgraded the CE, SE and all nodes to SL4.6... with success! After our preparation was delayed slightly by a small UPS failure, we set about installing cfengine to handle the fabric management. This took a little longer than expected, but our patience has paid off: it eases the pain of setting up and configuring clusters. We use the normal RedHat Kickstart to get a base install of SL4.6, then hand the rest of the setup to cfengine to work its magic (install extra packages, set up config files, run YAIM etc).

First, installing a Worker Node was relatively straightforward. Then came the CE, along with torque, PBS and the site BDII setup. Thanks to Graeme for helping to check that our site was working and publishing as expected.

We unexpectedly hit a firewall issue because I had renamed the CE from the old "helmsley.dur.scotgrid.ac.uk" to "ce01.dur.scotgrid.ac.uk", even though I had preserved the IP address. Our network guys were able to fix the rules and we were operational again.

The SE followed very quickly afterwards, with cfengine and YAIM working their magic very successfully. The procedure was as simple as 1) dump the database, 2) install SL4.6, 3) let cfengine do its stuff for a base install, 4) restore the database, 5) run YAIM. Simple!
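For anyone wanting to script the same sequence, here is a minimal sketch in Python of the five SE steps as an ordered plan that stops at the first failure, so a failed dump never leads on to the reinstall. The names `Step`, `run_plan` and `se_upgrade_plan` are made up for illustration, and the actions are placeholders that only print; the real commands for the dump, Kickstart, cfengine and YAIM are site-specific and are not shown here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One stage of the upgrade (hypothetical helper, not part of any grid tool)."""
    name: str
    action: Callable[[], bool]  # returns True on success, False on failure


def run_plan(steps: List[Step]) -> List[str]:
    """Run the steps in order, stop at the first failure, return the names completed."""
    completed: List[str] = []
    for step in steps:
        if not step.action():
            break
        completed.append(step.name)
    return completed


def placeholder(label: str) -> Callable[[], bool]:
    """Build a stand-in action that only prints; swap in the real site command."""
    def action() -> bool:
        print(f"[placeholder] {label}")
        return True
    return action


def se_upgrade_plan() -> List[Step]:
    """The five SE steps from the post, in the order they were carried out."""
    names = [
        "dump the database",
        "install SL4.6",
        "let cfengine do the base install",
        "restore the database",
        "run YAIM",
    ]
    return [Step(name, placeholder(name)) for name in names]


if __name__ == "__main__":
    plan = se_upgrade_plan()
    completed = run_plan(plan)
    print(f"{len(completed)} of {len(plan)} steps completed")
```

The ordering is the whole point: the dump has to succeed before the machine is reinstalled, and the restore has to be in place before YAIM runs.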

The one gotcha was trying to change the NFS-mounted home directories to be local to the nodes. This fails with an error when trying to copy the globus-cache-export files. Due to time constraints we have re-enabled the NFS home dirs... but I'm sure this will be simple to fix and I'll look at it next week.

Fair shares and queue times will need reviewing, but all in all it has been a busy and successful few days. We're passing SAM tests and I've seen Phenogrid, Atlas and Biomed running jobs. The UI and a disk server are still to do, but with cfengine in place this should be relatively straightforward and will require no downtime.
