Wednesday, April 08, 2009

how to break your worker nodes in one easy step!

A short tale of how to break your cluster in an almost untraceable manner.

At ScotGrid we have a group of local users called nanocmos. The nanos have been trying to get afs working with gssklog for some time, so in an effort to help them out I have been investigating what is required on our end to get gssklog working. Mike had previously installed afs on our worker nodes and this has worked for a while now. Our local nano user had been reporting a missing library. After some digging, it appears that although the ScotGrid machines are 64bit, we only had the 32bit worker node packages installed, and consequently only the 32bit version of vdt_globus_essentials. So the solution looked like installing the 64bit version of the vdt_globus_essentials package on the UI and worker nodes. This was carried out in our pre-production environment, and job submission was successful after the install. It was then rolled out into production.

What happened next... well, first off we had a power cut. This masked anything that may have happened immediately. When we were back online we started failing the replica management tests that use lcg-utils in the CE SAM tests.

After many hours of scratching our heads, we decided to roll back the change that I had rolled out earlier in the day:

rpm -e --nodeps vdt_globus_essentials-VDT1.6.1x86_64_rhas_4-7.x86_64
rpm -i http://master.beowulf.cluster/gLite/R3.1/generic/sl4/i386/RPMS.updates/vdt_globus_essentials-VDT1.6.1x86_rhas_4-6.i386.rpm

Et voilà, this fixed the problem.

It appears that the vdt_globus_essentials 64bit rpm also includes 32bit libraries. Therefore, when I installed the rpm it overwrote the currently installed 32bit versions. A quick md5sum later showed the two 32bit versions to be different! This broke lcg-utils!
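A check along these lines might have caught the clash before the install. It is only a sketch, assuming the rpm has already been unpacked into a scratch directory (for example with rpm2cpio) rather than installed. The function name find_changed_files and its two directory arguments are made up for illustration.

```shell
# Hypothetical helper: list files that exist in both trees but differ in content.
# $1 = scratch directory where the rpm was unpacked
# $2 = root of the installed system to compare against (e.g. /)
find_changed_files() {
    unpacked=$1
    root=$2
    # Walk every regular file in the unpacked tree, in a stable order.
    (cd "$unpacked" && find . -type f) | sort | while IFS= read -r rel; do
        rel=${rel#./}
        installed="$root/$rel"
        # Files that are not installed yet cannot be overwritten, so skip them.
        [ -f "$installed" ] || continue
        new_sum=$(md5sum < "$unpacked/$rel" | cut -d' ' -f1)
        old_sum=$(md5sum < "$installed" | cut -d' ' -f1)
        # Print only the paths whose checksums disagree.
        if [ "$new_sum" != "$old_sum" ]; then
            echo "$rel"
        fi
    done
}
```

Called as `find_changed_files ./unpacked /`, any path it prints is a file the package would replace with different contents.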

So the moral of the story, for me anyway, and a lesson learned: don't trust the contents of an rpm when you are installing different versions of libraries onto a system. The route I should have taken was to unpack the rpm:

rpm2cpio vdt_globus_essentials-VDT1.6.1x86_64_rhas_4-7.x86_64.rpm | cpio -idmv --no-absolute-filenames

and create a wrapper script for gssklog. The wrapper appends the 64bit libraries to the LD_LIBRARY_PATH, and now it all works (well, it would if I were actually a registered user on their afs server):

-bash-3.00$ /afs/ -server
Unable to get token: code = 1: Unable to map to AFS user.
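The wrapper idea can be sketched roughly as follows. This is not the actual script we deployed, just a minimal illustration; the function name run_with_extra_libs and the library directory passed to it are assumptions.

```shell
# Hypothetical wrapper helper: run a command with an extra library directory
# appended to LD_LIBRARY_PATH, leaving the caller's environment untouched.
# $1 = directory holding the unpacked 64bit libraries; the rest = the command.
run_with_extra_libs() {
    libdir=$1
    shift
    (
        # ${VAR:+...} avoids a stray leading ':' when the variable is unset
        # or empty (an empty entry would mean "current directory").
        LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$libdir"
        export LD_LIBRARY_PATH
        exec "$@"
    )
}
```

Because the change happens inside a subshell, only the wrapped command sees the extra directory, and the 32bit libraries the rest of the system relies on stay exactly as installed.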
