Monday, April 20, 2009

Victim of our own success

I was wondering why Glasgow was not getting more activated ATLAS production jobs and eventually tracked it down to our cache disk area, ATLASPRODDISK, being almost full, with only 170GB of free space left. Panda was very sensibly not sending us more jobs until we had somewhere to put the outputs!

A quick whirl with dpm-updatespace later and I had increased PRODDISK from 2TB to 5TB, which should see us good.
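For the record, the resize is a two-command job on the DPM head node. A minimal sketch from memory - the flag names are as I recall them, so check dpm-updatespace --help before trusting this:

dpm-getspacetokens   # list the space tokens; find the one backing ATLASPRODDISK
dpm-updatespace --space_token <token-id> --gspace 5T   # grow the reservation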

I also discovered that Durham was missing some ATLAS releases, which was why they were missing out on ATLAS jobs today - installations now triggered.

Thursday, April 16, 2009

activating a VO on a WMS in four easy steps

We have been asked to add support for the camont VO to our WMS.
Since we only support a limited number of VOs through our WMS, here is the recipe for adding a new one in the smallest number of steps.

1. edit the yaim config to add the new VO: vim /opt/glite/yaim/etc/services/glite-wms (see the sketch after this list for the kind of variables involved)
2. /opt/glite/yaim/bin/yaim -r -s /opt/glite/yaim/etc/site-info.def -n glite-WMS -f config_vomsmap
3. /opt/glite/yaim/bin/yaim -r -s /opt/glite/yaim/etc/site-info.def -n glite-WMS -f config_vomses
4. /opt/glite/yaim/bin/yaim -r -s /opt/glite/yaim/etc/site-info.def -n glite-WMS -f config_glite_wms
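For step 1, the VO definition itself typically lives in site-info.def (or a vo.d fragment) rather than the services file. A sketch of the sort of thing required - the VO name is real, but the VOMS host, port and DNs here are purely illustrative, so take the authoritative values for camont from the operations portal:

VOS="$VOS camont"
VO_CAMONT_VOMS_SERVERS="'vomss://voms.example.ac.uk:8443/voms/camont?/camont'"
VO_CAMONT_VOMSES="'camont voms.example.ac.uk 15000 /C=UK/O=eScience/CN=voms.example.ac.uk camont'"
VO_CAMONT_VOMS_CA_DN="'/C=UK/O=eScienceCA/CN=UK e-Science CA'"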

Wednesday, April 08, 2009

torque, consumable resources and glite

I finally got round to installing a torque/maui instance on my pre-production cluster at Glasgow. This completes the mini cluster, as I had previously installed an SL5 gLite 3.2 worker node before going to the GridPP collaboration meeting. I will update this blog with a wiki page on what was involved with Torque; there were quite a few gotchas in getting it running, especially when you have to reconfigure all your nodes away from the production batch system, but I got there in the end.

This will be a great help to us for experimentation, as we have installed some optics software called Lumerical on our cluster that requires licensing similar to MatLab. This actually works quite nicely when you follow the docs: all you need is to amend your qsub to request a consumable resource, i.e. a licence. However, it doesn't work so well when you are using EGEE middleware, as you now have three levels of indirection: wms (jdl), lcg-ce (rsl)/cream, torque (qsub). So I think we may need to hack the job manager on our CEs in order to hand-craft the consumable resource. This then begs the question: do you do it based on the VO, meaning one VO per software application (not very flexible), or based on something else that you pass from the JDL, or a combination of both? Some experimenting required.
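The plain local end of this is straightforward. A sketch of what I mean by a consumable resource, assuming the licences are declared as a floating Maui generic resource - the resource name and count are our choice, and the syntax is from memory of the Maui GRES docs (floating resources may need a recent Maui/Moab), so double-check against your version:

# maui.cfg: declare four floating Lumerical licences as a generic resource
NODECFG[GLOBAL] GRES=lumerical:4

# job submission: ask Maui to hold one licence for this job
qsub -W x=GRES:lumerical run_lumerical.sh

The hard part is getting that -W option generated by the wms/ce chain, which is where the job manager hacking would come in.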

cream broken pipes

I was just updating our pre-production cream set-up to test it against the newly installed pre-production torque instance, and it ceased to submit any jobs.

So if you ever find, when submitting a job to cream, that you see the following...

2009-04-07 16:11:31,265 FATAL - MethodName=[jobRegister] Timestamp=[Tue 07 Apr 2009 16:11:31]
ErrorCode=[0] Description=[system error] FaultCause=[cannot write the job wrapper (jobId = CREAM600033116)!
The problem seems to be related to glexec which reported: Broken pipe]


It looked like re-running yaim on the node had re-configured something incorrectly. Checking /var/log/messages it actually looked like glexec could no longer write to a log file.

dev011:/var/log# tail -f messages
Apr 7 16:40:55 dev011 glexec[11697]: Error in LCAS/LCMAPS, rc = 107
Apr 7 16:40:55 dev011 glexec[11697]: LCAS failed, see '/var/log/glite/glexec_lcas_lcmaps.log' for more info.
Apr 7 16:43:33 dev011 glexec[12065]: glexec pid: 12065
Apr 7 16:43:33 dev011 glexec[12065]: lcas_log_open(): Cannot open logfile /var/log/glite/glexec_lcas_lcmaps.log


It appears that cream has a default glexec log location set in glexec.conf, which is either /opt/glite/var/log/glexec_lcas_lcmaps.log or /var/log/glite/glexec_lcas_lcmaps.log - re-running yaim must have switched it from one to the other!

This directory must exist or else cream will not start! Something to remember in future!
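The quick fix was along these lines - the glexec.conf path is what I believe the gLite default to be, so treat it as an assumption to verify on your own node:

# check which log file glexec is actually configured to use
grep -i log /opt/glite/etc/glexec.conf

# then make sure the directory it points at exists
# (and is writable by whatever glexec maps the user to)
mkdir -p /var/log/glite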

how to break your worker nodes in one easy step!

A short tale of how to break your cluster in an almost untraceable manner.

At ScotGrid we have a group of local users called nanocmos. The nanos have been trying to get afs working with gssklog for some time, so in an effort to help them out I have been investigating what is required on our end to get gssklog working. Mike had previously installed afs on our worker nodes and this has worked for a while now. Our local nanocmos user had been reporting the following missing library: libglobus_gssapi_gsi_gcc64dbgpthr.so.0.

After some digging it appeared that although the ScotGrid machines are 64bit, we only had the 32bit worker node packages installed, and subsequently only the 32bit version of vdt_globus_essentials. So the solution looked like installing the 64bit version of the vdt_globus_essentials package on the UI and worker nodes. This was carried out in our pre-production environment, job submission was successful after the install, and the change was then rolled out into production.

What happened next... well, first off we had a power cut, which masked anything that may have happened immediately. When we were back on line we started failing the replica management (lcg-utils) parts of the CE SAM tests.

After many hours of head scratching, we decided to roll back the change that I had rolled out earlier in the day.

rpm -e --nodeps vdt_globus_essentials-VDT1.6.1x86_64_rhas_4-7.x86_64
rpm -i http://master.beowulf.cluster/gLite/R3.1/generic/sl4/i386/RPMS.updates/vdt_globus_essentials-VDT1.6.1x86_rhas_4-6.i386.rpm

et voila, this fixed the problem.

It appears that the vdt_globus_essentials 64bit rpm has the 32bit libraries included as well. Therefore, when I installed it, this overwrote the currently installed 32bit versions. A quick md5sum later showed the two 32bit versions to be different - and that difference broke lcg-utils!
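With hindsight, a quick look at the two packages' file lists would have shown the clash before anything was installed (package names taken from the commands above):

rpm -qlp vdt_globus_essentials-VDT1.6.1x86_64_rhas_4-7.x86_64.rpm | sort > 64bit.lst
rpm -qlp vdt_globus_essentials-VDT1.6.1x86_rhas_4-6.i386.rpm | sort > 32bit.lst
comm -12 64bit.lst 32bit.lst   # paths shipped by both rpms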

So the moral of the story, for me anyway, and a lesson learned: don't trust the contents of an rpm when you are installing different versions of libraries onto a system. The route I should have taken was to unpack the rpm:

rpm2cpio vdt_globus_essentials-VDT1.6.1x86_64_rhas_4-7.x86_64.rpm | cpio -idmv --no-absolute-filenames

and create a wrapper script for gssklog. This appends the unpacked 64bit libraries onto the LD_LIBRARY_PATH, and now it all works (well, it would if I actually was a registered user on their afs server).
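The wrapper is only a few lines; something like this, where the unpack location is illustrative (wherever you ran rpm2cpio) and gssklog.bin is the afs-served binary shown below:

#!/bin/sh
# add the unpacked 64bit globus libraries, then run the real binary
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/vdt-unpack/opt/globus/lib
exec /afs/nesc.gla.ac.uk/software/amd64_linux26/afs/bin/gssklog.bin "$@"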

-bash-3.00$ /afs/nesc.gla.ac.uk/software/amd64_linux26/afs/bin/gssklog.bin -server jupiter.nesc.gla.ac.uk
Unable to get token: code = 1: Unable to map to AFS user.