Tuesday, July 28, 2009

sl5 workers murmurings

Well, we have had an sl5 test cluster for a while now, but it has never really seen any action other than the odd random hello world job from testing with cream and the like.
However, as the great sl5 debate raged on we put ourselves forward as a test site for atlas, along with Oxford (another site with an sl5 cluster), for installs of SL4 and future SL5 versions of the atlas software.

To get my development CE visible to the real world I had to add it to our site bdii. This then attracted ops and dteam jobs, as by default they were allowed through the CE. No great shakes, and it was actually good, as it identified problems that had not been seen with simple hello world jobs from within ScotGrid.

The first mishap was a networking issue where the jobs could arrive but couldn't get their job wrapper and payload, as most of our workers are NAT'd, except my development one. A simple fix once we worked out what was wrong.

Two other problems were encountered. Firstly, the CE-sft-lcg-rm-free test went into a warn state, as the glite-WN package no longer pulls in ldapsearch. This is fixed by installing openldap-clients from sl-base.
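A quick way to spot this on a worker before the test does is to look for the binary yourself. The sketch below is only illustrative: check_tool is a made-up helper name, and the yum command in its message assumes the sl-base repo is already configured on the node.

```shell
# check_tool is a hypothetical helper, not part of gLite.
# Usage: check_tool <command> <package that provides it>
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found"
    else
        echo "$1 missing: try 'yum install $2'"
        return 1
    fi
}

# On a worker node: check_tool ldapsearch openldap-clients
```

The function returns non-zero when the command is absent, so it can be dropped into a node health check.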

Secondly, many of the jobs that actually did run through the system encountered an error on CE-sft-brokerinfo with something like:
error while loading shared libraries: libclassad_ns.so.0: cannot open shared object file: No such file or directory
After some googling, it turns out this bug is known about and has been fixed. The fix is adding
gridpath_prepend "LD_LIBRARY_PATH" "/opt/classads/lib64/"
to /etc/profile.d/grid-env.sh. However, at Glasgow we control grid-env.sh through cfengine, so I needed to make the appropriate change there too.
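For anyone wondering what that line achieves, here is a rough sketch of the effect: put the classads directory at the front of LD_LIBRARY_PATH so the loader can find libclassad_ns.so.0. Note that prepend_dir is a hypothetical stand-in written for illustration; it is not gLite's gridpath_prepend, and the real function may behave differently in detail.

```shell
# prepend_dir is a hypothetical stand-in, NOT gLite's gridpath_prepend.
# It puts a directory at the front of a colon-separated variable,
# skipping the change if the directory is already listed.
prepend_dir() {
    var_name=$1
    dir=$2
    eval "current=\${$var_name:-}"
    case ":$current:" in
        *":$dir:"*)
            return 0 ;;    # already present, nothing to do
    esac
    if [ -n "$current" ]; then
        eval "$var_name=\$dir:\$current"
    else
        eval "$var_name=\$dir"
    fi
    export "$var_name"
}

# Example: prepend_dir LD_LIBRARY_PATH /opt/classads/lib64/
```

Skipping the change when the directory is already listed keeps the variable from growing each time the profile script is sourced.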

After going through this over the last few days I stumbled across Ewan's page as he had encountered the exact same issues. So take heed and do a spot of googling first!

There is also a metapackage available for SL5 glite3.2 WNs, which should hopefully contain all the required dependencies. This is located here. The gotcha with this is that you have to install it with yum localinstall or stick it in a yum repo, as rpm -i doesn't work.

I have also just compared what is installed from this against the Atlas SL5 page and there were 4 packages missing: compat-gcc-34-g77, compat-libgcc-296, compat-libstdc++-296, ghostscript-8.15.2
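If you want to repeat that comparison on your own workers, something along these lines would do it. This is a sketch under assumptions: missing_from and the two file names are made up for the example, and the required list has to be built by hand from whatever package list you are checking against.

```shell
# missing_from is a hypothetical helper: print every line of the
# "required" file that has no exact match in the "installed" file.
missing_from() {
    required_file=$1
    installed_file=$2
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file.
    # grep exits 1 when nothing is missing, so mask that with "|| true".
    grep -Fxv -f "$installed_file" "$required_file" || true
}

# On a worker node (file names are placeholders):
#   rpm -qa --qf '%{NAME}\n' > installed.txt
#   missing_from required.txt installed.txt
```

The match is on bare package names, so a versioned entry would need its version stripped before comparing.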

So currently we have ops/dteam jobs running and passing. Atlas software jobs are running and completing, but not successfully working. More digging is required and I will keep you posted.

1 comment:

Christos Triantafyllidis said...

Hey ScotGriders,

interesting post! Have you filed any bug report on your findings?
I mean some of them are very trivial to solve (i.e. the openldap-clients dependency) but MAYBE gLite developers are not aware of them.

Regards,
Christos