Monday, July 06, 2009

Installing (and fixing) a gLite Tar UI on SL5

First, a little background.

The UI is the gLite term for the machine from which you submit jobs (and monitor them, receive output, etc.). This is analogous to the submit machine in Condor, or the head node of a local cluster - except that with the Grid there's no reason you can't submit from one UI, monitor from another and collect output on a third. No reason - except perhaps keeping one's sanity.

Whilst most Grid servers are dedicated machines - occasionally given over to more than one Grid task, but only doing Grid tasks - the UI is a clear contender for being placed on a machine that already has another purpose. In this instance, we have a group of users who have their own cluster and occasionally offload some computations onto the Grid. It would be ideal if they could submit to either their local cluster or the Grid from the same machine. Cluster head nodes aren't terribly portable, so the obvious approach is to turn their existing head node into a gLite UI.

Fortunately, the gLite developers foresaw this possibility, and the UI package is available as a single blob that can be installed by an individual user. So that's what I've done - but there are a few caveats, and a couple of bugs to work around.

The tar UI I used was the gLite 3.2.1 production release. This is still early in the 3.2 life cycle, and not all services are available at 3.2 yet, so there might be a few teething issues when interacting with the older services. At Glasgow we don't have any 3.0 services, which is good, as those are effectively unsupported.

On to the install: download the two tarballs and unpack them into a directory (why two tarballs? One tarball ought to be enough for anyone). I then promptly fell off the end of the documentation - which assumes that you already know a lot about gLite.
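For reference, the setup amounted to something like the following - the tarball names here are placeholders, so substitute whatever the 3.2.1 release page actually calls them:
mkdir -p ~/glite-ui
cd ~/glite-ui
tar xzf glite-UI-3.2.1-0.tar.gz
tar xzf glite-UI-3.2.1-0-external.tar.gz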

What you have to do is produce a file (the site-info.def) that gives the high-level details the UI needs in order to work. This file can be created anywhere (I put it in the same directory I unpacked the tarballs into), as you always give its path to yaim, the tool that uses it.

The first thing you need to put in is the 4 paths listed on the wiki page. Then you need a few other things:
BDII_HOST=svr019.gla.scotgrid.ac.uk
MON_HOST=svr019.gla.scotgrid.ac.uk
PX_HOST=lcgrbp01.gridpp.rl.ac.uk
WMS_HOST="svr022.gla.scotgrid.ac.uk svr023.gla.scotgrid.ac.uk"
RB_HOST=$WMS_HOST
The BDII host is where the UI gets its information from - this should be a 'top level' BDII, not a site BDII. None of us have the faintest clue why it needs the MON host - that's something I'll dig into later. The PX host is the MyProxy server to use by default; that one should be good for anywhere in the UK. The WMS hosts are the replacement for the deprecated (but still needed) RB hosts, and point to the WMS instances to be used for submission by default.
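As a quick sanity check that the BDII you've picked is actually answering (this assumes the openldap client tools are installed, and that the BDII is on the standard port, 2170):
ldapsearch -x -H ldap://svr019.gla.scotgrid.ac.uk:2170 -b o=grid -s base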

One thing I found I needed that wasn't documented was SITE_NAME. I just put the hostname in there - it doesn't appear to be used, but yaim complains if it's not there.
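For the record, that was just something like (using this machine's hostname):
SITE_NAME=golem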

The last thing needed is the list of VOs to be supported on the UI. When deploying a tar UI this will normally be a very short list - one or two, I would expect - so I chose to place them inline. (There is also a mechanism to put the VO specification in a separate directory, which is more useful for shared UI machines.)
VOS="vo.scotgrid.ac.uk"

VO_VO_SCOTGRID_AC_UK_VOMS_SERVERS="vomss://svr029.gla.scotgrid.ac.uk:8443/voms/vo.scotgrid.ac.uk"
VO_VO_SCOTGRID_AC_UK_VOMSES="'vo.scotgrid.ac.uk svr029.gla.scotgrid.ac.uk 15000 /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=svr029.gla.scotgrid.ac.uk/Email=grid-certificate@physics.gla.ac.uk vo.scotgrid.ac.uk'"
VO_VO_SCOTGRID_AC_UK_VOMS_CA_DN="'/C=UK/O=eScienceCA/OU=Authority/CN=UK e-Science CA'"
The VO specification is in two parts: first list the VOs (space separated), and then, for each VO, give the VOMS server that defines the membership of the VO, and the certificate DN of that VOMS server. Note that in the variable names the VO name gets translated to UPPER CASE and all the dots in it become underscores (a fact that's somewhat under-documented, and getting it wrong results in a complaint about a syntactically invalid site-info.def, and no other message...)
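If you want to check what yaim expects for a given VO, the translation is just upper-casing plus swapping dots for underscores, then prefixing VO_ - e.g.:
echo vo.scotgrid.ac.uk | tr '.' '_' | tr '[:lower:]' '[:upper:]'
which gives VO_SCOTGRID_AC_UK, hence the VO_VO_SCOTGRID_AC_UK_* variables above.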

Once that's all in place, it's time to run yaim to configure things (from the dir I unpacked into):
./glite/yaim/bin/yaim -c -s site-info.def -n UI_TAR
Slight problem with installing certificates: By default these go into /etc/grid-security/certificates, but I'm not running as root. As a local user (for the initial testing), I need to tell yaim where to put them instead. In the site-info.def:
X509_CERT_DIR=${INSTALL_ROOT}/certificates
and make that directory, then re-run the yaim command (spelled out below). It chuntered along for a bit, and finished with no errors - I did get a couple of warnings, but nothing that looked like a problem in this case.
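Spelled out, and assuming INSTALL_ROOT is set in the shell to the same path as in site-info.def, that step is just:
mkdir -p ${INSTALL_ROOT}/certificates
./glite/yaim/bin/yaim -c -s site-info.def -n UI_TAR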

Last step - testing. First, load up the installed software:
source $GLITE_EXTERNAL_ROOT/etc/profile.d/grid-env.sh
and install my certificate on there.
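Installing the certificate is the usual Grid dance - the tools expect it under ~/.globus, so something along these lines (the paths to the exported pem files are hypothetical):
mkdir -p ~/.globus
cp /path/to/usercert.pem /path/to/userkey.pem ~/.globus/
chmod 644 ~/.globus/usercert.pem
chmod 400 ~/.globus/userkey.pem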

lcg-infosites ... works
voms-proxy-* ... works
glite-wms-job-submit ... Boom!
glite-wms-job-submit: error while loading shared libraries: libboost_filesystem.so.2: wrong ELF class: ELFCLASS32
Hrm. Looks like a 32/64 bit problem. Some poking later, it turns out that the shell setup script supplied only puts the $GLITE_EXTERNAL_ROOT/usr/lib directory on the library path - and not lib64, which contains the needed 64-bit libraries. A quick hack to grid-env.sh rectified that.
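The hack amounted to something like this (a sketch - the exact line depends on how grid-env.sh builds its paths):
export LD_LIBRARY_PATH=$GLITE_EXTERNAL_ROOT/usr/lib64:$LD_LIBRARY_PATH
Now: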
[scotgrid@golem ~]$ glite-wms-job-submit -a minimaltest.jdl
glite-wms-job-submit: error while loading shared libraries: libicui18n.so.36: cannot open shared object file: No such file or directory
This turns out to be the International Components for Unicode library (at least, I think so). The particularly interesting point is that the only references I can find to these libraries on SL (including one from this very blog) are all about Adobe Acrobat Reader... because that's the most common software that uses them.

I grabbed the RPM from http://linux1.fnal.gov/linux/scientific/5x/x86_64/SL/, and added it to $GLITE_EXTERNAL_ROOT/usr/lib64 by:
cd $GLITE_EXTERNAL_ROOT
rpm2cpio libicu-3.6-5.11.2.x86_64.rpm | cpio -i
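For completeness, minimaltest.jdl is about as simple as a JDL file gets - something along these lines (illustrative, not the exact file):
Executable = "/bin/hostname";
StdOutput = "std.out";
StdError = "std.err";
OutputSandbox = {"std.out", "std.err"};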
And, finally:
[scotgrid@golem ~]$ glite-wms-job-submit -a minimaltest.jdl

Connecting to the service https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server

====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://svr023.gla.scotgrid.ac.uk:9000/VW96yirZ4gG6jVXjx9UwBg

==========================================================================
Jobs submitted from a pure 64-bit SL5 system. Note that the separator character has changed, from a line of '*' to a line of '='.

After chuntering away for a while,

[scotgrid@golem ~]$ glite-wms-job-output https://svr023.gla.scotgrid.ac.uk:9000/VW96yirZ4gG6jVXjx9UwBg

Connecting to the service https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server

Error - Operation Failed
Unable to retrieve the output

[scotgrid@golem ~]$ cd /tmp/jobOutput/
[scotgrid@golem jobOutput]$ ls
scotgrid_VW96yirZ4gG6jVXjx9UwBg


Which is a known problem - the UI reports that collecting the job output failed, but it does in fact succeed. If you don't need to get the OutputSandbox back from the job (e.g. everything is written to an SE), then this isn't a problem.
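If you want the sandbox somewhere more convenient than /tmp/jobOutput, glite-wms-job-output should take a --dir option (I haven't checked whether that avoids the spurious error), e.g.:
glite-wms-job-output --dir ~/grid-output https://svr023.gla.scotgrid.ac.uk:9000/VW96yirZ4gG6jVXjx9UwBg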

Now to post some bug reports on the tar UI package... (Update: This is now bug number 52825 for the configuration and bug 52832 for the ICU package)
