
Friday, January 29, 2010

ScotGrid's shrink wrapped UI

In an effort to reduce the overhead for new external users who wish to submit to Glasgow, I have created a shrink-wrapped gLite UI. This comes in the form of a slimmed-down VirtualBox SL5 image with the gLite UI pre-installed.

The hope is that users from external institutions who wish to run jobs on the EGEE grid, and more specifically at ScotGrid, will be able to take advantage of this. It is of particular importance for external users of Lumerical's FDTD, who are primarily engineers that just want to run the software rather than install an SL5 gLite UI first. The end goal is to extend this to help all our users get up and running as quickly as possible.

This will come pre-installed with Glasgow's submission tools such as gqsub and other more specific user scripts. Those wishing a link to download the VM should drop us an email.

Details of the VM image and of setting up the UI are available on the wiki.

I am still at a loss as to how CERN managed to get their VirtualBox image down to 500MB!

Monday, July 06, 2009

Installing (and fixing) a gLite Tar UI on SL5

First, a little background.

The 'UI' is the gLite term for the machine from which you submit jobs (and monitor them, receive output, etc.). This is analogous to the submit machine in Condor, or the head node of a local cluster - except that with the Grid there is no reason you can't submit on one UI, monitor from another and collect output on a third. No reason - except perhaps keeping one's sanity.

Whilst most Grid servers are dedicated machines - occasionally given over to more than one Grid task, but only doing Grid tasks - the UI is a clear contender for being placed on a machine that already has another purpose. In this instance, we have a group of users who have their own cluster and occasionally offload some computations onto the Grid. It would be ideal if they could submit to either their local cluster or the Grid from the same machine. Cluster head nodes aren't too portable, so the obvious approach is to turn their existing head node into a gLite UI.

Fortunately, the gLite developers foresaw this possibility, and the UI package is available as a single blob that can be installed for an individual user. So that's what I've done - but there are a few caveats, and a couple of bugs to work around.

The tar UI I used was the gLite 3.2.1 production release. This is still early in the 3.2 life cycle, and not all services are available at 3.2, so there might be a few teething issues when interacting with the older services. At Glasgow we don't have any 3.0 services, which is good, as they're really unsupported now.

On to the install: download the two tarballs, and unpack them into a directory (why two tarballs? One tarball ought to be enough for anyone). I then promptly fell off the end of the documentation - which assumes that you already know a lot about gLite.
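
For reference, the unpack step amounts to no more than the following (the tarball names and target directory here are illustrative, not the exact 3.2.1 filenames):
mkdir -p ~/glite-ui && cd ~/glite-ui
# both tarballs unpack into the same tree
tar -xzf glite-UI-3.2.1-external.tar.gz
tar -xzf glite-UI-3.2.1.tar.gz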

What you have to do is produce a file (the site-info.def) that gives the high-level details the UI needs to know in order to work. This file can be created anywhere (I put it in the same directory I unpacked the tarballs into), as you always give its path to yaim, the tool that uses it.

The first thing you need to put in is the 4 paths listed on the wiki page. Then you need a few other things:
BDII_HOST=svr019.gla.scotgrid.ac.uk
MON_HOST=svr019.gla.scotgrid.ac.uk
PX_HOST=lcgrbp01.gridpp.rl.ac.uk
WMS_HOST="svr022.gla.scotgrid.ac.uk svr023.gla.scotgrid.ac.uk"
RB_HOST=$WMS_HOST
The BDII host is where the UI gets its information from - this should be a 'top level' BDII, not a site BDII. None of us have the faintest clue why it needs the MON host - that's something I'll dig into later. The PX host is the MyProxy server to use by default; that one should be good for anywhere in the UK. The WMS host is the replacement for the deprecated (but still needed) RB host, and points to the WMS to be used for submission (by default).
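
If you want to sanity-check the BDII setting before running yaim, you can query it directly with ldapsearch (port 2170 and the LDAP base below are the usual gLite defaults, so treat them as an assumption rather than gospel):
# a top-level BDII should return site entries from right across the grid, not just one site
ldapsearch -x -H ldap://svr019.gla.scotgrid.ac.uk:2170 \
    -b mds-vo-name=local,o=grid '(objectClass=GlueSite)' GlueSiteName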

One thing I found I needed that wasn't documented was a SITE_NAME. I just put the hostname in there - it doesn't appear to be used, but yaim complains if it's not there.
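
So a single line along these lines (the value is purely illustrative) keeps yaim happy:
SITE_NAME=glasgow-tar-ui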

The last thing needed is a list of the VOs to be supported on the UI. When deploying a tar UI this will normally be a very small list - one or two, I would expect. Therefore I chose to place them inline. There is a mechanism to put the VO specification in a separate directory, which is used for shared UI machines.
VOS="vo.scotgrid.ac.uk"

VO_VO_SCOTGRID_AC_UK_VOMS_SERVERS="vomss://svr029.gla.scotgrid.ac.uk:8443/voms/vo.scotgrid.ac.uk"
VO_VO_SCOTGRID_AC_UK_VOMSES="'vo.scotgrid.ac.uk svr029.gla.scotgrid.ac.uk 15000 /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=svr029.gla.scotgrid.ac.uk/Email=grid-certificate@physics.gla.ac.uk vo.scotgrid.ac.uk'"
VO_VO_SCOTGRID_AC_UK_VOMS_CA_DN="'/C=UK/O=eScienceCA/OU=Authority/CN=UK e-Science CA'"
The VO specification is in two parts - first we have to list the VOs (as a space-separated list), and then, for each VO, give the VOMS server that defines the membership of the VO and the certificate DN for the VOMS server. Note that the VO name gets translated to UPPER CASE and all the dots in it become underscores (a fact that's somewhat underdocumented, and results in a complaint about a syntactically invalid site-info.def with no other message...).
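
The translation rule is easy to get wrong by hand; a throwaway one-liner like this (just an illustration of the mangling, not part of the install) shows what the middle of the variable names should look like:
# prints VO_SCOTGRID_AC_UK - prepend VO_ and append _VOMS_SERVERS etc. to get the yaim variable names
echo vo.scotgrid.ac.uk | tr 'a-z.' 'A-Z_'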

Once that's all in place, it's time to run yaim to configure things (from the dir I unpacked into):
./glite/yaim/bin/yaim -c -s site-info.def -n UI_TAR
Slight problem with installing certificates: By default these go into /etc/grid-security/certificates, but I'm not running as root. As a local user (for the initial testing), I need to tell yaim where to put them instead. In the site-info.def:
X509_CERT_DIR=${INSTALL_ROOT}/certificates
then make that directory and re-run the yaim command. It chunters along for a bit, and then finishes with no errors - I did get a couple of warnings, but nothing that looked like a problem in this case.
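
For the record, that extra step is just the following, assuming INSTALL_ROOT points at the same directory the tarballs were unpacked into (the path is an assumption on my part):
INSTALL_ROOT=$HOME/glite-ui   # must match the value in site-info.def
mkdir -p ${INSTALL_ROOT}/certificates
./glite/yaim/bin/yaim -c -s site-info.def -n UI_TAR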

Last step - testing. First, load up the installed software:
source $GLITE_EXTERNAL_ROOT/etc/profile.d/grid-env.sh
and install my certificate there.
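
Installing the certificate follows the usual Globus conventions rather than anything tar-UI specific - roughly:
mkdir -p ~/.globus
cp usercert.pem userkey.pem ~/.globus/
chmod 644 ~/.globus/usercert.pem   # the certificate can be world readable
chmod 400 ~/.globus/userkey.pem    # the key must not be readable by others, or the proxy tools will complain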

lcg-infosites ... works
voms-proxy-* ... works
glite-wms-job-submit ... Boom!
glite-wms-job-submit: error while loading shared libraries: libboost_filesystem.so.2: wrong ELF class: ELFCLASS32
Hrm. Looks like a 32/64-bit problem. Some poking around later, it turns out that the shell setup script supplied points only to the $GLITE_EXTERNAL_ROOT/usr/lib directory - and not to lib64, which contains the needed 64-bit libraries. A quick hack to grid-env.sh, and that's rectified.
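
The hack boils down to something like the following line in grid-env.sh (this is the idea rather than the literal patch; the shipped script may build the path up differently):
# search the 64bit libraries as well as the 32bit ones
export LD_LIBRARY_PATH="$GLITE_EXTERNAL_ROOT/usr/lib64:$LD_LIBRARY_PATH"
With the 64-bit libraries on the path, trying again: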
[scotgrid@golem ~]$ glite-wms-job-submit -a minimaltest.jdl
glite-wms-job-submit: error while loading shared libraries: libicui18n.so.36: cannot open shared object file: No such file or directory
This turns out to be the International Components for Unicode (at least, I think so). The particularly interesting point is that the only references I can find to these libraries on SL - including one from this very blog - are all about Adobe Acrobat Reader, because that's the most common software that uses them.

I grabbed the RPM from http://linux1.fnal.gov/linux/scientific/5x/x86_64/SL/, and added it to $GLITE_EXTERNAL_ROOT/usr/lib64 by:
cd $GLITE_EXTERNAL_ROOT
rpm2cpio libicu-3.6-5.11.2.x86_64.rpm | cpio -i
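
A quick way to check whether anything else is still missing (generic ldd usage, nothing gLite-specific):
# any "not found" lines mean another library needs the same treatment
ldd $(which glite-wms-job-submit) | grep "not found"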
And, finally:
[scotgrid@golem ~]$ glite-wms-job-submit -a minimaltest.jdl

Connecting to the service https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server

====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://svr023.gla.scotgrid.ac.uk:9000/VW96yirZ4gG6jVXjx9UwBg

==========================================================================
Jobs submitted from a pure 64 bit SL5 system. Note that the separator character has changed, from being a line of '*' to a line of '='.

After chuntering away for a while,

[scotgrid@golem ~]$ glite-wms-job-output https://svr023.gla.scotgrid.ac.uk:9000/VW96yirZ4gG6jVXjx9UwBg

Connecting to the service https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server

Error - Operation Failed
Unable to retrieve the output

[scotgrid@golem ~]$ cd /tmp/jobOutput/
[scotgrid@golem jobOutput]$ ls
scotgrid_VW96yirZ4gG6jVXjx9UwBg


This is a known problem - the UI reports that collecting the job output failed, but it does in fact succeed. If you don't need to get the OutputSandbox back from the job (e.g. it's all written to an SE), then this isn't a problem.

Now to post some bug reports on the tar UI package... (Update: This is now bug number 52825 for the configuration and bug 52832 for the ICU package)

Wednesday, April 02, 2008

Only one bite at the cherry...

I have modified the default RetryCount on our UIs to zero retries. Automatic retries were actually working quite well for us when we were losing a lot of nodes to MCE errors (in the days before the upgrade to SL4, x86_64) - users' jobs would automatically rerun if they got lost and there was no need for them to worry about failures. However, recently we have seen users submitting more problematic jobs to the cluster - some fail to start at all, some run off into wallclock limits, others stall half way through. Often we have to gut the batch system with our special spoon, and in that case having to do it four times because the RB/WMS keeps resubmitting the job is less than helpful.
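
Users who genuinely want automatic resubmission for a particular job can still ask for it per-job in their JDL, which overrides the UI default; a minimal sketch (attribute values are illustrative):
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
RetryCount    = 3;   // per-job override of the UI-wide default of 0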

For once cfengine's editfiles stanza was useful and a simple:

ui::
{ /opt/glite/etc/glite_wmsui_cmd_var.conf
ReplaceFirst "RetryCount\s+=\s+[1-9];" With "RetryCount = 0;"
}
{ /opt/edg/etc/edg_wl_ui_cmd_var.conf
ReplaceFirst "RetryCount\s+=\s+[1-9];" With "RetryCount = 0;"
}

got the job done.
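
A quick check on a UI that the edit actually took (same paths as in the stanza above):
grep RetryCount /opt/glite/etc/glite_wmsui_cmd_var.conf /opt/edg/etc/edg_wl_ui_cmd_var.conf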

Tuesday, September 25, 2007

SL4 x86_64 UI now Available

I reinstalled the site's svr020 UI on Saturday. This involved an incredible amount of pain related to the bizarre inability of SL4 to properly install GRUB on a linux software RAID partition. Although the machine would install absolutely fine, on reboot it would just halt after the GRUB prompt.

In the end, after tearing my hair out several times (I was working from home on Friday) and trying as many tricks as I could (I even DBANed the disks), I had to retreat from running software RAID1 and fall back to running on only one of the SCSI disks. (As an

That finally gave me a base SL4 install I could work with.

After that, the installation of the SL4 32bit UI was easy - running through cfengine (one little caveat was that the gsisshd restart would kill off the normal sshd on port 22, so that has been disabled).

Then I found that job submission didn't work, because it relies on a 32-bit python/C module and the default python is now 64-bit. The advice on ROLLOUT was to put a 32-bit python higher in the path than /usr/bin/python. This seemed rather bad advice to me, as we'd really like to have 64-bit python - it is a 64-bit system, after all! So instead I decided to change the magic bang path to specifically reference /usr/bin/python32.

Initially I tried to use cfengine's editfiles facility to do this. However, anything which is not a completely trivial modification is rather horrendous to do in cfengine (it reminded me of ed, actually), so I eventually abandoned this and instead wrote a three-line perl special in the cfengine script sources, which is called after the RPMs are installed. (In addition to changing the python interpreter it disables the tk graphical interface, for which we don't have any users anyway.)
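
The perl special itself isn't reproduced here, but its effect is roughly that of a one-liner like this, run over the affected scripts (the path below is a placeholder, not the real file list, and the tk change is left out):
# rewrite the interpreter line to use the 32bit python explicitly
perl -pi -e 's{^#!/usr/bin/python\b}{#!/usr/bin/python32}' /opt/glite/bin/some-wms-ui-script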

Finally, I upgraded ganga, and this went fine - ganga runs quite happily with 64 bit python (normally this wouldn't deserve special note, but in the grid world flowers and champagne are in order).