Monday, July 06, 2009

Deflected Cosmic Rays...


This is the second short "when you're good..." post. During the RAL machine room move, we tested distributing ATLAS cosmics AOD and DPD data from CERN->GLASGOW->UK T2s. After some tweaking of the T2 FTS channels at CERN and tinkering in DDM this has worked a charm. Data distribution in the UK has gone very well throughout the current combined cosmics data-taking runs.

This is the first time that we tried circumventing the T1 for such an organised data distribution and it was a real success for the UK, ATLAS and Glasgow.

When you're good, you're Glasgow...


There hasn't been much time to write in the blog recently, STEP09 madness and all. However, it is wonderful to see that Glasgow was the top ATLAS T2 for analysis during the STEP09 challenges. We analysed more than 1.8B events, mostly through panda, with a 98% success rate.

We also took the largest fraction of data of any UK T2, 40%, and succeeded in getting all the data we were sent (we had a little anxiety on the final weekend and definitely want to increase our network headroom).

Sam and I wrote a full report on our experiences and how we used the opportunity to really probe the limits of the current cluster.

For the future, we really have to worry about how to maintain the i/o rate into the CPUs as the number of cores rises.

Installing (and fixing) a gLite Tar UI on SL5

First, a little background.

The UI machine is the gLite term for the machine from which you submit jobs (and monitor, receive output etc). This is analogous to the submit machine in Condor, and the head node for a local cluster - except that with the Grid, there is no reason that you can't submit on one UI, monitor from another and collect output on a third. No reason - except perhaps for keeping one's sanity.

Whilst most of the Grid servers are normally dedicated machines (occasionally given over to more than one Grid task, but only doing Grid tasks), the UI is a clear contender for being placed on a machine that already has another purpose. In this instance, we have a group of users that have their own cluster, and occasionally offload some computations onto the Grid. It would be ideal if they could submit to either their local cluster or the Grid from the same machine. Cluster head nodes aren't too portable, so the obvious approach is to turn their existing head node into a gLite UI.

Fortunately, the gLite developers foresaw this possibility, and the UI package is available in a single blob that can be installed for an individual user. So that's what I've done - but there are a few caveats, and a couple of bugs to work around.

The tar UI I used was the gLite 3.2.1 production release. This is still early in the 3.2 life cycle, and not all services are available at 3.2, so there might be a few teething issues here, interacting with the older services. At Glasgow we don't have any 3.0 services, which is good, as they're really unsupported.

On to the install: Download the two tarballs, and unpack into a directory (why 2 tarballs? One tarball ought to be enough for anyone). I then promptly fell off the end of the documentation - which assumes that you already know a lot about gLite.

What you have to do is produce a file (the site-info.def) that gives some high level details of what the UI needs to know to work. This file can be created anywhere (I put it in the same directory I unpacked the tarballs into), as you always give its path to yaim, the tool that uses it.

The first thing you need to put in is the 4 paths listed on the wiki page. Then you need a few other things:
BDII_HOST=svr019.gla.scotgrid.ac.uk
MON_HOST=svr019.gla.scotgrid.ac.uk
PX_HOST=lcgrbp01.gridpp.rl.ac.uk
WMS_HOST="svr022.gla.scotgrid.ac.uk svr023.gla.scotgrid.ac.uk"
RB_HOST=$WMS_HOST
The BDII host is where the UI gets its information from - this should be a 'top level' BDII, not a site BDII. None of us have the faintest clue why it needs the MON host - that's something I'll dig into later. The PX host is the MyProxy server to use by default. That one should be good for anywhere in the UK. The WMS host is the replacement for the deprecated (but still needed) RB hosts, and points to the WMS to be used for submission (by default).

One thing I found I needed that wasn't documented was a SITE_NAME. I just put the hostname in there - it doesn't appear to be used, but yaim complains if it's not there.
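Since yaim's only feedback on a missing setting can be a terse complaint, a quick pre-flight check of the site-info.def can save a round trip. This is a minimal sketch, not part of yaim: the list of required keys is just the ones mentioned in this post (the four install paths from the wiki page are deliberately not guessed at), and the parser only handles plain KEY=VALUE lines with no shell expansion.

```python
# Keys this post says the tar UI needs. The four install paths from the wiki
# page are not listed in the post, so they are deliberately left out here.
REQUIRED_KEYS = ["BDII_HOST", "MON_HOST", "PX_HOST", "WMS_HOST", "SITE_NAME"]


def parse_site_info(text):
    """Parse simple KEY=VALUE lines; no shell expansion is attempted."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    settings = parse_site_info(text)
    return [key for key in required if not settings.get(key)]


example = """\
BDII_HOST=svr019.gla.scotgrid.ac.uk
MON_HOST=svr019.gla.scotgrid.ac.uk
PX_HOST=lcgrbp01.gridpp.rl.ac.uk
WMS_HOST="svr022.gla.scotgrid.ac.uk svr023.gla.scotgrid.ac.uk"
RB_HOST=$WMS_HOST
"""
print(missing_keys(example))  # SITE_NAME is the one that is missing
```

Run against the fragment above, it flags the undocumented SITE_NAME, which is exactly the one yaim grumbled about.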

The last thing needed is a list of the VOs to be supported on the UI. When deploying a tar UI this will normally be a very small list - one or two I would expect. Therefore I chose to place them inline. There is a mechanism to put the VO specification in a separate directory, which is used for shared UI machines.
VOS="vo.scotgrid.ac.uk"

VO_VO_SCOTGRID_AC_UK_VOMS_SERVERS="vomss://svr029.gla.scotgrid.ac.uk:8443/voms/vo.scotgrid.ac.uk"
VO_VO_SCOTGRID_AC_UK_VOMSES="'vo.scotgrid.ac.uk svr029.gla.scotgrid.ac.uk 15000 /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=svr029.gla.scotgrid.ac.uk/Email=grid-certificate@physics.gla.ac.uk vo.scotgrid.ac.uk'"
VO_VO_SCOTGRID_AC_UK_VOMS_CA_DN="'/C=UK/O=eScienceCA/OU=Authority/CN=UK e-Science CA'"
VO specification is in two parts - first we have to list the VOs (space separated list), and then, for each VO, give the VOMS server that defines the membership of the VO, and the certificate DN for the VOMS server. Note that the VO name gets translated to UPPER CASE and all the dots in it become underscores (a fact that's somewhat underdocumented, and results in a complaint about a syntactically invalid site-info.def, and no other message ...)
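The name mangling is easy to get wrong by hand, so here is a small sketch of the rule as described above (upper case, dots to underscores, with a VO_ prefix as seen in the example settings). The function names are my own, and it only covers the transformation this post observed; yaim may well apply further rules to other characters.

```python
def yaim_vo_prefix(vo_name):
    """Turn a VO name into the variable prefix used in site-info.def."""
    return "VO_" + vo_name.upper().replace(".", "_")


def yaim_vo_variable(vo_name, suffix):
    """Build a full per-VO variable name, e.g. for VOMS_SERVERS."""
    return f"{yaim_vo_prefix(vo_name)}_{suffix}"


for suffix in ("VOMS_SERVERS", "VOMSES", "VOMS_CA_DN"):
    print(yaim_vo_variable("vo.scotgrid.ac.uk", suffix))
```

Note the doubled VO_VO_ at the start: the first is the fixed prefix, the second comes from the "vo." in the VO name itself.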

Once that's all in place, it's time to run yaim to configure things (from the dir I unpacked into):
./glite/yaim/bin/yaim -c -s site-info.def -n UI_TAR
Slight problem with installing certificates: By default these go into /etc/grid-security/certificates, but I'm not running as root. As a local user (for the initial testing), I need to tell yaim where to put them instead. In the site-info.def:
X509_CERT_DIR=${INSTALL_ROOT}/certificates
and make that directory, and re-run the yaim command. Chuntering along for a bit, and then finished with no errors - I did get a couple of warnings, but nothing that looked like a problem in this case.

Last step - testing. First, load up the installed software:
. $GLITE_EXTERNAL_ROOT/etc/profile.d/grid-env.sh
and install my certificate on there.

lcg-infosites ... works
voms-proxy-* ... works
glite-wms-job-submit ... Boom!
glite-wms-job-submit: error while loading shared libraries: libboost_filesystem.so.2: wrong ELF class: ELFCLASS32
Hrm. Looks like a 32/64 bit problem. Some pokage later, and it turns out that the shell setup script supplied points only to the $GLITE_EXTERNAL_ROOT/usr/lib directory - and not the lib64, containing the needed libs. A quick hack onto the grid-env.sh, and that's rectified. Now:
[scotgrid@golem ~]$ glite-wms-job-submit -a minimaltest.jdl
glite-wms-job-submit: error while loading shared libraries: libicui18n.so.36: cannot open shared object file: No such file or directory
This turns out to be the International Components for Unicode (at least, I think so). The particularly interesting point about this is that the only references I can find to these libraries on SL include one from this very blog and they are all about Adobe Acrobat Reader... because that's the most common software that uses it.

I grabbed the RPM from http://linux1.fnal.gov/linux/scientific/5x/x86_64/SL/, and added it to $GLITE_EXTERNAL_ROOT/usr/lib64 by:
cd $GLITE_EXTERNAL_ROOT
rpm2cpio libicu-3.6-5.11.2.x86_64.rpm | cpio -i
And, finally:
[scotgrid@golem ~]$ glite-wms-job-submit -a minimaltest.jdl

Connecting to the service https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server

====================== glite-wms-job-submit Success ======================

The job has been successfully submitted to the WMProxy
Your job identifier is:

https://svr023.gla.scotgrid.ac.uk:9000/VW96yirZ4gG6jVXjx9UwBg

==========================================================================
Jobs submitted from a pure 64 bit SL5 system. Note that the separator character has changed, from being a line of '*' to a line of '='.
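For anyone chasing a similar "wrong ELF class" error, the class of a library can be read straight from its header: the file starts with the magic bytes 0x7f 'E' 'L' 'F', and the next byte is 1 for 32 bit or 2 for 64 bit. The sketch below is a hypothetical helper of mine, not part of the tar UI; it makes it easy to see which of the lib and lib64 directories a given .so really belongs in.

```python
ELF_MAGIC = b"\x7fELF"
ELF_CLASSES = {1: "ELFCLASS32", 2: "ELFCLASS64"}


def elf_class(header):
    """Return the ELF class name from the first bytes of a file, or None."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return None  # not an ELF file (or too short to tell)
    return ELF_CLASSES.get(header[4])


def elf_class_of_file(path):
    """Read just enough of a file on disk to classify it."""
    with open(path, "rb") as handle:
        return elf_class(handle.read(5))


print(elf_class(b"\x7fELF\x01"))  # a 32 bit header
print(elf_class(b"\x7fELF\x02"))  # a 64 bit header
```

Pointing elf_class_of_file at the libboost_filesystem.so.2 that the loader picked up should report ELFCLASS32 on a setup with the problem described above.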

After chuntering away for a while,

[scotgrid@golem ~]$ glite-wms-job-output https://svr023.gla.scotgrid.ac.uk:9000/VW96yirZ4gG6jVXjx9UwBg

Connecting to the service https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server

Error - Operation Failed
Unable to retrieve the output

[scotgrid@golem ~]$ cd /tmp/jobOutput/
[scotgrid@golem jobOutput]$ ls
scotgrid_VW96yirZ4gG6jVXjx9UwBg


Which is a known problem - the UI reports that collecting job output failed, but it does succeed. If you don't need to get the OutputSandbox from the job (e.g. it's all written to an SE), then this isn't a problem.
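Until that is fixed, a script wrapping glite-wms-job-output can't trust the error message and has to look for the directory instead. Here is a rough sketch of that check. The naming rule (user name, underscore, last part of the job identifier, under /tmp/jobOutput) is inferred from the single example above and is an assumption, as are the function names.

```python
import os
import posixpath
from urllib.parse import urlparse


def expected_output_dir(job_id, user, base="/tmp/jobOutput"):
    """Guess the output directory from the job identifier URL and user name."""
    unique = urlparse(job_id).path.strip("/")
    return posixpath.join(base, f"{user}_{unique}")


def output_was_retrieved(job_id, user, base="/tmp/jobOutput"):
    """True if the expected directory exists, whatever the UI reported."""
    return os.path.isdir(expected_output_dir(job_id, user, base))


job = "https://svr023.gla.scotgrid.ac.uk:9000/VW96yirZ4gG6jVXjx9UwBg"
print(expected_output_dir(job, "scotgrid"))
```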

Now to post some bug reports on the tar UI package... (Update: This is now bug number 52825 for the configuration and bug 52832 for the ICU package)

Thursday, July 02, 2009

NGS and EGEE Software Tags

After I reinstalled svr021 (CE) we lost some good work carried out by Andy Elwell to allow NGS software tags to be published alongside glite ones through the BDII. I found the original post and re-installed the patch.

Wouldn't it be nice if this was available directly from glite? Well, it is now!

The new script will correctly report information from the ngs-uee-gip-plugin plug-in that the NGS sites use to report applications installed under /usr/ngs.

Jason created the patch, I tested it, and Laurence Field at CERN has kindly merged the changes into gLite proper and has made an updated RPM available from...

http://etics-repository.cern.ch:8080/repository/download/registered/org.glite/glite-info-generic/2.0.2/noarch/glite-info-generic-2.0.2-5.noarch.rpm

so if you are an NGS affiliate site who runs a glite stack, the rpm above will allow both sets of software tags to be advertised through your BDII. If you try to install ngs-uee-gip-plugin without the new version of glite-info-generic your ngs tags will replace your glite software tags or vice-versa. Now they happily merge rather than replace.

MPI really kicks off at Glasgow

It seems there is a real appetite for MPI codes on ScotGrid at the moment. First off there was Optics running Lumerical's FDTD and now we have UKQCD running Chroma. Next up is an MPICH install of CASTEP for Solid State Physics. So nearly 2 years after it was first enabled at ScotGrid, MPI is finally seeing its first tour of duty. Better late than never. We still have some kinks to iron out, such as better scheduling of MPI on our cluster, but early results from benchmarking are promising even for an ethernet based MPI solution!

Monday, June 22, 2009

Bright and creamy MPI

So, as of the last time MPI was mentioned, it was working. Well, it looks like it wasn't getting much use, because over the last year or so, it seems to have fallen into disrepair.

We'd ended up with MPIexec not being installed on the worker nodes, which was blocking the setup of the processor nodes. This even prevented a single process MPI job from running, because that still used MPIexec. In the end, this particular problem was resolved by installing it again (after some careful ramp up to make sure it didn't knock anything else off).

The phrasing of that last sentence is deliberately precise: it turned out that there was another problem lurking in the swamp water that is middleware. In order to test the install of MPIexec, I grabbed a worker node that was out of production for the HEP-SPEC benchmarking. This had, of course, got a new install of the worker node packages, in order to give a consistent platform with other sites.

Experienced Grid hands might just be able to predict what comes next...

After installing MPIexec on that node, and then restricting that node to just our test VO (Maui is awesome for this sort of tweak), we noticed that it wasn't accepting any jobs. Specifically, jobs were arriving, but failing immediately. Cue finger pointing at MPIexec, and removal of it.

Didn't help.

In the end, Mike resolved this one: an incompatibility between the Torque server and the Torque clients in the new worker node package. Once that was resolved, MPIexec went back on, and it was all working fine. Roll out across the cluster, finger crossing and no problems: MPI back in business.

The next step was to actually run MPI jobs - took a couple of attempts with mpi-start, but got there. One problem we have is that the WMS will not send MPI jobs to a site that declares that it is 'torque'. It will only send jobs to sites that declare the LRMS to be 'pbs' or 'lsf'. Given that torque is identical to PBS (and more common!), that's a bit silly. This is a known bug that's been open for 4 years with a patch available, which is a bit ridiculous.

There is a workaround, where you can tell the WMS to use a specific LRMS, but you also have to specify the target CE - which kind of defeats much of the point of the WMS...

Fortunately, using the CREAM CE sidesteps most of these issues. Alas, the latest WMS package doesn't work properly with CREAM CEs, so we had to mark our CREAM CE to be 'Special', not 'Production' (effectively disabling WMS submission to CREAM). Not too big a problem, as we can do direct submission to the CREAM CE for our specific use case, but it's not great in the long term.

Our specific use case is the Lumerical FDTD package, which is installed and working at Glasgow, and has been used by end users. There's some trickiness involved in this, as we're not passing in source code, as mpi-start expects, so I'll write up a bit more on how it all fits together at some point.

There might be some Maui fiddling in the imminent future, to assist it to pack MPI jobs on to as few physical machines as possible. The key point is that MPI has been used by end users at Glasgow, which bodes well.

Monday, May 18, 2009

Cream in Action : Local Users & Glexec

At Glasgow we have now rolled out a production Cream instance open only to dteam, ops, vo.scotgrid.ac.uk and our newly created vo.optics.ac.uk (to support the optics user community and Lumerical's FDTD software). This is svr014, and it looks like CMS are now looking for production Cream instances too. So it may see further action.

One thing that we have done in the past with our local user community is tweak LCMAPS such that specific local users do not use a pool account for their jobs. This was documented in a previous blog post. With cream I thought we should at least attempt to follow the same model for local users.

However, Cream uses glexec with LCMAPS, and unfortunately the version of glexec that currently ships with the cream CE does not map to local users correctly. Thanks to Oscar and Mischa at Nikhef for getting me the right versions of glexec. Here are the versions required to do the following mapping in LCMAPS:


glite-security-glexec-0.6.8-2.slc4.i386.rpm
glite-security-lcmaps-1.4.7-1.slc4.i386.rpm
glite-security-lcmaps-plugins-basic-1.3.10-2.slc4.i386.rpm


These are all in pre-production, so should be out soon in a full cream update.
When these rpms are installed, take care to set the setuid bits, as these are lost during the update.

-rwsr-sr-x 1 root glexec 65620 Apr 30 15:56 /opt/glite/sbin/glexec

With these installed the following lcmaps policy can be added/amended to /opt/glite/etc/lcmaps/lcmaps-suexec.db


localuseraccount = "lcmaps_localuseraccount.mod -gridmapfile /usr/local/etc/grid-mapfile-local"

glexec_get_account:
proxycheck -> localuseraccount
localuseraccount -> good | vomslocalgroup
vomslocalgroup -> vomspoolaccount | poolaccount
vomspoolaccount -> good | vomslocalaccount
vomslocalaccount -> good | poolaccount
poolaccount -> good


This policy, when moved so that it is executed first in the list, will map any users in the grid-mapfile-local to their local user accounts rather than a pool account.
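To see which route a given user takes through that policy, it helps to walk it by hand. The sketch below is a toy evaluator of mine, not LCMAPS code. It assumes the usual reading of the rule syntax: "a -> b | c" means run a, then go to b if it succeeded or c if it failed; with no "|" branch a failure ends the policy; and a module with no rule of its own (here "good") decides the result. The outcomes passed in are made up for illustration.

```python
POLICY = """\
proxycheck -> localuseraccount
localuseraccount -> good | vomslocalgroup
vomslocalgroup -> vomspoolaccount | poolaccount
vomspoolaccount -> good | vomslocalaccount
vomslocalaccount -> good | poolaccount
poolaccount -> good
"""


def parse_policy(text):
    """Return (first_module, {module: (on_success, on_failure)})."""
    rules, first = {}, None
    for line in text.splitlines():
        if "->" not in line:
            continue
        module, branches = (part.strip() for part in line.split("->", 1))
        on_success, _, on_failure = (p.strip() for p in branches.partition("|"))
        rules[module] = (on_success, on_failure or None)
        if first is None:
            first = module
    return first, rules


def evaluate(text, outcomes):
    """Walk the policy; outcomes maps module name -> True/False (default False)."""
    module, rules = parse_policy(text)
    path = []
    while module is not None:
        path.append(module)
        succeeded = outcomes.get(module, False)
        if module not in rules:
            return succeeded, path  # no further rule: this module decides
        on_success, on_failure = rules[module]
        module = on_success if succeeded else on_failure
    return False, path  # a module failed and had no failure branch


# A user listed in grid-mapfile-local: the local account mapping wins.
print(evaluate(POLICY, {"proxycheck": True, "localuseraccount": True, "good": True}))
# A user not in that file falls through to the VOMS pool account mapping.
print(evaluate(POLICY, {"proxycheck": True, "vomslocalgroup": True,
                        "vomspoolaccount": True, "good": True}))
```

The first call stops after localuseraccount, the second carries on through vomslocalgroup and vomspoolaccount, which is the behaviour the tweak is after.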

This 'tweak' seems to work, but as I discovered, Cream does not really like you doing this and you have to be very careful about the primary group of the user that glexec transforms you to. In Cream, /opt/glite/var/cream_sandbox is the directory where the sandbox files are staged on the CREAM CE. This contains a set of directories, created I believe by yaim, named after each of the user/role combinations. For example


drwxrwx--- 2 tomcat scotg 4096 Apr 28 12:42 scotg
drwxrwx--- 2 tomcat scotgprd 4096 Apr 28 12:42 scotgprd
drwxrwx--- 2 tomcat scotgsgm 4096 Apr 28 12:42 scotgsgm

dev011:/opt/glite/var/cream_sandbox/scotg# ls -la
total 24
drwxrwx--- 3 tomcat scotg 4096 May 18 14:14 .
drwxrwxr-x 81 tomcat tomcat 4096 May 18 14:10 ..
drwx------ 3 scotg094 scotg 4096 May 18 14:14 C_UK_O_eScience_OU_Glasgow_L_Compserv_CN_douglas_mcnab_vo.scotgrid.ac.uk_Role_NULL_Capability_NULL


Note that these are all owned by the tomcat user and the group is in effect the grid group. So when you are not using any customised local users, glexec maps you via your voms extension (e.g. vo.scotgrid.ac.uk) to scotg001, a member of the scotg group, and you end up in the scotg directory. Also note the permission of the directory named after your proxy: 700, meaning there are no group read/write permissions on the files contained within the directories.

With the local user 'tweaked' LCMAPS and my vo.scotgrid.ac.uk proxy (mapped to gla057/scotg), it attempts to stage the input files to scotg but fails like this:

2009-05-18 14:20:18,983 INFO - Sending [/clusterhome/home/gla057/lumerical/paralleltest.fsp] to [gsiftp://dev011.gla.scotgrid.ac.uk/opt/glite/var/cream_sandbox/scotg/C_UK_O_eScience_OU_Glasgow_L_Compserv_CN_douglas_mcnab_vo.scotgrid.ac.uk_Role_NULL_Capability_NULL/CREAM679019987/ISB/paralleltest.fsp]...
2009-05-18 14:20:18,984 DEBUG - ftpclient::put() - dst=[gsiftp://dev011.gla.scotgrid.ac.uk/opt/glite/var/cream_sandbox/scotg/C_UK_O_eScience_OU_Glasgow_L_Compserv_CN_douglas_mcnab_vo.scotgrid.ac.uk_Role_NULL_Capability_NULL/CREAM679019987/ISB/paralleltest.fsp]
2009-05-18 14:20:19,761 ERROR - data_cb() - globus_ftp_client: the server responded with an error
2009-05-18 14:20:19,761 ERROR - done_cb() - globus_ftp_client: the server responded with an error
2009-05-18 14:20:19,764 FATAL - Error sending file [/clusterhome/home/gla057/lumerical/paralleltest.fsp]


This was very confusing at first, but it becomes clearer when you actually try a globus-url-copy or an uberftp, which I presume is what the CREAM UI is trying to do. You see that it is in fact using your proxy on the client side to map you to a pool account and gsiftp the files to CREAM. From what I could see it was using scotg094. On the server side, after applying the local user 'tweak', glexec was actually interacting with cream to build the sandbox directories as a different user. This interaction can be seen here in /opt/glite/etc/glite-ce-cream/cream-glexec.sh


drwx------ 3 gla057 scotg 4096 May 18 14:20 C_UK_O_eScience_OU_Glasgow_L_Compserv_CN_douglas_mcnab_vo.scotgrid.ac.uk_Role_NULL_Capability_NULL


So the gsiftp could not write, as the directory owner was no longer the pool user and there is no group write permission on the directories contained within the sandbox. I was able to get round this by patching /opt/glite/etc/glite-ce-cream/cream-glexec.sh to relax the permissions from 700 to 770, so that members of the same group can read/write/execute in the sandbox directory. I am not entirely happy about this, though, as it could be a security concern.

Now this all worked because my local user gla057 still has a primary group that matches the pool accounts' primary group of scotg. However, we have other local users that have a unix group glee. This does not match any of the primary groups of the pool accounts available to the VO that they are a member of: nanocmos. I thought the quick win would be to give the nano pool accounts an additional group of glee.
But it turns out that globus-url-copy, uberftp etc. do not understand the concept of secondary groups when gsiftp'ing. So no luck there.
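The whole saga boils down to a plain owner/group/other permission check in which only the primary group counts. The sketch below models just that; it is an illustration of the behaviour observed above, not how gsiftp is implemented. The last example uses made-up account names ("localuser", "nano001") standing in for a glee user and a nanocmos pool account.

```python
import stat


def can_write(mode, dir_owner, dir_group, user, primary_group):
    """Classic owner/group/other check, looking only at the primary group."""
    if user == dir_owner:
        return bool(mode & stat.S_IWUSR)
    if primary_group == dir_group:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


# Sandbox created as gla057:scotg, written to by pool account scotg094 (group scotg).
print(can_write(0o700, "gla057", "scotg", "scotg094", "scotg"))  # False
print(can_write(0o770, "gla057", "scotg", "scotg094", "scotg"))  # True
# Hypothetical names: a glee-owned sandbox and a nanocmos pool account.
print(can_write(0o770, "localuser", "glee", "nano001", "nanocmos"))  # False
```

The 700 to 770 change rescues the scotg case because the primary groups line up; for the glee users no amount of secondary group membership helps under this model.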

I think the only possible solution is to create another local VO which can be supported properly through the middleware. A hassle but less of a hack.

Cream in Action : Consumable Resources

I am not sure if you remember this previous post, but I stated that some experimenting was required in order to get consumable resources working with the glite middleware stack.

The reason for this requirement was some licensed software (FDTD by Lumerical) that we have installed on our cluster. The documented way to 'consume' a license is to qsub directly to the batch system and pass #PBS -l software=FDTD. Not much good when you have an lcg-CE in front of it! After some further investigation it appeared that the only way to get this information through the lcg-CE would be to 'patch' the job manager, so that it added this into the generated PBS script based on RSL that could be sent to it. Unfortunately, from what I could see, the RSL schema did not have anything that could be used to fit this software attribute out of the box, and patching the job manager was not ideal going forward.

This looked to only leave the option of creating a specific queue for the software and only allowing members of the new VO to run in this queue. However, it finally struck me to look at the capabilities of cream. With the help of Massimo Sgaravatto and David Rebatto I was able to pass this batch system requirement through the wms, cream and finally end up on the batch system correctly with very little customisation.

In summary:

- set in your JDL (the one used for the glite-ce-job-submit command):
cerequirements = "software==\"FDTD\"";
- Create in the CREAM CE node the file:
/opt/glite/bin/pbs_local_submit_attributes.sh
which has to properly manage the added attribute ("software" in your
case). E.g. for this specific use case it could be something like:


#!/bin/sh
if [ "$software" == "FDTD" ]; then
echo "#PBS -l software=FDTD"
fi


So any special CE requirements can be handled by adding them into the pbs_local_submit_attributes.sh file. Cream also has similar capabilities for other batch systems.
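The fiddly part on the submission side is the escaped quoting in the JDL. A tiny sketch of both ends of the chain follows: the JDL line that carries the attribute, and the PBS directive the hook script above turns it into. The function names are mine, only a single attribute/value pair is handled, and values containing quotes are not catered for.

```python
def ce_requirements(name, value):
    """Build the JDL line, with the inner quotes backslash-escaped."""
    return f'cerequirements = "{name}==\\"{value}\\"";'


def pbs_directive(name, value):
    """The directive the hook script echoes for a matching attribute."""
    return f"#PBS -l {name}={value}"


print(ce_requirements("software", "FDTD"))
print(pbs_directive("software", "FDTD"))
```

The first line printed matches the JDL entry shown above, and the second matches what the shell hook echoes into the generated submit script.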

As for WMS submission (well, when the ICE component worked, if only for a brief time...), the CeRequirements attribute in the JDL sent to CREAM is supposed to be filled in by the WMS. This value should basically take into account what is specified in the Requirements attribute of the JDL and the value specified as CeForwardParameters in the WMS configuration file.

For example, if in your JDL you have:

Requirements= "other.GlueHostMainMemoryRAMSize > 100 && other.GlueCEImplementationName==\"CREAM\"";

and in the conf file of the WMS there is:

CeForwardParameters = {"GlueHostMainMemoryVirtualSize","GlueHostMainMemoryRAMSize","GlueCEPolicyMaxCPUTime"};

The JDL sent by ICE to CREAM should be:

CeRequirements= "other.GlueHostMainMemoryRAMSize > 100";
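In other words, the WMS is meant to keep only the parts of Requirements that mention a forwarded attribute. A rough sketch of that intended filtering is below. It is a simplification of mine, not the WMS logic: it splits naively on "&&" and ignores parentheses and "||", which real JDL expressions can contain.

```python
import re


def forwarded_requirements(requirements, forward_parameters):
    """Keep the &&-separated clauses that only use forwarded attributes."""
    kept = []
    for clause in requirements.split("&&"):
        clause = clause.strip()
        attributes = re.findall(r"other\.(\w+)", clause)
        if attributes and all(a in forward_parameters for a in attributes):
            kept.append(clause)
    return " && ".join(kept)


requirements = ('other.GlueHostMainMemoryRAMSize > 100 && '
                'other.GlueCEImplementationName=="CREAM"')
forward = {"GlueHostMainMemoryVirtualSize",
           "GlueHostMainMemoryRAMSize",
           "GlueCEPolicyMaxCPUTime"}
print(forwarded_requirements(requirements, forward))
```

The GlueCEImplementationName clause is dropped because it is not in the forward list, leaving just the memory requirement for CREAM.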

Unfortunately this doesn't work because of this bug

What you can do now, as a workaround, is specify in the JDL used in the submission to the WMS this cerequirements, e.g.:
cerequirements = "software==\"FDTD\"";
This will be forwarded as it is to CREAM.

This has now been written up in more detail on the cream page.

So to sum it up: Thumbs up for cream.

Saturday, May 09, 2009

Oh my gosh... it's users...

I had been aware of a steady increase in the number of ATLAS user jobs on the cluster in the last few months, which I was delighted to see. I decided to quantify this by querying our accounting database and the users really have arrived.

User jobs since April 1 have consumed 867k hours of wallclock and 686k hours of CPU (80% efficient), c.f. production numbers of 1981k wallclock and 1867k CPU (94% efficient). This means ATLAS users are now consuming 30% of the ATLAS walltime on the cluster.
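For anyone wanting to redo the sums, here they are spelled out. The inputs are the rounded figures quoted above (in thousands of hours), so the user efficiency comes out a shade under the quoted 80%. The share assumes that total ATLAS walltime is simply user plus production, which is my reading of the numbers rather than something the accounting database told me.

```python
# Figures quoted above, in thousands of hours.
user_wall, user_cpu = 867, 686
prod_wall, prod_cpu = 1981, 1867


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


user_efficiency = percent(user_cpu, user_wall)
prod_efficiency = percent(prod_cpu, prod_wall)
# Assumption: total ATLAS walltime = user walltime + production walltime.
user_share = percent(user_wall, user_wall + prod_wall)

print(f"user efficiency:       {user_efficiency:.1f}%")
print(f"production efficiency: {prod_efficiency:.1f}%")
print(f"user share of wall:    {user_share:.1f}%")
```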

We've had 235 unique ATLAS users since April and 46 have used more than 1000 hours of wallclock time.

Friday, May 08, 2009

ScotGrid Updates

Glasgow:
  1. Mike enabled pilot roles for both ATLAS and LHCb. He will also work on a parser which digests torque logs and gives the accounting figures in HEP-SPEC2006.
  2. Dug has been tracking down problems and discovering more about the LCG-CE's failure modes than he ever wanted to know (double job running from comms problems all down the line between ganga, wms, CE and batch system).
  3. Stuart has been optimising the cleanup of shared disk areas, which were cramping our style by sending the main nfs server into serious i/o wait for 20 hours in the day.
  4. Sam has installed a small test xrootd server - hopefully I will start running some analysis jobs against it soon to test it out.
  5. We reviewed our fairshares in advance of STEP09 to make sure each group was getting their due. We dropped most of our opportunistic VOs down to 1%.
  6. I discovered a jolly wheeze in Maui to use QOS to help bind the three different ATLAS fairshares into one QOS unit, with its own fairshare. This gives ATLAS sub-groups a fairshare advantage if the total ATLAS usage is under the total ATLAS target. Goes like this:
GROUPCFG[atlas] FSTARGET=10 MAXPROC=2000,2000 QDEF=atlas
GROUPCFG[atlasprd] FSTARGET=21 MAXPROC=2000,2000 QDEF=atlas
GROUPCFG[atlaspil] FSTARGET=11 MAXPROC=2000,2000 QDEF=atlas

QOSCFG[atlas] FSTARGET=42+
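A quick sanity check on the Maui snippet above: the QOS target should be the sum of the three group targets, otherwise the combined fairshare and the individual ones pull in different directions. The sketch below just parses the lines as quoted and adds them up; the regular expression is a convenience of mine and is not a general Maui config parser.

```python
import re

MAUI_CFG = """\
GROUPCFG[atlas] FSTARGET=10 MAXPROC=2000,2000 QDEF=atlas
GROUPCFG[atlasprd] FSTARGET=21 MAXPROC=2000,2000 QDEF=atlas
GROUPCFG[atlaspil] FSTARGET=11 MAXPROC=2000,2000 QDEF=atlas

QOSCFG[atlas] FSTARGET=42+
"""


def fairshare_targets(text, kind):
    """Return {name: FSTARGET} for lines of the given kind (e.g. GROUPCFG)."""
    pattern = re.compile(rf"^{kind}\[(\w+)\]\s+.*?FSTARGET=(\d+)", re.MULTILINE)
    return {name: int(target) for name, target in pattern.findall(text)}


groups = fairshare_targets(MAUI_CFG, "GROUPCFG")
qos = fairshare_targets(MAUI_CFG, "QOSCFG")
print(sum(groups.values()), qos["atlas"])
```

Both numbers come out as 42, so the QOS target lines up with the three ATLAS group targets.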
Durham:
  1. Running well, but we decided not to implement the ATLAS pilot role (no intention to really support ATLAS analysis - they don't have the disk) and the LHCb pilot role is optional.
  2. Did the HEP-SPEC2006 benchmark on their nodes and got 67.82 for their Xeon L5430s (2.66GHz).
ECDF:
  1. To ward off less efficient user jobs we deleted ATLAS AOD - should see them only doing production for now.
  2. APEL publishing problem fixed.
  3. Steve plans to replace the ancient gLite 3.0 CE with a spiffy new gLite 3.1 one.

Monday, April 20, 2009

Victim of our own success

I was wondering why Glasgow was not getting more activated ATLAS production jobs and eventually tracked it down to the fact that our cache disk area, ATLASPRODDISK, was almost full with only 170GB free space left. Panda was very sensibly not sending us more jobs until we had somewhere to put the outputs!

A quick whirl with dpm-updatespace and I increased PRODDISK from 2TB to 5TB, which should see us good.

I also discovered that Durham was missing some ATLAS releases, which was why they were missing out on ATLAS jobs today - installations now triggered.