Thursday, September 24, 2009

gqsub at EGEE09

Just a short note from the EGEE 09 conference. It's been very gratifying to have had so much interest in gqsub at the conference - I even had emails about it scant hours after the poster was put up (and before the official poster session!).

In response to the comments received, I've put a roadmap of planned features up on the gqsub page, which gives an idea of where it's headed.

In addition, v 1.2.0 is out, which implements automatic staging back of output. This means that in cases where there is no shared filesystem between the UI and the worker node, but there is a GridFTP server on the UI, gqsub will pull out the JDL tricks we used earlier with the Lumerical deployment. The result is the illusion of a shared filesystem - the job is submitted, and the output appears in the right places as if it had been run on a shared filesystem.
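The exact JDL that gqsub writes isn't shown here, but the trick is essentially the standard gLite output staging attributes pointed back at the UI's GridFTP server - something along these lines (the hostname and paths are made up for illustration):

Executable = "myjob.sh";
StdOutput = "myjob.out";
StdError = "myjob.err";
InputSandbox = {"myjob.sh"};
OutputSandbox = {"myjob.out", "myjob.err", "results.dat"};
OutputSandboxBaseDestURI = "gsiftp://ui.example.ac.uk/home/someuser/workdir/";

With the base destination URI set, the output sandbox is transferred straight back to the UI when the job completes, rather than sitting on the WMS waiting for a glite-wms-job-output.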

Wednesday, September 23, 2009

torque submit filters

After debating whether to add node properties for SL4 and SL5 into the job managers for both cream and the lcg-ce, I read Derek's post from RAL about using submit filters. So I thought I would have a go and see if I could tweak the node specification, keep the number of nodes requested intact for MPI, and add an additional property for the particular CE. It turns out it's easy to implement, but as usual there is some weirdness. You should be able to write your filter in whatever language you like and just point to it in torque.cfg, as shown below.

Here is a simple example in bash:
/usr/local/sbin# cat torque_submit_filter.sh

#!/bin/bash
# Append the :SL4 node property to any node request lines coming in from
# qsub, and pass everything else through untouched.
re='^#PBS -l nodes=[0-9]'
while IFS= read -r line
do
    if [[ $line =~ $re ]]
    then
        line="${line}:SL4"
    fi
    echo "$line"
done


/var/spool/pbs# cat torque.cfg

SUBMITFILTER /usr/local/sbin/torque_submit_filter.sh


This works with cream but not with the lcg-ce.

So let's try again, but this time in perl:

/usr/local/sbin# cat torque_submit_filter.pl

#!/usr/bin/perl -w

use strict;

# Read the qsub input on stdin, append the :SL5 node property to any
# node request lines, and echo everything else through unchanged.
while (<STDIN>)
{
    # By default just copy the line.
    my $line = $_;

    if ($line =~ m/^#PBS -l nodes=[0-9]/)
    {
        chomp($line);
        $line = $line . ":SL5\n";
    }

    print($line);
}


Now this works with both cream and lcg-ce! Obviously you can do whatever takes your fancy to the qsub input and make it more intelligent.
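To make the effect concrete, a directive arriving from the CE such as (node and ppn counts made up)

#PBS -l nodes=4:ppn=2

comes out of the filter as

#PBS -l nodes=4:ppn=2:SL5

so the node count requested for MPI stays intact and the job only lands on worker nodes carrying the SL5 property.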

A word of warning. We used the same queues for both CEs, which meant that SL4 and SL5 resources were indistinguishable to users unless they used OS-specific CE requirements. We ended up flooded on the SL4 queues, with lots of free slots on the SL5 queues. So in the end we have created a new set of queues for the SL4 CE. Hopefully this will be explicit enough for users to target the correct CE.

Tuesday, September 22, 2009

Fun with 5!

So, the SL5 migration was done last week, and as promised at yesterday's dteam meeting I am posting our problems so that other sites can watch out for similar issues (although some of these are deeply related to the way we do things at Glasgow).

First though, the successes:
  1. 1800 cores running SL5
  2. DPM headnode upgraded to SL5
  3. Torque server upgraded to SL5, running torque 2.3.6, maui 3.2.6p21
  4. /atlas/uk voms group supported with a separate fairshare
Now, the list of problems:

1. We introduced a new python script, gridAccounts.py, to generate pool accounts, retiring the venerable, but incomprehensible, genaccts.pl script we had before (Andy Elwell wrote that and his comment was "OK - I give up with python as I need this NOW..."; my retort was "I HATE PERL SO MUCH. IT'S A SHIT LANGUAGE.", but I had never found the time to rewrite it until now). The new script reads standard config files, so it's a lot easier to manage, understand and extend. However, all change is (a bit) dangerous and the new script initially had groups in the wrong order in yaim's users.conf, which caused the groupmapfile to be wrong. This then caused all jobs to fail - the uid/gid of the gridftp session did not match the uid/primary gid of the user, and gridftp does not like that at all.

(The reason we have to write users.conf is because we still get yaim to do a lot, although we manage all accounts through cfengine; yaim relies on this file to configure various other aspects of the system, such as grid/group mapfiles.)

2. We were trying to support the /atlas/uk VOMS group as a separate entity. This is simple in theory (!): you're looking for the following entries in voms-grid-mapfile:

"/atlas/uk/Role=NULL/Capability=NULL" .ukatlas
"/atlas/uk" .ukatlas

and this in groupmapfile:

"/atlas/uk/Role=NULL/Capability=NULL" atlasuk
"/atlas/uk" atlasuk

If we were managing these files directly, it would have been no problem. However, convincing YAIM to do this was far from easy. This is not helped by the fact that YAIM is now utterly incomprehensible in many ways (have a look at yaim/utils/users_getvogroup if you don't believe me). Finally we hit on the correct recipe, which is to have these accounts in users.conf, with a new "special" defined:

201601:ukatlas001:201040,201000:atlasuk,atlas:atlas:uk:
201602:ukatlas002:201040,201000:atlasuk,atlas:atlas:uk:
201603:ukatlas003:201040,201000:atlasuk,atlas:atlas:uk:
...

with this line added to groups.conf:

"/VO=atlas/GROUP=/atlas/uk":::uk:

Aside: Sometimes I wonder if YAIM has outgrown its usefulness. From something we could understand and tweak easily, it has become a sed|awk|cut|sort|tail black-box monster, which uses configuration files formatted for machines rather than for humans. cf. the configuration we have for our own scripts:

[someuser]
dn = /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=some user
uid = 4832
home = /clusterhome/home/someuser
group = atlas
tier25 = True
vo = gla

And trying to do grid configuration manipulations in a language which doesn't have dictionaries is just ridiculous.
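Just to show what I mean, here is a short sketch of the general approach (this is not the real gridAccounts.py - the config filename and the gid lookup table are invented) which turns stanzas like the one above into users.conf lines:

#!/usr/bin/env python
# Sketch only: read per-user stanzas of the kind shown above and print
# yaim users.conf style lines (UID:LOGIN:GIDs:GROUPs:VO:FLAG:).
# The config filename and the group -> gid mapping are made up.
import ConfigParser

GIDS = {'atlas': '201000', 'gla': '201500'}

cfg = ConfigParser.ConfigParser()
cfg.read('accounts.conf')

for login in cfg.sections():
    u = dict(cfg.items(login))          # a real dictionary, at last
    print '%s:%s:%s:%s:%s::' % (u['uid'], login, GIDS[u['group']],
                                u['group'], u['vo'])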

Maybe we'll need to wean ourselves off it eventually?

3. Information publishing on svr018 was broken after the upgrade. There was a cryptic reference in Sam's notes to a required 'dpminfo' user. Adding this user did seem to make things work, though it's not at all clear why. Hopefully Sam will enlighten us later. In passing, note that the resource BDII on the service nodes seems to be 'protected' now, so attempts to reach it from 'outside' fail. This is new behaviour and it cost us some time in debugging.
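For reference, the sort of query that now only works from the node itself is the usual resource BDII lookup (standard port and base DN):

ldapsearch -x -h svr018 -p 2170 -b mds-vo-name=resource,o=grid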

4. Terrible trouble was caused by upgrading the torque server to SL5. Using the SteveT build of the torque server (http://skoji.cern.ch/sa1/centos5-torque/) seemed to cause grave problems with moms crashing on the worker nodes. Downgrading the moms to torque 2.3.0 didn't work, as the job files (/var/spool/pbs/mom_priv/jobs) seemed to be in incompatible formats and led to crashing moms plus a very confused torque server. Cleaning out all jobs didn't seem to work either. The final solution was to rebuild torque 2.3.6 on SL5 - this gave a consistent and compatible server/mom pairing.

A small side effect, though, was that the rebuilt maui had a different 'secret' in it, so I have had to hack the info provider on the SL4 CEs to use the --keyfile= argument in the maui client commands. (That's such a stupid 'feature'.)

5. Once we were out of downtime, random transfers to the DPM were failing. Eventually we tracked this down to the reduction in the number of pool accounts for atlasprd. There was, of course, no synchronisation between the passwd file and the /etc/grid-security/gridmapdir pool account list, so gridftp was throwing a "530 Login incorrect. : No local mapping". We realised that
    1. /etc/passwd should be handled better on nodes which need to map pool accounts.
    2. For the moment never reduce the number of accounts!
    3. N.B. on the CEs the gridmapdir is shared, so maintenance probably needs to be delegated.
    4. If we remove a pool account mapping then we also have to remove the link from any DNs to this mapping (look for DN filenames with only 1 hard link - see the example below).
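For that last point, something like this (the gridmapdir path is the standard one; the DN entries are the URL-encoded filenames starting with %2f) turns up the stale DN mappings:

find /etc/grid-security/gridmapdir -type f -links 1 -name '%2f*'
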
OK, that's it. We got there, though not without some anxious moments!

wms myproxy renewal wobbles

During our recent reconfiguration to SL5 we also rewrote our user account generation script from perl to python. Well, Graeme did, actually. So now it's very easy to understand and extend. A consequence of this was that we created a new directory in /home for each user, to keep things neat and tidy. This necessitated the recreation of all home directories across the cluster. A task fraught with danger.

However, we managed it, except that I blew away the glite user from the WMS in the process, along with the .certs and .globus certificates required to run the WMS. After replacing them everything worked fine, or so I thought. Recently we received reports that myproxy renewal was not working, and as it transpired, /home/glite/.certs/hostkey.pem and /home/glite/.certs/hostcert.pem must be owned by the glite user and not root for the renewal process to work! One to watch!
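For anyone hitting the same thing, the fix is just to hand the copies back to the glite user - the permissions below are the usual cert/key ones, and the glite group name is assumed:

chown glite:glite /home/glite/.certs/hostcert.pem /home/glite/.certs/hostkey.pem
chmod 644 /home/glite/.certs/hostcert.pem
chmod 400 /home/glite/.certs/hostkey.pem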

Friday, September 18, 2009

SL5 migration and CPU deployment

As of last week we have now migrated ScotGrid-Glasgow to SL5. This meant the worker nodes, the DPMs and the batch system all becoming SL5 in one big flurry of activity. We started on Monday morning and came out of downtime on Wednesday evening with SAM tests passing. Since then we have been mopping up the remaining issues that cropped up along the way, but more on that later.

So as of 16th September we have 1800 job slots running SL5 out of a total of 1912. The remaining 112 job slots have been held back as SL4 until December, to allow those VOs with unpatched software kits, or that are simply not ready to move to SL5, to run jobs.

Similar to RAL and other sites, we have gone with separate CEs for SL4 and SL5 to allow for those VOs that cannot co-exist on the same CE. These CEs will very shortly be submitting to the same batch system, using the node requirements :SL4 and :SL5 set from the CE as described by SouthGrid for their SL3 to SL4 migration. It does necessitate some job manager tweaking, but it works. I may try and switch this to a submit filter when I get the chance, as job manager tweaking is never very robust.

Wednesday, September 09, 2009

Canna I no just use qsub?

Ah, the endless refrain.

Anytime a user with cluster experience is introduced to the gLite submission mechanism, some question of that order (although not always with a Scottish accent) is inevitable.

Pulling out my Human-Computer Interaction hat, I first came to the conclusion that, despite the occasional hints to the contrary, users are indeed Human. Hot on the heels of this realisation, a little bit of analysis of the gLite job submission and control tools indicated that, whilst very powerful, they work in a very different fashion to qsub.

It's not clear that qsub is in any sense a better interface than the native command line tools, but it is clear that it is different.

The general idea was to resolve this difference by providing a different interface to grid job submission, one more familiar to users with existing experience of cluster computing. Whether it's going to be a better approach for a user without that experience is not clear; but it will make it simpler for users to use the Grid as an offload for a local cluster (i.e. use the cluster and, when it's full, send the jobs to the Grid).

It turns out that the POSIX definition of qsub isn't too far away, conceptually, from a Grid system, so all that was needed to act as an interface translation layer was a relatively straightforward python script.

Rather than relay all the gory details here, let me direct you to the gqsub download page, with the manual.

For users on svr020, it's installed in the default path, so you can just use it. Note that to properly mirror the expected behaviour you probably want to make sure you run from within $CLUSTER_SHARED.
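In other words, submission looks just like it does on a local batch system - something like this (the job directory is hypothetical; see the manual for the options gqsub actually supports):

cd $CLUSTER_SHARED/mywork
gqsub myjob.sh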

But to answer the original question: "Aye!"

Wednesday, September 02, 2009

who changed the job wrapper?

It was a long night yesterday as Graeme and I tried to fix our failing ops CE tests. It started on Monday night when SAM tests mysteriously started failing across all CEs at Glasgow, and then at Durham. The jobs appeared to run but just stayed in the running state until the WMS presumably killed them, and so they eventually failed the ops tests.

After investigation we noticed the 'cannot download .BrokerInfo from' error. A quick look on a node showed that the .BrokerInfo file in /tmp was owned by another user rather than ops. A strace -f -p NNNN on the globus-url-copy process showed that the ops job was getting a permission denied when trying to create/copy the file. A look at past CE-sft-broker tests showed a very clear difference - in fact, there was a missing directory!

-rw-r--r-- 1 sgmops001 opssgm 3085 Aug 31 05:06 /tmp/https_3a_2f_2fwms208.cern.ch_3a9000_2fElSbIsNqd8SN69eCXPN1JA/.BrokerInfo


-rw-r--r-- 1 sgmops001 opssgm 2312 Sep 1 22:34 /tmp/.BrokerInfo

Removing this file allowed the ops test to run, but why it was happening was still a mystery. A workaround we have deployed is to create an additional working directory in cp_1.sh, i.e.

# Workaround for gLite WMS jobs, which don't cd into EDG_WL_SCRATCH...
echo In cp_1.sh
echo Making temporary work directory
templ=$TMPDIR/glite_run_XXXXXXXX
temp=$(mktemp -d $templ)
echo Changing work directory to $temp
cd $temp

In the end we had to remove every blocking .BrokerInfo file from /tmp across the cluster, and ops jobs started passing again. Further digging showed that the job wrapper had changed somewhere along the line. The old job wrapper had code like this in it.

#if [ ${__job_type} -eq 0 -o ${__job_type} -eq 3 ]; then # normal or interactive
 newdir="${__jobid_to_filename}"
 mkdir ${newdir}
 cd ${newdir}
#elif [ ${__job_type} -eq 1 -o ${__job_type} -eq 2 ]; then # MPI (LSF or PBS)
#fi

This has now been removed and could be causing issues for other sites. Torque and SGE have functionality to ring-fence every job in its own directory; perhaps we would have been safer using that, but running jobs from /tmp worked for three years. Not any more, it would seem.
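For the record, the torque side of that ring-fencing is the $tmpdir directive in the mom config, which gives each job its own scratch directory under the given path, points TMPDIR at it and cleans it up at job exit. Something like this, added to /var/spool/pbs/mom_priv/config (the /local/scratch path is just an example):

$tmpdir /local/scratch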