Tuesday, September 22, 2009

Fun with 5!

So, the SL5 migration was done last week, and as promised at yesterday's dteam meeting I am posting our problems so that other sites can watch out for similar issues (although some of these are deeply related to the way we do things at Glasgow).

First though, the successes:
  1. 1800 cores running SL5
  2. DPM headnode upgraded to SL5
  3. Torque server upgraded to SL5, running torque 2.3.6, maui 3.2.6p21
  4. /atlas/uk voms group supported with a separate fairshare
Now, the list of problems:

1. We introduced a new python script, gridAccounts.py, to generate pool accounts, retiring the venerable, but incomprehensible, genaccts.pl script we had before (Andy Elwell wrote that and his comment was "OK - I give up with python as I need this NOW..."; my retort was "I HATE PERL SO MUCH. IT'S A SHIT LANGUAGE.", but I had never found the time to rewrite it until now). The new script reads standard config files, so it's a lot easier to manage, understand and extend. However, all change is (a bit) dangerous and the new script initially had groups in the wrong order in yaim's users.conf, which caused the groupmapfile to be wrong. This then caused all jobs to fail - the uid/gid of the gridftp session did not match the uid/primary gid of the user, and gridftp does not like that at all.

(The reason we have to write users.conf is because we still get yaim to do a lot, although we manage all accounts through cfengine; yaim relies on this file to configure various other aspects of the system, such as grid/group mapfiles.)
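The ordering bug is easiest to see with a sketch. This is a minimal, hypothetical stand-in for gridAccounts.py (not the real script): it just formats users.conf lines, and the whole point is that gids[0]/groups[0] must be the user's primary gid/group, since that's what the groupmapfile (and hence gridftp's primary-gid check) ends up built from:

```python
def users_conf_line(uid, login, gids, groups, vo, flag=""):
    """Emit one users.conf line; gids[0]/groups[0] MUST be the primary
    gid/group - our first cut had them in the wrong order."""
    return "%s:%s:%s:%s:%s:%s:" % (
        uid, login,
        ",".join(str(g) for g in gids),
        ",".join(groups),
        vo, flag)

# Correct ordering: primary group first.
print(users_conf_line(201601, "ukatlas001", [201040, 201000],
                      ["atlasuk", "atlas"], "atlas", "uk"))
# -> 201601:ukatlas001:201040,201000:atlasuk,atlas:atlas:uk:
```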

2. We were trying to support the /atlas/uk VOMS group as a separate entity. This is simple in theory (!): you're looking for the following entries in voms-grid-mapfile:

"/atlas/uk/Role=NULL/Capability=NULL" .ukatlas
"/atlas/uk" .ukatlas

and this in groupmapfile:

"/atlas/uk/Role=NULL/Capability=NULL" atlasuk
"/atlas/uk" atlasuk

If we were managing these files directly, it would have been no problem. However, convincing YAIM to do this was far from easy. This is not helped by the fact that YAIM is now utterly incomprehensible in many ways (have a look at yaim/utils/users_getvogroup if you don't believe me). Finally we hit on the correct recipe, which is to have these accounts in users.conf, with a new "special" defined:

201601:ukatlas001:201040,201000:atlasuk,atlas:atlas:uk:
201602:ukatlas002:201040,201000:atlasuk,atlas:atlas:uk:
201603:ukatlas003:201040,201000:atlasuk,atlas:atlas:uk:
...

with this line added to groups.conf:

"/VO=atlas/GROUP=/atlas/uk":::uk:
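For eyeballing what YAIM actually generated, a throwaway helper (hypothetical, nothing to do with YAIM's internals) that spells out the pair of entries each FQAN should end up with:

```python
def mapfile_entries(fqan, target):
    """The two lines expected for one VOMS FQAN.
    target is '.ukatlas' for voms-grid-mapfile, 'atlasuk' for groupmapfile."""
    return ['"%s/Role=NULL/Capability=NULL" %s' % (fqan, target),
            '"%s" %s' % (fqan, target)]

for line in mapfile_entries("/atlas/uk", "atlasuk"):
    print(line)
# -> "/atlas/uk/Role=NULL/Capability=NULL" atlasuk
# -> "/atlas/uk" atlasuk
```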

Aside: Sometimes I wonder if YAIM has outgrown its usefulness. From something we could understand and tweak easily, it has become a sed|awk|cut|sort|tail black-box monster which uses a machine-oriented, colon-separated format for its configuration files. Compare the configuration we have for our own scripts:

[someuser]
dn = /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=some user
uid = 4832
home = /clusterhome/home/someuser
group = atlas
tier25 = True
vo = gla

And trying to do grid configuration manipulations in a language which doesn't have dictionaries is just ridiculous.
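To make the point concrete: that ini-style format above drops straight into Python's standard ConfigParser (configparser in Python 3) and out comes a dictionary per user. A minimal sketch using the stanza above:

```python
import configparser

# The same stanza as above, inlined for the sketch.
CONF = """
[someuser]
dn = /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=some user
uid = 4832
home = /clusterhome/home/someuser
group = atlas
tier25 = True
vo = gla
"""

cfg = configparser.ConfigParser()
cfg.read_string(CONF)
user = dict(cfg["someuser"])        # a plain dictionary per user
print(user["uid"], user["group"])   # -> 4832 atlas
```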

Maybe we'll need to wean ourselves off it eventually?

3. Information publishing on svr018 was broken after the upgrade. There was a cryptic reference to a required 'dpminfo' user which Sam had made in his notes. Adding this user did seem to make things work, though it's not at all clear why. Hopefully Sam will enlighten us later. In passing, note that the resource BDII on the service nodes seems to be 'protected' now, so attempts to reach it from 'outside' fail. This is new behaviour and lost us some time in debugging.

4. Terrible trouble was caused by upgrading the torque server to SL5. Using the SteveT build of the torque server (http://skoji.cern.ch/sa1/centos5-torque/) seemed to cause grave problems with moms crashing on the worker nodes. Downgrading the moms to torque 2.3.0 didn't work, as the job files (/var/spool/pbs/mom_priv/jobs) seemed to be in incompatible formats, which led to crashing moms plus a very confused torque server. Cleaning out all jobs didn't work either. The final solution was to rebuild torque 2.3.6 on SL5 - this gave a consistent and compatible server/mom pairing.

A small side effect, though, was that the rebuilt maui had a different 'secret' in it, so I have had to hack the info provider on the SL4 CEs to use the --keyfile= argument in the maui client commands. (That's such a stupid 'feature'.)

5. Once we were out of downtime, random transfers to the DPM were failing. Eventually we tracked it down to the reduction in the number of pool accounts for atlasprd. There was no sync between the passwd file and the /etc/grid-security/gridmapdir pool account list, of course, so gridftp was throwing a "530 Login incorrect. : No local mapping". We realised that
    1. /etc/passwd should be handled better on nodes which need to map pool accounts.
    2. For the moment never reduce the number of accounts!
    3. N.B. on the CEs the gridmapdir is shared, so maintenance probably needs to be delegated
    4. If we remove a pool account mapping, then we have to remove the links from any DNs to this mapping as well (look for DN filenames with only 1 hard link).
OK, that's it. We got there, though not without some anxious moments!

2 comments:

Elwell said...

hey, the Perl script was knocked up in an evening - At the time my python-fu was weak, but I've since moved over to the dark side. Course, for real regexps Perl is still better (handbags at 50 paces) :-)

Graeme Stewart said...

Its python replacement was also done in a day. The problem was not so much the original script but all of the hacked-on extensions to support

yet:another:colon:separated:attribute

then all of the fields were transferred into arrays with split(/:/, $_) and referenced by number.

So, it was s/perl/python/g :-)

There's more to life than regexps, you know!