Monday, February 21, 2011

Covering up problems with CREAM

For some days now, ScotGrid Glasgow has been operating with only CREAM CEs, having turned off our final lcg-CE around the 14th. I'll let Mark cover the details in a later post, but I wanted to briefly mention one of the minor configuration details that initially caused us some problems.

The gridmapdir (usually /etc/grid-security/gridmapdir) is a fairly integral part of the pool account mapping system in LCG/gLite services. It contains one (empty) file for each pool account, plus a hard link to the relevant account file for each DN (+ VOMS role) mapped to it. Basically, it's a cheap way to ensure that you never map multiple DNs to the same account, as you can always count the number of hard links to an inode: an account file with a link count of 1 is free, and one with a link count of 2 is leased.
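To make that concrete, here's a rough sketch of the lease mechanism (the pool account name and DN filename are invented for illustration; real gridmapdir entries are URL-encoded DNs, and the exact encoding varies):

    # One empty file per pool account, created when the accounts are set up:
    touch /etc/grid-security/gridmapdir/dteam001

    # "Leasing" the account to a user means hard-linking a DN-encoded
    # filename to that account file:
    ln /etc/grid-security/gridmapdir/dteam001 \
       '/etc/grid-security/gridmapdir/%2fdc%3dorg%2fdc%3dexample%2fcn%3dsome%20user'

    # The link count on the inode tells you whether the account is taken:
    stat -c '%h' /etc/grid-security/gridmapdir/dteam001   # 2 = leased, 1 = free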

We share our gridmapdir over NFS with all of our CEs, to ensure that any incoming job from a given user is mapped consistently. Unfortunately, this led to our minor configuration gaffe (which I've just fixed).
The lcg-CE, you see, is configured to set the ownership and permissions on the gridmapdir to 0755 root:root. That's fine for the lcg-CE, since it does strange things like running its services with root permissions, and it prevents anything else from messing up the mappings.
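For illustration, the shared setup and the lcg-CE's idea of "correct" look roughly like this (hostnames and export options invented; your NFS configuration will differ):

    # /etc/exports on the NFS server -- note no_root_squash, since the
    # lcg-CE writes to the gridmapdir as root:
    /etc/grid-security/gridmapdir  ce*.example.ac.uk(rw,sync,no_root_squash)

    # What the lcg-CE leaves you with (0755 root:root):
    $ ls -ld /etc/grid-security/gridmapdir
    drwxr-xr-x 2 root root 4096 Feb 14 12:00 /etc/grid-security/gridmapdir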

CREAM CEs (using glexec) need their gridmapdir to be 0775 root:glexec, a change we hadn't made when we installed them (and which YAIM probably couldn't have made for us). This meant that, for the whole time the CREAM CEs had been installed, they had never been able to create a new mapping in the gridmapdir, as they attempt to do so as members of the glexec group.
We never really noticed the problem while we still had busy lcg-CEs, as an lcg-CE would almost always have received a job from the same user earlier and already performed the mapping.
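The fix, once you know it's needed, is just two commands on whichever host actually owns the directory (assuming a glexec group already exists there):

    chgrp glexec /etc/grid-security/gridmapdir
    chmod 0775 /etc/grid-security/gridmapdir

    # i.e. root:glexec with group write:
    $ ls -ld /etc/grid-security/gridmapdir
    drwxrwxr-x 2 root glexec 4096 Feb 21 10:30 /etc/grid-security/gridmapdir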

Now that we don't have any lcg-CEs, however, this started to cause some odd problems when we enabled new VOs: the configuration looked perfectly fine for the VO itself, but jobs would bounce off the CREAM CEs with "Failed to get the local userid with glexec" errors.
Obviously, this was trivially solved once we'd worked out what the issue was (by setting the gridmapdir's group ownership to glexec and adding group write permission, as above), but identifying it was a little tricky, as the default logging level for LCMAPS gives very few clues about what problem it's having.
Turning the debug level up to 3 (in /opt/glite/etc/glexec.conf) was enough to get it to log errors from gridmapdir_newlease(), and then, after some poking (and manually creating DN links to see what happened), the problem became clear. A sketch of both diagnostic steps follows.
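The debug setting is a one-line change. The fragment below is how I remember the relevant key in our glexec.conf; treat the exact key name as an assumption and check the documentation for your glexec version:

    # /opt/glite/etc/glexec.conf (fragment)
    [glexec]
    lcmaps_debug_level = 3

And the "manual creation of DN links" amounts to attempting the lease operation by hand as a non-root member of the glexec group (pool account and DN filename invented for illustration again):

    cd /etc/grid-security/gridmapdir
    sg glexec -c "ln dteam001 '%2fdc%3dorg%2fdc%3dexample%2fcn%3dtest'"
    # Fails with "Permission denied" while the directory is 0755 root:root;
    # succeeds once it is 0775 root:glexec.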

So, this is a cautionary tale about moving from a mixed CE environment to a monoculture (ignoring Stuart's ARC installation): sometimes a misconfiguration in one service can be hidden by the correct functioning of the service you're just about to remove.

1 comment:

Graeme Stewart said...

I had a moment of panic in the ATLAS Ops meeting today, thinking that our squid had fallen over. However, it was only that the ATLAS CE tests so far only run against lcg-CEs.

I spoke to Ale about it and a move to CREAM CE monitoring is foreseen when the experiment monitoring frameworks move from SAM to NAGIOS. Should happen in a few weeks.
http://dashb-sam-atlas.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=400&sites=UKI-SCOTGRID-GLASGOW&algoId=121&timeRange=last48

Note that we go red while the node still exists but is in maintenance; then we go white once it's decommissioned, as there is nothing left to test.