Monday, February 21, 2011

The CE is dead. Long live the CE. Nos paenitet incommodo

As part of the ongoing developments to the ScotGrid cluster at Glasgow, we have decommissioned our final LCG-CE, which resided on SVR021. Removing this CE allows us to concentrate support and development on two CE platforms: CREAM and ARC. We are planning a series of tests on the three CREAM CEs we have deployed at Glasgow, to gain a better understanding of their maximum load of running jobs and of how to tune them to get the most efficiency out of this service.

Additionally, we will be testing our availability metrics over the next month, as the LCG-CE was one of the cornerstones of Steve Lloyd's tests of our overall availability. This will now be monitored primarily through our SRM availability.

We decommissioned the LCG-CE for three reasons: we would have been removing it in the near future anyway, none of the big VOs has issues submitting to CREAM CEs, and it simplifies our internal support requirements.

The new servers running CREAM are svr008, svr014 and svr026.

Thank you LCG-CE and goodnight.

Covering up problems with CREAM

For some days now, ScotGrid Glasgow has been operating with only CREAM CEs, having turned our final lcg-CE off around the 14th. I'll let Mark cover the details of this in his later post, but I wanted to briefly mention one of the minor configuration details that caused some problems for us initially.

The gridmapdir (usually in /etc/grid-security/gridmapdir) is a somewhat integral part of the pool account mapping system in LCG/gLite services. It contains one (empty) file for each pool account, plus a hard-link to that file for each DN(+VOMS Role) mapped to the account. Basically, it's a cheap way to ensure that you don't get multiple DNs mapped to the same account (as you can always count the number of hard-links to an inode).
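To make the hard-link trick concrete, here is a minimal Python sketch of the idea. It is not the LCMAPS implementation: the function name, the pool account names and the way the DN is encoded into a file name are all assumptions for illustration, and it ignores the locking and race handling a real service needs.

```python
import os
import tempfile
from urllib.parse import quote


def lease_pool_account(gridmapdir, dn, pool_accounts):
    """Map dn to a pool account via hard-links and return the account name."""
    # Hypothetical naming scheme: URL-encode the DN so it is a valid file name.
    dn_link = os.path.join(gridmapdir, quote(dn, safe="").lower())

    if os.path.exists(dn_link):
        # Existing mapping: the DN link shares an inode with its pool file.
        inode = os.stat(dn_link).st_ino
        for account in pool_accounts:
            if os.stat(os.path.join(gridmapdir, account)).st_ino == inode:
                return account
        raise RuntimeError("DN link does not match any pool account")

    # New mapping: a pool file with a link count of 1 has not been leased yet.
    for account in pool_accounts:
        pool_file = os.path.join(gridmapdir, account)
        if os.stat(pool_file).st_nlink == 1:
            os.link(pool_file, dn_link)  # link count becomes 2
            return account
    raise RuntimeError("no free pool accounts")


def demo():
    """Lease accounts for two made-up DNs in a throwaway directory."""
    with tempfile.TemporaryDirectory() as gridmapdir:
        pool = ["atlas001", "atlas002"]  # made-up pool account names
        for account in pool:
            open(os.path.join(gridmapdir, account), "w").close()
        first = lease_pool_account(gridmapdir, "/C=UK/O=eScience/CN=alice", pool)
        second = lease_pool_account(gridmapdir, "/C=UK/O=eScience/CN=bob", pool)
        again = lease_pool_account(gridmapdir, "/C=UK/O=eScience/CN=alice", pool)
        links = os.stat(os.path.join(gridmapdir, first)).st_nlink
        return first, second, again, links


if __name__ == "__main__":
    print(demo())
```

The second DN skips the first account because its link count is already 2, and a repeat request from the first DN finds its existing link rather than taking a new account.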

We share our gridmapdir, over NFS, with all of our CEs, to ensure that any incoming job from a given user is consistently mapped. Unfortunately, this led to our minor configuration gaffe (which I just fixed).
The lcg-CE, you see, is configured to set the ownership and permissions on the gridmapdir to 0755 root:root. This is fine for it, since lcg-CEs do strange things like running their services with root permissions, and it prevents anything else from messing up the mappings.

CREAM CEs (using glexec) need to have their gridmapdir as 0775 root:glexec, a change which we hadn't made when we installed them (and which YAIM probably couldn't have done for us). This meant that, for as long as the CREAM CEs had been installed, they had never been able to create a new mapping in the gridmapdir, as they try to do that as members of the glexec group.
We never really noticed this problem while we had lcg-CEs which were busy, as the lcg-CE would almost always have also received jobs from the user previously and already performed the mapping.
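The difference between the two settings can be sketched with the standard Unix permission check for creating an entry in a directory. This is a simplified model (no ACLs or capabilities), and the uid and gid numbers below are made up for the example.

```python
import stat


def can_create_entries(mode, dir_uid, dir_gid, uid, gids):
    """Return True if a process with uid/gids may create entries in a directory.

    Creating a file or hard-link needs write and search (execute) permission
    on the directory, taken from the first class that matches the process.
    """
    if uid == 0:
        return True  # root bypasses the permission bits
    if uid == dir_uid:
        needed = stat.S_IWUSR | stat.S_IXUSR
    elif dir_gid in gids:
        needed = stat.S_IWGRP | stat.S_IXGRP
    else:
        needed = stat.S_IWOTH | stat.S_IXOTH
    return (mode & needed) == needed


if __name__ == "__main__":
    GLEXEC_GID = 500   # made-up gid for the glexec group
    SERVICE_UID = 501  # made-up non-root uid in the glexec group
    groups = {GLEXEC_GID}
    # What the lcg-CE sets up: 0755 root:root
    print(can_create_entries(0o755, 0, 0, SERVICE_UID, groups))
    # What CREAM with glexec needs: 0775 root:glexec
    print(can_create_entries(0o775, 0, GLEXEC_GID, SERVICE_UID, groups))
```

With 0755 root:root the glexec group member falls into the "other" class, which has no write bit, so no new link can be created. A service running as root never hits the check at all, which is why the lcg-CE was happy.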

Now that we don't have an lcg-CE, however, it started to cause some odd problems when we enabled new VOs, as the configuration seemed perfectly fine for the VO itself, but jobs would bounce off the CREAM CEs with "Failed to get the local userid with glexec" errors.
Obviously, this was trivially solved once we worked out what the issue was (by setting the gridmapdir's group ownership to glexec and adding group write permission, g+w), but identifying it was a little tricky, as the default logging level for LCMAPS doesn't give many clues as to what problem it's having.
Turning the debug level up to 3 (in /opt/glite/etc/glexec.conf ) was sufficient to get it to log errors with gridmapdir_newlease(), however, and then, after some poking (and manual creation of DN links to see what happened), the problem became clear.
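For completeness, the fix itself amounts to a group change plus g+w. A minimal Python sketch of that operation follows; the helper name is made up, and the demo runs against a temporary directory with the current user's own group rather than the real gridmapdir.

```python
import os
import stat
import tempfile


def make_group_writable(path, gid):
    """Set the group owner of path to gid, add g+w, and return the new mode."""
    os.chown(path, -1, gid)  # -1 leaves the owning user unchanged
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IWGRP)
    return stat.S_IMODE(os.stat(path).st_mode)


def demo():
    """Apply the change to a scratch directory that starts out as 0755."""
    path = tempfile.mkdtemp()
    try:
        os.chmod(path, 0o755)
        return make_group_writable(path, os.getgid())
    finally:
        os.rmdir(path)


if __name__ == "__main__":
    print(oct(demo()))
```

On a real CE this would be run as root against the gridmapdir, with the gid looked up from the glexec group (for example via `grp.getgrnam("glexec").gr_gid`).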

So, this is a cautionary tale about moving from a mixed CE environment to a monoculture (ignoring Stuart's ARC installation) - sometimes a misconfiguration in one service can be hidden by the correct functioning of the service you're just about to remove.