Wednesday, September 26, 2007

RB: "Rather Better"

One thing which went to pot during the upgrade was the way that the higher-UID pool accounts cascaded through to the RB. This, unfortunately, meant that any jobs which were running on svr023 were lost (there would have been very few, in fact, which is why we spent more effort on the UI).

However, in our attempts to get the RB back it became clear that the edg-wl-ftpd service (yet another hacked version of GT2 gridftp) cannot handle UIDs that need more than 16 bits, i.e. anything above 65535. This screwed things up for us, as all our new UIDs are in the range 200000+.
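
For anyone wanting to see the arithmetic, here is a minimal Python sketch of the 16-bit limit. The function names are made up for illustration, and the truncation is only what would happen if a UID were stored in an unsigned 16-bit field; I have not checked how edg-wl-ftpd actually fails.

```python
# Illustrative only: function names are hypothetical, and we do not
# know the exact failure mode inside edg-wl-ftpd.

MAX_UID_16BIT = 2**16 - 1  # 65535, largest value an unsigned 16-bit field holds


def fits_in_16_bits(uid):
    """Return True if the UID can be stored in an unsigned 16-bit field."""
    return 0 <= uid <= MAX_UID_16BIT


def truncate_to_16_bits(uid):
    """Show what a UID becomes IF it is squeezed into 16 bits (assumption)."""
    return uid & 0xFFFF


if __name__ == "__main__":
    for uid in (500, 65535, 200000):
        print(uid, fits_in_16_bits(uid), truncate_to_16_bits(uid))
```

A UID of 200000 fails the check, which is why every one of the new pool accounts was affected.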

In the end I had to re-hack the perl passwd/group/shadow/users.conf generator, lowering all the UIDs specially for the RB. In fact this was not quite as awful as one might think, as the RB supports only a subset of the VOs we run jobs for on the main site. I also scripted up a "generator" for the RB's site-info.def that strips down the VOS variable to those we support for job submission. In addition, communication between the RB and the UI or the CE is of course mediated by certificate, so having a different pool account or UID on the RB is not a problem.
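
The real generator is perl and is not reproduced here; what follows is a rough Python sketch of the two ideas, under stated assumptions. The function names, the fixed offset of 190000, the sample account and the VO list are all invented for illustration, and the sketch assumes site-info.def holds a line of the form VOS="vo1 vo2 ...".

```python
import re

# Hypothetical sketch: not the actual perl generator. The offset and
# names below are assumptions chosen only to make the example concrete.

MAX_UID_16BIT = 65535


def remap_passwd_line(line, offset):
    """Lower the UID (third field) of one passwd line by a fixed offset."""
    fields = line.rstrip("\n").split(":")
    new_uid = int(fields[2]) - offset
    if not 0 < new_uid <= MAX_UID_16BIT:
        raise ValueError("remapped UID %d does not fit in 16 bits" % new_uid)
    fields[2] = str(new_uid)
    return ":".join(fields)


def filter_vos(site_info_text, supported):
    """Keep only the supported VOs in the VOS="..." line, preserving order."""
    def strip_unsupported(match):
        kept = [vo for vo in match.group(2).split() if vo in supported]
        return match.group(1) + '"' + " ".join(kept) + '"'

    return re.sub(r'^(VOS=)"([^"]*)"', strip_unsupported,
                  site_info_text, flags=re.MULTILINE)


if __name__ == "__main__":
    print(remap_passwd_line(
        "dteam001:x:200101:2001::/home/dteam001:/bin/bash", 190000))
    print(filter_vos('VOS="atlas cms dteam"\n', {"atlas", "dteam"}))
```

The range check matters more than the offset itself: any mapping will do, provided every resulting UID lands below 65536 and no two accounts collide.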

There was a supporting tweak to cfengine to take passwd-rb (etc.) as the source passwd file for the RB.

Then the RB was blown away and rebuilt. It seems to have done it rather a lot of good, as now Steve Lloyd's dteam test jobs run properly (see his RB test page).
