Thursday, September 20, 2007

Glasgow Upgraded to SL4

The upgrade is done! We started passing ops SAM tests at about 2230 last night, and I brought us out of downtime at 2300. That was 12 hours of total downtime. In addition the queues were closed from about 1600 the day before, so that meant we were unavailable for 31 hours. In the grand scheme of things I think that's "not bad" for such a major upgrade.

Preparations for the upgrade were rushed, but certainly thorough enough for us to have a fair degree of confidence in the process. By Tuesday night I was able to reboot, rebuild and run jobs through a worker node successfully. Andrew was close to having the new pool account generator done, even if he had wimped out and used perl.

The plan we had decided on was to upgrade the worker nodes and bring the site out of downtime ASAP, then work on the UI and other less central services.

Here's my synopsis of what went wrong, or didn't behave quite as we expected:
  1. We initially tried to reboot the worker nodes in batches of 30. This overloaded dhcp or tftp on svr031, so in fact only 4 nodes in that batch were successful. Subsequently we did batches of 12, which worked fine. We could also put a larger stagger on the powernode reboot script (we had only used 1s); a minimal sketch of the batching and stagger idea is included after this list.
    Analysis: It was always going to be hard to know what batch size we could get away with until we tried, and it was easy to work around. Because of this node throughput limitation our rebuild time for the whole cluster is probably ~2-3 hours.
  2. At the last minute I decided to just drop alice and babar to stop us from supporting VOs who just don't, or can't, use us (it's just clutter). However, that change was imperfectly expressed in site-info.def, so on the first batches of nodes YAIM just didn't run.
    Analysis: This was a mistake. Andrew and I should have co-ordinated better and had more time to review the new user information files.
  3. There were a few problems with the user information files: sgm and prd accounts weren't initially in the normal VO group. In addition local Glasgow users were in the wrong group. This was fixed pretty rapidly.
    Analysis: As above. This aspect of the preparation was too close to the critical path - and it didn't work first time.
  4. The new server certificates were botched initially. Although we were in downtime and it was relatively easy to correct, it was a distraction.
    Analysis: We need to document local procedures for certificate handling better.
  5. We'd been obsessing about the batch worker configurations, with the intention of leaving the servers pretty much alone. However, we hadn't twigged that the change to pooled accounts for sgm and prd users would, of course, require the LCMAPS groupmapfile and the grid-mapfile to be updated. As no one on site is an sgm or a prd user, this was not picked up during testing. It only came to light once I did a logfile analysis of why ops tests were failing (these are done as an sgm ops user). Later in the evening it became clear that this also had to be done for the DPM disk servers.
    Analysis: If I'd been sharper I would have realised this in advance (but there was a lot on my mind). It would be useful if one of us had a special role to do this kind of testing (the gridpp VO would be ideal). However, it would actually have been a terribly hard thing to test, as the site was "live" during the testing phase and this problem's solution implied reconfiguring the CE as well as the pool accounts. Hopefully writing it down here will make us more cognisant of this next time! A sketch of a mapfile sanity check for the CE and disk servers is included after this list.
  6. Running YAIM automatically is all well and good, but how do we know it's run successfully? We not only had nodes where YAIM just hadn't run, we also had (and this was the last problem to be fixed) two bad nodes where the directories in /opt ended up in mode 0700, so were unreadable.
    Analysis: We need to develop a test and alarm system which attempts to validate the YAIM run. At the moment we're pretty much flying blind. The two proxy checks I ended up using yesterday (rolled into a rough script after this list) were:
    1. Look for files generated by YAIM, e.g., /opt/glite/etc/profile.d/grid-env.sh. There should be a nagios alarm or a cfengine warning if this file is absent.
    2. Check permissions on directories such as /opt/glite/etc. If this is not readable to a pool account then something has gone wrong.
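
Here's a minimal sketch of the batch-and-stagger reboot idea from point 1, in Python. It is not our actual powernode script: the node names are illustrative and "powernode" stands in for whatever remote power-cycle mechanism is really in use; only the batch size and stagger reflect what we learned yesterday.

    #!/usr/bin/env python
    # Staggered, batched worker node reboot (sketch only).
    # "powernode" is a stand-in for the real power-cycle command.
    import subprocess
    import time

    BATCH_SIZE = 12   # batches of 30 overloaded dhcp/tftp on svr031
    STAGGER = 5       # seconds between nodes in a batch (1s proved too short)
    BATCH_GAP = 900   # seconds to let a batch PXE-boot and rebuild

    nodes = ["node%03d" % n for n in range(1, 141)]  # illustrative names

    for i in range(0, len(nodes), BATCH_SIZE):
        batch = nodes[i:i + BATCH_SIZE]
        for node in batch:
            subprocess.call(["powernode", "reboot", node])
            time.sleep(STAGGER)
        # Give the batch time to rebuild before hitting dhcp/tftp again.
        time.sleep(BATCH_GAP)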
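
For point 5, this is the kind of mapfile sanity check we could run on the CE and the DPM disk servers after reconfiguring for pooled sgm/prd accounts. The file locations and the FQAN patterns are assumptions based on a typical gLite layout rather than a definitive description of what YAIM writes, so treat it as an illustration of the check, not the check itself.

    #!/usr/bin/env python
    # Sanity check (sketch) that sgm/prd mappings point at pool accounts.
    # Paths and patterns are assumptions for a typical gLite node.
    import re
    import sys

    MAPFILES = [
        "/etc/grid-security/grid-mapfile",      # DN/FQAN -> account
        "/opt/glite/etc/lcmaps/groupmapfile",   # FQAN -> group
    ]

    # Roles that should now land in pool accounts rather than static ones.
    ROLE_PATTERN = re.compile(r"sgm|prd|lcgadmin|production", re.I)

    problems = []
    for path in MAPFILES:
        try:
            lines = open(path).readlines()
        except IOError as err:
            problems.append("cannot read %s: %s" % (path, err))
            continue
        found = False
        for n, line in enumerate(lines):
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if not ROLE_PATTERN.search(line):
                continue
            found = True
            target = line.split()[-1]
            # In the grid-mapfile a pool account mapping starts with ".";
            # a bare account name suggests the old static mapping survived.
            if "grid-mapfile" in path and not target.startswith("."):
                problems.append("%s:%d maps to static account '%s'"
                                % (path, n + 1, target))
        if not found:
            problems.append("no sgm/prd entries at all in %s" % path)

    if problems:
        for p in problems:
            print("WARNING: %s" % p)
        sys.exit(1)
    print("OK: sgm/prd mappings look pooled")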
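
And for point 6, the two proxy checks above rolled into a rough nagios-style plugin. The sentinel file and the directories are just the ones from yesterday's debugging; the lists would clearly need extending, and cfengine or nagios would need to run it on every node after a rebuild.

    #!/usr/bin/env python
    # Rough nagios-style check that YAIM appears to have run and that
    # the gLite directories are readable by pool accounts (sketch only).
    import os
    import stat
    import sys

    SENTINEL_FILES = ["/opt/glite/etc/profile.d/grid-env.sh"]
    READABLE_DIRS = ["/opt/glite", "/opt/glite/etc"]

    errors = []

    for path in SENTINEL_FILES:
        if not os.path.isfile(path):
            errors.append("missing %s (did YAIM run?)" % path)

    for path in READABLE_DIRS:
        if not os.path.isdir(path):
            errors.append("missing directory %s" % path)
            continue
        mode = stat.S_IMODE(os.stat(path).st_mode)
        # Pool accounts need at least r-x for "other" on these directories.
        if mode & 0o005 != 0o005:
            errors.append("%s is mode 0%o, not world readable" % (path, mode))

    if errors:
        print("CRITICAL: " + "; ".join(errors))
        sys.exit(2)
    print("OK: YAIM sentinel present and directories readable")
    sys.exit(0)
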
Summarising, I think a pretty good job was done yesterday. It was a major upgrade and our first significant downtime since last November. If we can keep these sorts of interventions down to the 1-2 day level then the site will continue to be considered a good one.

However, we're working as a team now, rather than me playing Lone Ranger. This makes co-ordination, documentation and testing even more vital. Once Mike comes properly on board his first major task will be to understand and then document how the cluster is run.
