Thursday, February 19, 2009

redundancy in grid services proves its worth

Since about midway through 2008 there has been a second lcg-CE, svr026, at Glasgow. This has proved useful for dealing with intermittent failures in the course of normal operations. However, some time last year the original lcg-CE, svr021, started to segfault when the VOMS plugin was activated. VOMS was therefore turned off on svr021, and the second lcg-CE took on the role of offering VOMS mapping to group accounts, while the original CE dealt with mapping local users to local accounts with no VOMS. So, in an effort to fix svr021 and build a fully redundant system, some changes were scheduled: a server rebuild of svr021, and NFS mounting of the gridmapdir to share it between all the CEs. However, taking down key grid services in a running production cluster full of jobs is not a trivial matter, so here is an overview of the experience for our lcg-CEs.

Of the tasks we had to do, the easiest by far was NFS mounting the gridmapdir (used to track the VOMS pool accounts). The share was created on disk037 and a cron'd rsync set up from the second lcg-CE, svr026, which had working VOMS. This allowed a mirror of the currently running gridmapdir to be created. After some testing on development, and during a quiet moment, the gridmapdir on svr026 was blown away and remounted from NFS. This was successful as far as I could tell. For resilience, another cron'd rsync script was set up copying from the NFS share to a backup directory on svr026, so that should the NFS fail it will revert back to a local gridmapdir automatically. This will also be done on svr021 as part of the rebuild.
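To give a flavour of the failover logic, it can be as simple as a cron job along the following lines. This is a minimal sketch, not our actual script: the paths, the backup location and the script name are all hypothetical.

    #!/bin/bash
    # gridmapdir-failover.sh -- hypothetical sketch of the mirror-and-failover job.
    # Assumes the live gridmapdir is NFS mounted from disk037 at $NFS_DIR and a
    # local backup copy is kept at $BACKUP_DIR.

    NFS_DIR=/etc/grid-security/gridmapdir
    BACKUP_DIR=/var/local/gridmapdir.backup

    if mountpoint -q "$NFS_DIR"; then
        # NFS is healthy: refresh the local backup copy.
        rsync -a --delete "$NFS_DIR/" "$BACKUP_DIR/"
    else
        # NFS mount has gone: bind the local copy into place so that
        # pool account mappings keep working until NFS comes back.
        logger -t gridmapdir "NFS gridmapdir lost, reverting to local copy"
        mount --bind "$BACKUP_DIR" "$NFS_DIR"
    fi

Run every few minutes from cron, this keeps the backup fresh while the NFS is up and swaps in the local copy if it goes away. Note the check shown is simplified: a hung (rather than absent) NFS mount would need a timeout around it.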

svr026 at this time, although supporting VOMS, did not support the "don't map local accounts to pool accounts" addition that svr021 had been retrofitted with, as outlined here: http://scotgrid.blogspot.com/2008/02/to-voms-or-not-to-voms-that-is-question.html . This was applied to svr026; again it was tested thoroughly on development and then applied successfully to production. So now svr026 was nearly in a state to become the primary lcg-CE. What it did not have was support for all the available queues at Glasgow, as historically some queues were shared and some were specific to svr021. This was fixed by re-running yaim on svr026 with the appropriate $QUEUES set in site-info.def. Having a development CE to test this on was invaluable, since the last time yaim had been run on a CE was some time last year, and the behaviour of one of the functions this time round actually caused yaim to fail. With a fix in place, svr026 now supported all queues.
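For anyone unfamiliar with yaim, the change amounts to something like the following. The queue names are illustrative rather than our real list, and exact paths and node type names vary between gLite releases.

    # site-info.def excerpt: declare every queue the CE should publish
    QUEUES="atlas biomed dteam ops local"

    # one *_GROUP_ENABLE per queue (variable name follows the queue name),
    # listing the VOs/FQANs allowed to submit to it
    ATLAS_GROUP_ENABLE="atlas"
    BIOMED_GROUP_ENABLE="biomed"

    # then reconfigure the node
    /opt/glite/yaim/bin/yaim -c -s site-info.def -n lcg-CE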

This meant that svr021 was ready for draining. The documented procedure for draining an lcg-CE relies on disabling queues in the batch system, pbs in this case. However, in our case we have shared queues, so disabling svr021's queues would have disabled svr026's too. Not what we wanted at all. So a workaround was found: modify the GIP plugin on svr021 to always set the CEStatus values to Draining. This was then picked up automatically via LDAP and our site published svr021 as draining. This should have been enough, were it not for direct submission to a CE: GStat and other tools that use LDAP to determine available resources checked the CEStatus and knocked out svr021, but custom JDL and Ganga scripts can just select a CE explicitly in the script. After some investigation with running jobs it became apparent that in fact many of the local ScotGrid users use the Grid in this way. Doh. So after some experimentation on development we found that you could remove the CE from the hosts.equiv file on the batch system even when there were running jobs. This effectively stopped submission to the CE dead in its tracks and, rather handily, allowed running jobs to finish successfully. The only other thing to remember to do was to downtime the node in the GOCDB, as SAM would start failing once we were no longer advertising or accepting jobs on our svr021 lcg-CE.
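To give a flavour of the workaround: a GIP plugin is just a script in the plugin directory whose LDIF output gets merged over the static values. Something like the sketch below would force every queue on the CE to Draining. The dn, the queue names and the plugin path are illustrative and would need to match the CE's static LDIF.

    #!/bin/bash
    # Hypothetical GIP plugin, e.g. /opt/lcg/var/gip/plugin/force-draining.
    # Emits LDIF overriding GlueCEStateStatus for each published queue;
    # the GIP merges these attributes over the static configuration.
    for QUEUE in q30m q6h q1d q7d; do
        echo "dn: GlueCEUniqueID=svr021.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-${QUEUE},mds-vo-name=local,o=grid"
        echo "GlueCEStateStatus: Draining"
        echo ""
    done

Blocking the direct submitters then needed no script at all: deleting svr021's entry from hosts.equiv on the pbs server stops new submissions being accepted from it while leaving running jobs untouched.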

With svr021 now draining, the last piece of the puzzle is to rebuild it, but since we support 7 day queues at Glasgow it's going to be a long wait. Once it's drained the plan is to make sure we have our 90 days' worth of logs, run APEL to publish our final accounting results, rebuild, apply the local-novoms patch and mount the gridmapdir from the NFS share. All going well we should have two fully operational lcg-CEs by next week with no lost jobs in sight.
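The final mount on the rebuilt machine is then just an fstab entry along these lines (the export path shown is illustrative):

    # /etc/fstab on the rebuilt svr021 -- export path is illustrative
    disk037:/export/gridmapdir  /etc/grid-security/gridmapdir  nfs  hard,intr,bg  0 0

With the local backup rsync described above in place as well, either CE can carry on mapping pool accounts even if the NFS server goes away.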

Conclusions
Grid service redundancy allowed our Glasgow cluster to operate for some time in a semi-broken state without the immediate requirement for a rebuild.
Having a test/dev server on which to test changes before applying them to production was invaluable.
Running mirrored grid services while performing maintenance tasks is a perfect way to keep your cluster accessible and running during downtime.
7 days is a long time to wait for a queue to drain!
