Tuesday, February 24, 2009

the ce lives!

svr021 is back on-line and accepting jobs. The beefier hardware now includes four dual-core 2.4 GHz CPUs with 8GB of memory and a RAID 1 set-up. The rebuild process was pretty smooth, thanks to a development CE for testing and cfengine to roll the changes out. So we now have VOMS and the local user no-VOMS hack working on both CEs, with a shared gridmapdir for easier pool account tracking. SAM appears to be okay, although we may have to fix some failing NGS-specific INCA tests. These are failing because they expect specific pool account mappings that just aren't there any more.

Thursday, February 19, 2009

redundancy in grid services proves its worth

Since about midway through 2008 there has been a second lcg-CE at Glasgow, svr026. This has proved useful for dealing with intermittent failures in the course of normal operations. However, some time last year the original lcg-CE, svr021, started to segfault when the VOMS plugin was activated. VOMS was therefore turned off, and the second lcg-CE took on the role of offering VOMS mapping to group accounts while the original CE dealt with mapping local users to local accounts with no VOMS. So, in an effort to fix svr021 and build a fully redundant system, some changes have been scheduled: a server rebuild of svr021, and NFS mounting of the gridmapdir to share it between all CEs. However, taking down key grid services in a running production cluster full of jobs is not a trivial matter, so here is an overview of the experience for our lcg-CE.

Of the tasks we had to do, the easiest by far was NFS mounting the gridmapdir (used to track the VOMS pool accounts). This was created on disk037, and a cron'd rsync was set up from the second lcg-CE, svr026, which had working VOMS. This allowed a mirror of the currently running gridmapdir to be created. After some testing on development, and during a quiet moment, the gridmapdir on svr026 was blown away and remounted from NFS. This was successful as far as I could tell. For resiliency, another cron'd rsync script was set up to copy from the NFS share to a backup directory on svr026, so that should the NFS fail it will revert back to a local gridmapdir automatically. This will also be done on svr021 as part of the rebuild.
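In outline, the fallback logic looks something like the following. This is a minimal sketch, assuming the gridmapdir the CE software uses is a symlink that can be repointed; the paths and the function name are invented for illustration, not our exact layout.

```shell
#!/bin/bash
set -e

# Point the CE at the NFS gridmapdir while it is mounted (refreshing a local
# backup copy as we go); fall back to the backup if the mount disappears.
sync_gridmapdir() {
    local nfs_dir=$1 backup_dir=$2 link=$3
    if mountpoint -q "$nfs_dir"; then
        rsync -a --delete "$nfs_dir/" "$backup_dir/"   # refresh local mirror
        ln -sfn "$nfs_dir" "$link"                     # use the NFS copy
    else
        ln -sfn "$backup_dir" "$link"                  # NFS down: revert
    fi
}

# Demonstration in a scratch directory: the fake "NFS" dir below is not a
# real mount point, so the fallback branch fires and the symlink ends up
# pointing at the backup copy.
tmp=$(mktemp -d)
mkdir -p "$tmp/nfs" "$tmp/backup"
sync_gridmapdir "$tmp/nfs" "$tmp/backup" "$tmp/gridmapdir"
readlink "$tmp/gridmapdir"
```

In production this would simply be driven from cron every few minutes, alongside the existing rsync pushing the live gridmapdir out to the NFS server.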

svr026 at this time, although supporting VOMS, did not support the "don't map local accounts to pool accounts" addition that svr021 had been retrofitted with, as outlined here: http://scotgrid.blogspot.com/2008/02/to-voms-or-not-to-voms-that-is-question.html . This was applied to svr026; again it was tested thoroughly on development and then applied successfully to production. So now svr026 was nearly in a state to become the primary lcg-CE. What it did not have was support for all the available queues at Glasgow, as historically some queues were shared and some were specific to svr021. This was fixed by re-running yaim on svr026 with the appropriate $QUEUES set in site-info.def. Having a development CE to test this on was invaluable, since the last time yaim had been run on a CE was some time last year, and the behaviour of one of its functions this time round actually caused yaim to fail. With a fix in place, svr026 now supported all queues.
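For reference, site-info.def is itself a bash-sourced file, so the queue change amounts to something like the fragment below. The queue names are taken from the list-match output in the next post plus the 7-day queue, purely as an illustration, and the YAIM node type name may differ by release.

```shell
# site-info.def fragment -- illustrative queue list, not our exact config
QUEUES="q30m q6h q1d q2d q7d"

# After editing, regenerate the CE configuration with the standard gLite
# YAIM invocation (node type name is an assumption here):
#   /opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n lcg-CE
```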

This meant that svr021 was ready for draining. The documented procedure for draining an lcg-CE relies on disabling queues in the batch system, PBS in this case. However, we have shared queues, so disabling svr021's queues would have disabled svr026's too. Not what we wanted at all. So a workaround was found: modify the gip plugin on svr021 to always set the CEStatus values to Draining. This was then picked up automatically by LDAP, and our site posted svr021 as draining. This should have been enough, were it not for direct submission to a CE: GStat and other tools that use LDAP to determine available resources checked the CEStatus and knocked out svr021, but custom JDL and Ganga scripts can simply select a CE in the script. After some investigation with running jobs it became apparent that many of the local ScotGrid users use the Grid in exactly this way. Doh. So after some experimentation on development we found that you could remove the CE from the hosts.equiv file on the batch system, even with jobs still running. This stopped submission to the CE dead in its tracks and, rather handily, allowed running jobs to finish successfully. The only other thing to remember was to downtime the node in the GOCDB, as SAM would start failing once we were no longer advertising or accepting jobs on the svr021 lcg-CE.

With svr021 now draining, the last piece of the puzzle is to rebuild it, but since we support 7-day queues at Glasgow it's going to be a long wait. Once it's drained, the plan is to make sure we have our 90 days' worth of logs, run APEL to publish our final results, rebuild, apply the local-novoms patch and mount the gridmapdir from the NFS share. All going well, we should have two fully operational lcg-CEs by next week with no lost jobs in sight.

Grid service redundancy allowed our Glasgow cluster to operate for some time in a semi-broken state without the immediate requirement for a rebuild.
Having a test/dev server with which to test changes before applying them to prod was invaluable.
Running mirrored grid services to perform maintenance tasks is a perfect way to keep your cluster accessible and running during what would otherwise be downtime.
7 days is a long time to wait for a queue to drain!

Wednesday, February 18, 2009

Draining an lcg-CE

Well, draining an lcg-CE should be fairly straightforward if you have n CEs and n separate queues. However, in Glasgow's case we have 2 CEs that share queues, so it is just not as simple as running qmgr -c 'set queue <queuename> enabled = false' on torque/maui, as this would have put both CEs into drain!

After much playing around, breaking the site BDII and getting Graeme's help to fix it, it appears that the only way I can see to do this is to hack the gip plugins on the CE you wish to drain. The plugin in question appeared to be /opt/glite/etc/gip/plugin/glite-info-dynamic-ce, which subsequently calls /opt/lcg/libexec/lcg-info-dynamic-pbs.

This script contains the dynamic qstat queries to find out the state of the queues. So, in order to drain a specific CE where queues are shared, one possible solution is to hack this file to change the line:

push @output, "GlueCEStateStatus: $Status\n";

to force drain with:

push @output, "GlueCEStateStatus: Draining\n";
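Applied with sed, the hack looks roughly like this, demonstrated here on a scratch copy rather than the live /opt/lcg/libexec/lcg-info-dynamic-pbs; keeping the .orig backup lets you revert once the drain is over.

```shell
#!/bin/bash
set -e
# Scratch copy standing in for /opt/lcg/libexec/lcg-info-dynamic-pbs
plugin=$(mktemp)
cat > "$plugin" <<'EOF'
push @output, "GlueCEStateStatus: $Status\n";
EOF

# Force every queue this CE publishes into Draining, keeping a backup copy
sed -i.orig 's/GlueCEStateStatus: \$Status/GlueCEStateStatus: Draining/' "$plugin"
grep GlueCEStateStatus "$plugin"
# -> push @output, "GlueCEStateStatus: Draining\n";
```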

This worked as the LDAP query to svr027 now showed 4 queues in drain:

svr021:/opt/glite/etc/gip/ldif# ldapsearch -xLLL -b mds-vo-name=UKI-SCOTGRID-GLASGOW,o=grid -p 2170 -h svr027.gla.scotgrid.ac.uk | grep Dra
GlueCEStateStatus: Draining
GlueCEStateStatus: Draining
GlueCEStateStatus: Draining
GlueCEStateStatus: Draining

and glite-wms-job-list-match no longer offers svr021's queues for use through the WMS:

-bash-3.00$ glite-wms-job-list-match -a --vo vo.scotgrid.ac.uk hello.jdl
Connecting to the service https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server
The following CE(s) matching your job requirements have been found:

- svr026.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q1d
- svr026.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q2d
- svr026.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q30m
- svr026.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q6h

However, this does not stop direct job submission via Globus. After much playing around we found that if you edit the hosts.equiv file on the Torque server, you can stop job submission from the desired CE but still allow running jobs to finish. Handy, that - it's just what we need, as we were seeing some users still using direct submission even though we were trying to drain them.
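In practice the change is just a line removal on the pbs_server host, demonstrated here on a scratch stand-in for the real hosts.equiv; the hostnames mirror our CEs but the exact file location on your Torque server may vary.

```shell
#!/bin/bash
set -e
# Scratch stand-in for the Torque server's hosts.equiv
he=$(mktemp)
printf '%s\n' svr021.gla.scotgrid.ac.uk svr026.gla.scotgrid.ac.uk > "$he"

# Drop only the draining CE: pbs_server then refuses new submissions from
# svr021, while jobs it already has queued or running carry on untouched.
sed -i.bak '/^svr021\./d' "$he"
cat "$he"
# -> svr026.gla.scotgrid.ac.uk
```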

Thursday, February 12, 2009

HammerCloud 135---A Load Shared is a Load Halved, to a point.

We performed our splitting of the DPM across two hosts just in time for the most recent HammerCloud test on UK sites:

So, we already have some metrics to compare the old arrangement with the new.
For reference, Graeme blogged about the last big HammerCloud UK test here, where we were getting an event rate of around 10Hz, at the cost of the DPM head node running at an unsustainable load.
Since then, a couple of HammerClouds have come by, generally coincident with ATLAS production and other stresses on the DPM, and it has just utterly failed to cope.

After our surgery, we did a lot better:

with an event rate of about 14 Hz, almost a 50% improvement.

and, the load on the DPM head node was very much more acceptable, given the increased power of the hardware:

However, we're still not close to maxing out the pool nodes:

probably because we've hit another, higher, performance bottleneck on the new svr015 "MySQL server" machine:

that orangish stuff is the CPU in I/O Wait state, waiting for seeks within the DB.
We're currently looking at ways of tuning MySQL, or the disk, to improve this performance, since it looks like there's another 30 to 40% of performance there, at least.

Some ideas we've had include splitting the dpm_db and cns_db across different filesystems (since they have very different access patterns for this kind of use), tweaking MySQL settings (although they look generally fine...), or even getting Faster Disks. Roll on solid state drives, we say!
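To make the tuning ideas a little more concrete, the sort of my.cnf changes under consideration look like the fragment below. The values are purely illustrative, not measured recommendations or our deployed settings.

```ini
# /etc/my.cnf fragment -- illustrative values only
[mysqld]
# One .ibd file per table instead of the single shared tablespace; this is
# what makes it practical to put dpm_db and cns_db on separate filesystems
innodb_file_per_table

# Cache more hot index/data pages in RAM to cut down on the seeks that
# show up as I/O wait on svr015
innodb_buffer_pool_size = 2G

# Skip the OS page cache to avoid double-buffering InnoDB data
innodb_flush_method = O_DIRECT
```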

DPM improvements!

Ever since ATLAS analysis has been enabled at Tier 2 sites (and the relevant sheaves of AOD files have arrived at our DPM), the Glasgow DPM has been looking increasingly strained.
This first became obvious during the HammerCloud tests for analysis in December, but over January it became increasingly clear that the access patterns of normal analysis jobs, en masse, are quite enough to make the storage unreliable for other users.
In particular, we had one period where chunks of ATLAS production work died because the DPM was so overloaded.

Looking at the DPM during these periods, it looked like it was a combination of I/O waits and, more significantly, the dpm and srmv2.2 daemons maxing out the CPU.

Last Friday, we tried "optimising" the DPM MySQL backend by taking the dpm offline, and then exporting, dropping, and reimporting the dpm_db and cns_db databases. The InnoDB engine has an issue that it sometimes becomes fragmented, increasing the size of the physical DB file and reducing performance - reimporting from a logical backup usually reduces this fragmentation in the restored DB.
Unfortunately, this reimporting process took far longer than we anticipated---on the order of 5 hours!---and, in the end, resulted in a distinctly unimpressive 10% size reduction in the physical DB.
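For the record, the procedure was along these lines, shown here as a dry run that just echoes each step rather than executing it. The database names are real; the dump location, service names and credentials handling are assumptions.

```shell
#!/bin/bash
# Dry run of the defragment-by-reimport cycle: each step is echoed, not
# executed. To do it for real, change run() to execute "$@" instead.
run() { echo "+ $*"; }

DUMP=/root/dpm-defrag.sql    # illustrative dump location

# 1. Take the DPM offline so nothing writes to the DB mid-dump
for svc in dpm srmv2.2 dpnsdaemon; do run service "$svc" stop; done

# 2. Logical dump of both databases
run mysqldump --databases dpm_db cns_db --result-file="$DUMP"

# 3. Drop and reimport: InnoDB rebuilds the tablespace without the
#    accumulated fragmentation
run mysql -e 'DROP DATABASE dpm_db; DROP DATABASE cns_db;'
run mysql -e "source $DUMP"

# 4. Bring the daemons back up
for svc in dpnsdaemon srmv2.2 dpm; do run service "$svc" start; done
```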

After bringing things back up again, however, it became clear that the performance hadn't changed much, and that it was most likely that we just needed to give the DPM processes more room to breathe.
Our DPM is considerably underspecced compared to our new worker nodes (which are lovely 8-core machines, at higher clock rates), but, of course, has the benefit of RAIDed storage to give our DB a bit more reliability. So, we decided to take the big step of splitting the DPM across two nodes - the old DPM being moved to a role as "MySQL backend server", and the "new" DPM being a repurposed worker node hosting all the DPM services.

Thanks to cfengine, and the arcane workings of YPF, it isn't too hard to make a node into any other kind of node that we want---the tricky bit, in this case, is swapping the hostnames, so that the "new" DPM still gets to be svr018, while the old DPM gets moved to svr015 (and also hosts our DPM monitoring stuff now).
The new svr018 used to be node310 - the last node in our pool of new worker nodes - which I'd previously taken offline and allowed to drain over the weekend in anticipation of this.
However, thanks to some synchronized administration by Mike and myself, things seemed to go relatively smoothly with the move on Monday, with only an hour of downtime and barely a failed job in sight, despite being full of ATLAS production at the time.

It looks like this also improved our HammerCloud performance, about which more in a later post.