Tuesday, February 27, 2007

Devolution? cfengine style

In an attempt to allow the stragglers of the cluster to feel as if they are doing their own thing, we have let them lurk on the right-hand side of the machine room, safe in their "slightly different hardware" setup, feeling "Speshul". Time... for some "gentle persuasion" in the guise of cfengine.

First off, not all disk servers are created equal. Once we'd created and formatted the array, we then had to mount it. Cue the following snippet in the directories: stanza
disksvr::
# Mount points for Raid space
/gridstore0 mode=0755 owner=root group=root
/gridstore1 mode=0755 owner=root group=root
/gridstore2 mode=0755 owner=root group=root
/gridstore3 mode=0755 owner=root group=root
/gridstore4 mode=0755 owner=root group=root


and a new central /etc/fstab thrown out to all the nodes (in the copy: stanza)
disksvr::
$(skel)/disksvr/etc/fstab mode=644 dest=/etc/fstab type=sum
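
For the record, the relevant lines in that central fstab look roughly like the sketch below - the devices and mount points come from the stanzas above and the tune2fs loop further down, but the filesystem type and mount options here are assumptions rather than a verbatim copy:
/dev/sdb1    /gridstore0    ext3    defaults    0 0
/dev/sdb2    /gridstore1    ext3    defaults    0 0
/dev/sdb3    /gridstore2    ext3    defaults    0 0
/dev/sdb4    /gridstore3    ext3    defaults    0 0
/dev/sdb5    /gridstore4    ext3    defaults    0 0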


Finally I remembered that I should have reduced root's reserved space to zero on those chunks:
for i in `seq 1 5` ; do tune2fs -r 0 /dev/sdb$i ; done

Monday, February 26, 2007

SplayTime Finally Cracked



Finally I've got splaytime to work. Nothing to do with capitalisation; it needs to go into update.conf, not cfagent.conf. Notice the hourly spikes disappear after 1800 - that's when the splaytime of 25 minutes went into update.conf and started working.
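
For anyone else chasing this, the working bit is just the control stanza in update.conf - a minimal sketch, assuming cfengine 2 syntax, not a verbatim copy of ours:
control:
   any::
      splaytime = ( 25 )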

LCG Gatekeeper On svr016 Patched For stagein

In our investigations for supporting NGS it became clear that the lcg-CE does not support GRAM stagein properly. There is a patch for this available in Savannah, which I have now applied to the CE.

Olivier and Alessandra were running this patch at IC and Manchester, so I was reassured enough to apply it to the live system. Checking the gatekeeper logs it doesn't seem to have broken anything, and hopefully it will make a difference for our local users attempting to get GRAM working properly.

Just in case the CE goes down in flames, the patched Helper.pm has been added to cfengine's skeleton files and a new line added to the copy stanza:

copy:
grid.ce::
# This applies patch https://savannah.cern.ch/bugs/?func=detailitem&item_id=4400 to the CE
# which should allow for stagein with globus-job-{submit,run}. Also rumored to help
# with ATLAS condor submissions...
$(skel)/ce/gatekeeper/Helper.pm mode=0755 dest=/opt/globus/lib/perl/Globus/GRAM/Helper.pm type=sum


Finally, there's a rumor that this patch might help with ATLAS condor submissions. For the record, svr016 is currently at 70% on the ATLAS Prod site efficiency pages. I'll keep an eye on this (although it mixes condor and RB submissions, so it's quite hard to track each method separately).

cfengine Running From Cron Too

I found another node on which cfengine (more precisely cfexecd) had died - so it hadn't run for some time. I have now added a cfagent run from cron, which fires every half hour.
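
The cron entry is nothing fancy - something along these lines in root's crontab (a sketch; the half-hourly schedule is ours, the rest is standard cfengine 2):
# run cfagent via cfexecd every half hour
0,30 * * * * /usr/sbin/cfexecd -F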

In addition, I've added a "processes" stanza to the cfengine configuration, which means that if any of the cfengine daemons are not running then the crontabbed cfagent run will restart them:

processes:
any::
"cfenvd" restart "/usr/sbin/cfenvd"
"cfservd" restart "/usr/sbin/cfservd"
"cfexecd$" restart "/usr/sbin/cfexecd"


Note the regexp on cfexecd - it's to stop the "cfexecd -F" run from cron from matching.

A couple of other things:

  • I think I found the problem with my "splaytime" directive: it's "SplayTime", not "splaytime"!
  • I added the time class Hr04 to the tidy stanza (sketched below). From running cfagent by hand it was clear that this was the most expensive operation - and it really only needs to run once a day.
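
The tidy change is just a time class on the stanza, roughly like this - a sketch in which the path and ageing parameters are made up; only the Hr04 class is the real change:
tidy:
   Hr04::
      /tmp pattern=* recurse=inf age=7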

Durham Queue reorganisation

Last week Durham implemented a new queue structure; we now have three grid queues -

  • ops (for ops jobs)
  • pheno (for the phenogrid VO)
  • grid (for all other grid jobs)
We have also implemented job suspension based upon queue priorities, which allows us to give local queues priority over grid jobs. How to configure preemption has been written up at http://www.gridpp.ac.uk/wiki/Maui_Preemption
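
The guts of it in maui.cfg look something like the sketch below - the queue and QOS names here are illustrative, not our real config; the wiki page has the full recipe:
PREEMPTPOLICY      SUSPEND
QOSCFG[gridqos]    QFLAGS=PREEMPTEE
QOSCFG[localqos]   QFLAGS=PREEMPTOR
CLASSCFG[grid]     QDEF=gridqos
CLASSCFG[local]    QDEF=localqos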

Minor Cluster Updates

In anticipation of losing svr031 for a few days - and thus any easy way of keeping the cluster up to date - I've been patching us up to the latest gLite release, r13.

There seems to be no good way of doing this other than a full "yum update" - updating just the metapackage doesn't pull in all of the other installed RPMs that have updates available, e.g.,


node001:~# yum update
[...]
I will do the following:
[update: edg-mkgridmap-conf 2.7.0-1_sl3.noarch]
[update: glite-config 1.8.4-0.noarch]
[update: glite-lb-common 3.0.6-1.i386]
[update: edg-mkgridmap 2.7.0-1_sl3.noarch]
[update: glite-rgma-command-line 5.0.4-1.noarch]
[update: glite-WN 3.0.13-0.noarch]
[update: glite-security-gsoap-plugin 1.2.5-0.i386]
[update: lcg-ManageSoftware 2.0-6.noarch]
[update: glite-rgma-api-python 5.0.10-1.noarch]
[update: glite-rgma-base 5.0.7-1.noarch]
[update: glite-wms-common 1.5.14-1.i386]
I will install/upgrade these to satisfy the dependencies:
[deps: lcg-tags 0.2.1-1.noarch]

But updating only the metapackage gives a paltry:

node001:~# yum update glite-WN
I will do the following:
[update: glite-WN 3.0.13-0.noarch]
I will install/upgrade these to satisfy the dependencies:
[deps: lcg-tags 0.2.1-1.noarch]


I think there's an argument for running a "yum update" nightly on the worker nodes, but it still seems far too dangerous a thing to do on the servers. Too much of a risk of daemons not restarting properly or java being arsed up.

I also added the new VOMS certificate for dzero - using cfengine this is easy. Unfortunately, not everything is yet cfenginified: the machines currently not under cfengine control are svr019 (MON), svr021 (site BDII) and the disk servers. Should try and address this soon.
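
"Easy" here just means another line in the copy stanza, along these lines (the certificate file name below is illustrative, not the real one):
copy:
   any::
      # push the new dzero VOMS server certificate to every node (hypothetical filename)
      $(skel)/vomscerts/dzero-voms.pem mode=0644 dest=/etc/grid-security/vomsdir/dzero-voms.pem type=sum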

Thursday, February 22, 2007

svr031 removed from active cluster role

It's taken a while, but svr031 has now been taken out of active service.

All machines have had /etc/hosts and /etc/resolv.conf files put on them, which takes care of internal cluster name resolution, and they have had their DNS servers pointed at the normal university ones.

In addition, one of the NAT hosts was set up as a gateway machine for the worker nodes. The workers were (carefully) told to use this new host as their default gateway.
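
For the record, the gateway change amounts to something like the following sketch - the interface name and addresses are assumptions, not our real values:
# on the NAT host: enable forwarding and masquerade traffic leaving the external interface
echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -o eth1 -j MASQUERADE

# on each worker node: point the default route at the NAT host's internal address
route del default
route add default gw 10.141.0.1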

So, nothing is relying on any services provided by svr031 and we can prepare for a reinstall next week.

I had to do a bit of resuscitation of svr031, so that one can at least log in via ssh and scp files to and from it. I've restored some of the library paths to get auxiliary commands working.

The cluster itself has been remarkably untroubled by svr031 being in a tizzy - I thought about putting us into a precautionary downtime, but this has not been necessary. We've carried on passing the SAM tests without trouble.

Leaders? Who needs 'em?

The President's Brain Is Missing...

svr031, the Glasgow cluster headnode, is currently a bit FUBAR.

Grieg was mirroring the dCache repository from DESY, because the mandatory web caching policy at Glasgow stuffs up yum big time, hence any repos need to be mirrored locally for installation to work. However, his mirror script went crazy and managed to wipe the whole of /etc. Arggg!

Backup, what backup? It's a RAID 5 disk - we didn't need a backup (whoops)...

svr031's roles in the cluster are:

  • NAT box for WNs
  • DNS server for whole of the cluster
  • Central syslogger
  • DHCP server
  • tftp server for kickstarting
  • http server for installation, nagios and ganglia


The critical run-time services are NAT and DNS. Fortunately the DNS server on svr031 and the NATing are still working, so even though the president's brain is missing the organs of the state still function, for now.

The immediate things to be done are to remove svr031's run time functions from the cluster. This will consist of:

  1. Generate and copy an /etc/hosts file with a complete set of entries for the internal cluster machines - so that the internal DNS is not required (a sketch of this is at the end of this post).
  2. Update /etc/resolv.conf on the cluster to use the standard university DNS servers
  3. Setup NAT via the dedicated NAT boxes


After those steps have been taken our dependency on svr031 should be removed, and we can work on reinstalling it and re-establishing its other services.
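
For step 1, the hosts file generation and push is only a few lines of shell - a sketch, assuming a nodes.list file of "IP shortname" pairs (the file name and format are my assumptions):
# append an entry for every internal cluster machine, then push the result everywhere
awk '{ printf "%s\t%s\n", $1, $2 }' nodes.list >> /etc/hosts
for h in $(awk '{ print $2 }' nodes.list) ; do
    scp /etc/hosts $h:/etc/hosts
done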

Wednesday, February 21, 2007

Durham's DPM and lcg-rep problems

The problem with lcg-rep and the DPM at Durham has been fixed; it was tracked down to a permissions problem on /etc/grid-security/gridmapdir. This directory needs to be group owned by dpmmgr, but it was group owned by edginfo. The DPM machine crashed a few weeks ago and the system had to fsck its disks on the way back up; I suspect this is when the ownership change happened. The reason that globus-url-copy worked while none of the other DPM protocols did is that dpm-ftpd runs as root, whereas all the other DPM daemons run as dpmmgr.
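
The fix itself was trivial once found - something like the following (the permission bits here are my guess at the standard DPM setup rather than copied from the box):
chgrp dpmmgr /etc/grid-security/gridmapdir
chmod 775 /etc/grid-security/gridmapdir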

Tuesday, February 20, 2007

Disk Servers

fdisk doesn't like "big" disks. GNU Parted is better and can do things like cope with GPT disk labels, but it needs some scripting to make it easier. Enter one uber-nasty script. Yes, I know it doesn't do *any* sanity checking that you've run it on the right host, but hey, sysadmins never make mistakes :-)

cat mkpart.sh
echo "Running script on $1"
PARTED=/sbin/parted
DISK=/dev/sdb
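# the mkpartfs boundaries below are in megabytes (parted's default unit here): five roughly 1.8TB data partitions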
ssh $1 <<EOF
hostname
$PARTED $DISK print
$PARTED $DISK mkpartfs primary ext2 0.017 1812371.050
$PARTED $DISK mkpartfs primary ext2 1812371.050 3624742.100
$PARTED $DISK mkpartfs primary ext2 3624742.100 5437113.150
$PARTED $DISK mkpartfs primary ext2 5437113.150 7249484.200
$PARTED $DISK mkpartfs primary ext2 7249484.200 9061855.233
$PARTED $DISK print
EOF


Also discovered while I was doing this that I'd only created a 2.1TB partition on disk040 as I forgot to enable the 64-bit LBA mode. Done now, and the volume is reinitialising.

> snmpget -v1 -c <communityname> -Pu -m all areca040 ARECA-SNMP-MIB::volProgress.0
ARECA-SNMP-MIB::volProgress.0 = INTEGER: 313
i.e. 31% done so far...

Glasgow CE slow march to death with alice?




I noticed last week, after coming back from Australia, that the load on the CE had been creeping up and up.

On further investigation I have found 1790 gatekeeper processes running as the user alice001 - but we have no alice jobs running. Indeed, the torque logs show that we've never run any alice jobs (well, only a few alicesgm jobs in December).

In fact the CE has hit swap now, so it's time to get biblical:

To every thing there is a season, and a time to every purpose under the heaven:
A time to be born, and a time to die; a time to plant, and a time to pluck up that which is planted;
A time to kill...


# kill $(ps aux | perl -ne 'print "$1 " if /^alice001\s+(\d+)/')


In the spirit of grid operations I have raised the issue as a GGUS ticket rather than emailing the user directly. This will be an interesting experiment in how well sites can raise problems with VOs through GGUS. Ticket #18703.

Mass job exit on UKI-SCOTGRID-GLASGOW



The Glasgow site had a very odd, and severe, drop in the number of running jobs at 0547 yesterday. At first I thought there had been a networking outage which had chopped the jobs, but on further examination it seems not. The hundreds of jobs which exited were all owned by a single (non-production) ATLAS user and seemed to exit "naturally" - they hadn't exceeded any queue limits.

The jobs were also running with a high CPU efficiency (e.g., cpu/wallclock of resources_used.cput=07:19:17 resources_used.walltime=07:50:56) - but they all exited at once.

Either the user found a problem with the jobs and cancelled them all (I wonder how one would see that from the gatekeeper logs; there doesn't seem to be any obvious way), or the jobs had a suicide pact.

Unfortunately this must have caused sufficient load on the CE to have us drop out of the information system for a time. Even running a separate site BDII doesn't seem to be a cure for all information system ills.

Thursday, February 15, 2007



This is a duck pond, right? I mean there's no way it's a lake...
Final presentation in Melbourne: "Grid Data Management and Storage (An EGEE-Centric View)".

This is organised by VeRSI (the Victorian eResearch Strategic Initiative). They have funding to set up multi-site storage of ~100TB to help scientists in Victoria share data and were very interested in the EGEE DM solutions. I ran through storage, catalogs, SRM 2.2, FTS, etc. (presentation here). However, my conclusion was that operations were more important than technology choices - perhaps that is the real lesson from EGEE.

After learning a bit more about their project, I was inclined to recommend dCache for them - they have a multi-site storage problem, with dedicated networking between the data centres, and dCache would seem to offer them the most flexible approach. Of course, as with many of the Australian grid projects, SRB seemed to be their de facto solution (needless to say, with Shibboleth authentication). However, in talking to their SRB expert about the LFC, I finally managed to get a number for the scaling of the SRB MCAT catalog - it can start to have problems when you have over 30,000 files. This seems terribly low - I know the LFC has been tested up to millions of entries on even quite modest hardware (although the LFC is just a file catalog, whereas MCAT is also a metadata catalog).

Of course, no software is a panacea, and they all have problems - perhaps the weakness of the EGEE solution is that there are so many bits to it - it would be quite a daunting thing to setup from scratch.

I'll be interested to see what they do decide in the end.
My fourth talk in Australia was the School of Physics Colloquium. Here was a chance to move away from the storage and data management focus, and deliver a much more general EGEE/LHC talk, which I entitled "Enabling Grids for eScience: How to build a working grid".

As ever, the RTM is a great start - it's such an attractive visualisation of the grid. This time I had no trouble getting it to work on my mac, although it only ever uses the primary display, so I had to put my laptop into mirror display mode, then flip back to two screens for the talk - a minor niggle.

I tried to speak about EGEE in a general way, introducing the project, the services delivered - even a slide on the "Life of a Job", problems commonly seen and operational aspects. The idea was to introduce the grid to potential users, rather than site admins or service providers. This seemed to go pretty well - no one obviously fell asleep (!) and there were a number of questions about EGEE and other grids, job efficiencies and data volumes. Later some of the graduate students complimented me on a good talk, so it must have been pitched at the right level.

I had intended to mention Byzantine Generals, but it slipped my mind. Single points of failure, eh...
Steve Lloyd was still having problems writing into Durham's DPM. I nabbed an ATLAS user and got them to help me, which has revealed an incredibly weird problem:

If the user is unknown to the DPM then lcg-cr will fail, with the cryptic error "transport endpoint not connected". Attempting to srmcp reveals an authentication failure: "SRMClientV1 : CGSI-gSOAP: Could not find mapping for: USERS_DN". Somehow the srm daemon is refusing to authenticate the user properly. It's actually quite a deep problem in the GSI chain, because the daemon doesn't even get as far as logging anything about the connection.

Drilling further down to globus-url-copy then revealed a bizarre workaround: a globus-url-copy command will create the mapping from the user's DN to a pool account. After this is done everything starts to work properly.
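
In practice that just means doing one copy as the new user before anything else, e.g. something like this (the hostname and path below are made up for illustration):
globus-url-copy file:///tmp/testfile \
    gsiftp://dpm.example.ac.uk/dpm/example.ac.uk/home/atlas/testfile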

We've checked to see if it's a problem with VOMS proxies and it isn't. It also doesn't seem to affect DPNS itself - even users who can't copy in files can do a dpns-mkdir and they get a new entry in Cns_userinfo fine.

Very mysterious.
Tuesday night: It's Linux Users Victoria.

These guys (and of the 50 or so there, 49 are guys) are uber-geeks. Talking to them is a lot of fun - they like the physics, they like the LHC, they even like the grid. Some of them want grid certificates. Personally I think we should give certificates to them - they'd probably fix lots of stuff!

The GridPP Real Time Monitor was a big hit.
Right, before the suntan gets washed off and all memory of antipodean adventure fades, better blog a bit about my visit to Melbourne Uni on the second week of the trip.

First Monday, it's VPAC. "Data Management and Storage in EGEE" is my talk.

VPAC is the Victorian Partnership for Advanced Computing. I'm giving much the same talk as I did at AUSGRID, but with some particle physics at the beginning. This is partly to flesh things out, but mostly because the LHC is so cool it makes the storage and data management parts of the talk more interesting, having set the context for why we are doing what we're doing. This is my first talk pretending to be a particle physicist - remarkable what an afternoon browsing the Standard Model on wikipedia can produce ;-)

The audience are mostly coming from an HPC background, so they are more used to providing compute resources and fast interconnects, but now they have an interest in providing cross-site storage, rather than just some dedicated disk hanging off their HPC. Oh, and they've heard about the grid.

Again, the scale of the EGEE project is generally appreciated - and, by implication, that the solutions we've adopted do scale. I'm specifically asked why we didn't use SRB. This is hard to give a definitive answer on as I don't know enough about SRB; however, the advantage of the EGEE solution (usefully characterised by Jens as component based) is that:

  1. It allows poorly performing components to be replaced individually, e.g., the replacement of the RLS with the LFC.
  2. It allows sites, whose storage is also serving local users and legacy applications, to layer an SRM interface on top of their storage, rather than have to implement a completely new system.
  3. It makes it easier to deploy at a large number of sites, where agreements across the whole grid are often difficult to achieve.

In the discussion it also becomes clear that SRB doesn't offer any good methods of data localisation - there's no easy way to ensure that jobs at a site can access SRB data efficiently. There doesn't seem to be an SRB LAN access protocol, like rfio or dcap.

After the talk I discuss SRB a little more with David Bannon. He says that the SRB model of federations of storage is actually quite fragile and only seems to work if all the SRBs are running exactly the same release - so it's back to big bang upgrades to maintain.
Tada! GET /alt/echo.php?msg=cfengine%20post%20start HTTP/1.0

"so what?" I hear you ask - well, it finally means that kickstart is automagically[1] installing the cfengine rpms and doing the local post install voodoo.

The problem? I needed to create a "local packages" repo for our SLC4X tree - we had one for 3.x but I needed to go and get a copy of "createrepo" to create the magic YUM metadata. Now to finish setting up cfengine to customise the node.
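
The createrepo step itself is a one-liner (the path below is illustrative, not our real tree):
# generate the repodata/ that yum needs for the local packages directory served over http
createrepo /path/to/slc4x/local-packages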

Tuesday, February 13, 2007

Spent most of yesterday performing some housekeeping on the Edinburgh dCache. See my gridpp-storage post for the gory details. In summary, draining pools is slightly easier than it was before, but still isn't for the faint of heart.

Also, I have a meeting today with the guys from EUCS. Still having problems with the SAN due to some weirdness that is going on with the partition tables on some of the volumes that they have given us. For some reason fdisk and parted keep on barfing when I try to wipe everything and start again. It's strange since this is only happening with 3 of the volumes; the remainder seem fine. Also need to configure the multipathing.

Sunday, February 11, 2007

Wow! It's been a busy time here in Melbourne, but just time to blog before getting back on the plane to Scotland.

The first week I was at the ACSW conference in Ballarat. It turned out that AUSGRID was only a 1-day event as part of that, so in fact the primary reason (excuse?) for coming turned out to be the least interesting thing. There were a few good papers as part of that, though. The keynote was by Denis Caromel, from Nice, describing how they have grid "enabled" java, in a package called ProActive. Essentially it replaces all of those awful MPI calls with nice, easy to manipulate java objects. Looks like a splendid way of doing parallel code on the grid. Probably not of huge interest in LCG, where we're mostly embarrassingly parallel, but for communities coming from a supercomputing background, or just starting out with grids, it might be a very valuable toolkit to use.

The paper I gave went down well and there was general appreciation of the scale of EGEE and the volume of work we are doing. There was some discussion on SRM vs. SRB, a theme that continued into my second week.

I met with Lyndsay Hood, the acting program manager of the Australian Partnership for Advanced Computing (APAC), for lunch. We had an interesting chat about their storage issues, where the thing that seems to put them off the EGEE solution is the X509 authentication. Too difficult for normal users, they seem to think. This is something I've heard from people in the UK as well, but no one seems to have a better answer: portals are too limiting, username/password doesn't scale. Shibboleth gets talked about all the time, but it seems to be spoken of as a panacea, rather than a working product that people can use right now.

A general comment on computer science conferences: I've had enough graph theory to last me a lifetime.

Thursday, February 08, 2007

This post is a bit late, but I figured people would be interested.

I started to re-organise Edinburgh's dCache at the end of last week. I've been meaning to do this since purchasing a QLogic HBA just before the holidays but have only now managed to find time to do the work. The new card is a QLA3242 (2Gb, dual port) and our intention is to use it to connect a dCache pool node (pool2.epcc.ed.ac.uk) directly to the University SAN via fibre channel, rather than using NFS (which we saw led to lots of problems with stale file handles and terrible write rates).

The installation went fine; kudzu did its job well and immediately picked up the new card after a reboot. We are using the qla2300 module, loading it at boot time. There were teething problems in seeing the new SCSI devices from the SAN, but these have since disappeared. Still need to do a couple of things before getting the storage into dCache. I need to properly set up multipathing (any advice would be appreciated) and also to format the 1.6TB volumes. Each volume is its own independent set of disks, which eases the configuration issues slightly since I don't have to worry about simultaneous reading/writing to the same disk.
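
For the record, loading the driver at boot is just the usual module alias - a sketch, where the config file name depends on the kernel version and the adapter number is an assumption:
# /etc/modules.conf (or modprobe.conf on 2.6 kernels) - load the QLogic driver at boot
alias scsi_hostadapter1 qla2300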

All of our IBM disk is now hanging off of pool1 again and this is currently our only gridftp door. Strangely, I've not seen any CLOSE_WAITs recently....

Thursday, February 01, 2007

Spent some of the week doing sysadmin work on the new cluster. Discovered that not all RAID cards are configured equal, thanks to a set of scripts using Net-SNMP. Although Areca are kind enough to supply the MIB files for the RAID cards, they're badly written and contain underscores in the definitions. I'd already discovered a workaround last week, but as I've just installed Nagios on the master node (together with the magic mod-ssl incantation to make parts of the site only accessible to named DNs) I discovered that the check_snmp plugin doesn't work with it.

One quick hack of the source file to add in
asprintf (&command_line, "%s -Pu -t %d -r %d -m %s -v %s %s %s:%s %s",

and it works as well as can be expected. Raised a sf.net ticket to try and get something better propagated upstream.

Now working on a Makefile and a pile of shell scripts to automagically generate a set of nagios configs to check each node, before Paul integrates it all with MonAMI. Crib notes and more info are available on the GridPP Nagios Page.