Monday, February 04, 2008

cluster glue

hmm. Freudian? I originally typed 'cluster clue' as the title.

Regular readers will be aware that we run both ganglia and cfengine. However, even our wonderful rebuild system (YPF) doesn't quite close off all the holes in the fabric monitoring. Case in point: I reimaged a few machines and noticed that ganglia wasn't quite right. It had copied in the right gmond.conf for that group of machines but hadn't checked that the cluster was listed in the main gmetad.conf as a data_source.

Cue a short Perl script (soon to be available on the scotgrid wiki) to do a sanity check, but it's this sort of lack of joined-up-ness between all the bits that really annoys me about clusters and distributed systems.
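
The Perl script itself will appear on the wiki, but a rough Python sketch of the kind of sanity check I mean looks like this (the gmetad.conf path and the cluster names are just placeholders):

#!/usr/bin/env python
# Rough sanity check: every cluster we push a gmond.conf out for should
# also appear as a data_source in gmetad.conf. Path and cluster names
# below are placeholders, not our real config.
import re
import sys

GMETAD_CONF = "/etc/gmetad.conf"                     # assumed location
EXPECTED_CLUSTERS = ["workers", "servers", "disks"]  # hypothetical groups

def data_sources(path):
    """Return the cluster names listed as data_source entries."""
    sources = []
    for line in open(path):
        m = re.match(r'\s*data_source\s+"([^"]+)"', line)
        if m:
            sources.append(m.group(1))
    return sources

if __name__ == "__main__":
    present = data_sources(GMETAD_CONF)
    missing = [c for c in EXPECTED_CLUSTERS if c not in present]
    if missing:
        print "Missing data_source entries: %s" % ", ".join(missing)
        sys.exit(1)
    print "gmetad.conf lists all expected clusters"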

Are there any better tools? (Is Quattor the saviour for this type of problem?)

/rant

Saturday, February 02, 2008

nagios event handlers

I've gone over to the Dark Side (no, not Python) and have just implemented my first nagios event handler. This *should* automatically try and fix the problem we have with our Dirvish backup scripts, namely that we end up with too many copies of the database dumps being held.

So, cue nagios' event handlers. The only issue is that nagios (and hence the event handler) runs as user nagios, and most sysadmin stuff needs root. If you're willing to trust it with sudo then it should be OK.
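
For flavour, here is a hedged sketch of the shape such a handler can take - this is not our actual script, and the dump directory, retention count and sudo rule are made up:

#!/usr/bin/env python
# Sketch of a nagios event handler: nagios invokes it with
# $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ as arguments,
# and on a confirmed failure it prunes old database dumps via sudo.
# DUMP_DIR and KEEP are hypothetical values.
import os
import sys

DUMP_DIR = "/var/backups/dbdumps"   # hypothetical dirvish dump area
KEEP = 7                            # hypothetical number of copies to keep

def prune():
    dumps = os.listdir(DUMP_DIR)
    dumps.sort()                    # date-named files, so oldest sorts first
    for old in dumps[:-KEEP]:
        # nagios runs as 'nagios', so root-only work has to go via sudo
        os.system("sudo /bin/rm -f %s" % os.path.join(DUMP_DIR, old))

if __name__ == "__main__":
    if len(sys.argv) < 3:
        sys.exit(3)
    state, state_type = sys.argv[1], sys.argv[2]
    # only act on a HARD problem state, not a transient soft failure
    if state == "CRITICAL" and state_type == "HARD":
        prune()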

Friday, February 01, 2008

DPM Storage Token Information Published

It was a bit fiddly, but with 4 fixes to Michel's script, Glasgow are publishing the space token information for ATLASDATADISK.

I documented the workarounds in the LCG Twiki.

Thursday, January 31, 2008

perl-TermReadKey - missing in action

We had a ticket from a Zeus user unable to get a file off our DPM, while her Zeus colleagues could. I spent a long time checking the pool accounts, which were all fine, and checking the Zeus VOMS setup, which was also fine.

Finally, I looked in the logs for the grid-mapfile, where the culprit lay:

"Can't locate Term/ReadKey.pm in @INC..."

On two of the servers, disk034 and disk036, the perl-TermReadKey RPM was missing, and the grid-mapfiles had not been rebuilt for a very long time - judging from the backups, October was the last time they were remade!

OK, nagios check: age of grid-mapfile and lcgdm-mapfile!
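
Something along these lines would do it - the thresholds and the lcgdm-mapfile path are guesses, so adjust to taste:

#!/usr/bin/env python
# Sketch of a nagios check: warn/critical if the grid-mapfile or
# lcgdm-mapfile has not been regenerated recently. The lcgdm-mapfile
# path and the thresholds are assumptions.
import os
import sys
import time

FILES = ["/etc/grid-security/grid-mapfile",
         "/opt/lcg/etc/lcgdm-mapfile"]
WARN_HOURS, CRIT_HOURS = 12, 48

def age_hours(path):
    return (time.time() - os.stat(path).st_mtime) / 3600.0

if __name__ == "__main__":
    worst = 0
    for f in FILES:
        try:
            age = age_hours(f)
        except OSError:
            print "MAPFILE UNKNOWN - %s missing" % f
            sys.exit(3)
        if age > CRIT_HOURS:
            worst = max(worst, 2)
        elif age > WARN_HOURS:
            worst = max(worst, 1)
    print ["MAPFILE OK", "MAPFILE WARNING", "MAPFILE CRITICAL"][worst]
    sys.exit(worst)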

Down yum! Down!

As reported previously, we discovered that the nightly yum update was enabled on the servers.

The magic cfengine snippet to disable this is:


groups:
    # we don't want auto yum update stuff
    nonightlyyum = ( `/usr/bin/test -f /var/lock/subsys/yum` )

shellcommands:
    nonightlyyum::
        "/sbin/chkconfig yum off" umask=022
        "/sbin/service yum stop" umask=022


The same check-for-an-enabled-subsys trick can be used to disable many of the periodic jobs that run on a standard SL install (checking for the lock file is what 'service foo status' does for many of them).
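
The same idea in script form, for anything else you want gone - the service list here is purely illustrative:

#!/usr/bin/env python
# For each unwanted periodic service: if its /var/lock/subsys file exists
# it is enabled, so switch it off and stop it. Needs to run as root.
# The list of services is just an example.
import os

UNWANTED = ["yum", "sysstat"]    # example services only

for svc in UNWANTED:
    if os.path.exists("/var/lock/subsys/%s" % svc):
        os.system("/sbin/chkconfig %s off" % svc)
        os.system("/sbin/service %s stop" % svc)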

Wednesday, January 30, 2008

SRM2.2 Configuration for FDR/CCRC

I configured Glasgow's DPM for the SRM 2.2 space tokens required by ATLAS for FDR/CCRC:

svr018:~# dpm-reservespace --gspace 10T --lifetime Inf --group atlas --token_desc ATLASDATADISK
ab0f1a60-59d6-4099-82fa-a17711678860

Easy, eh!

I notice that there is no dpm-listspaces command, which means that one has to grub around inside the database to find out what spaces are currently defined.
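
For the record, the sort of poke at the database I mean looks roughly like this - the table and column names are from memory and may well differ between DPM versions (and the sizes are assumed to be in bytes), so check the schema before trusting it:

#!/usr/bin/env python
# Crude stand-in for a dpm-listspaces command: query the DPM MySQL
# database directly. Table/column names and credentials are placeholders.
import MySQLdb

db = MySQLdb.connect(host="localhost", user="dpm_reader",
                     passwd="secret", db="dpm_db")
cur = db.cursor()
cur.execute("SELECT u_token, poolname, t_space, u_space FROM dpm_space_reserv")
for desc, pool, total, unused in cur.fetchall():
    # assuming sizes are stored in bytes
    print "%-20s pool=%-10s total=%dG free=%dG" % (
        desc, pool, total / 2**30, unused / 2**30)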

Two additional notes for other T2s:

  1. Transfers from DDM are done using a vanilla atlas proxy for now (belonging to Mario Lassnig), so make sure the token is writable by the atlas group, not, e.g., atlas/Role=production.
  2. All that is needed for CCRC is 3TB; however, this is based on a 1 week cleaning cycle. If, like Glasgow, you have lots of space, then making the area bigger means the cleaning is not critical. (The space can be updated later with dpm-updatespace.)

Tuesday, January 29, 2008

YAIM stale...

Another problem I found while configuring the new VOs was that the reconfiguration of the information system on the CE failed. Mike had run config_gip, as has been done forever, but it did nothing, so queues and access rights for the VOs were not published.

There are new YAIM functions, like config_gip_ce, as well as a whole different way of structuring site-info.def. I fiddled with various options in the site-info.def file and even in the YAIM function itself, but I couldn't get it to work properly at all.

In some desperation I hacked the ldif file in the end, which is not a long-term option; we really now need to look at the whole way that we structure YAIM for the site (as well as plan migrations to SL4 for the BDII, CE and DPM...).

OPS goes dark...

Following enabling other VOs yesterday, DPM broke for ops. The error messages were as cryptic as ever:

httpg://svr018.gla.scotgrid.ac.uk:8443/srm/managerv1: Unknown error

And in the SRM logs:

01/29 12:29:02 14830,3 srmv1: SRM02 - soap_serve error : Can't get req uniqueid
01/29 12:05:18 14830,0 srmv1: SRM02 - soap_serve error : CGSI-gSOAP: Could not find mapping for: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/CN=582979/CN=Judit Novak

The error seemed to correspond to re-running config_mkgridmap yesterday; however, as Judit (and other ops people) were in the grid-mapfile, I was very confused.

Eventually, staring at lcgdm-mkgridmap.conf, I realised that the ops VO was configured to get VOMS information only from the deprecated lcg-voms.cern.ch server. I reconfigured it to use voms.cern.ch and it started to work.
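
A quick sketch of the check I wish I had already had, which just flags any VO still pointing at the old server - the conf path and the edg-mkgridmap-style "group vomss://..." line format are assumptions:

#!/usr/bin/env python
# Flag any VO in lcgdm-mkgridmap.conf that still uses the deprecated
# lcg-voms.cern.ch server. Conf path and line format are assumptions.
import re

CONF = "/opt/lcg/etc/lcgdm-mkgridmap.conf"    # assumed location
DEPRECATED = "lcg-voms.cern.ch"

for line in open(CONF):
    if line.startswith("group") and DEPRECATED in line:
        m = re.search(r'vomss://[^/]+/voms/(\S+)', line)
        vo = m and m.group(1) or "unknown"
        print "VO %s still uses %s: %s" % (vo, DEPRECATED, line.strip())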

The thing I cannot fathom is how it kept working for so long - we have always had lcg-voms.cern.ch as the server for ops.

I updated the ops entry on the GridPP Wiki.

As usual, the things which made this much harder than it should have been were:

1. It only affected the ops VO (not dteam, atlas or pheno which we can test).

2. The error message was, as usual, cryptic and unhelpful.

Keeping up with the Joneses

Well, we recently had an incident with our NFS server for the cluster (home / software) locking up and needing a cold power cycle. Due to $vendor's setup this takes aaaages (in the order of 20 mins) to go through the BIOS self-check (it hangs at 053C). $vendor would like to poke around the system and perhaps perform a BIOS upgrade. Hmm. Oh well, all 10 disk servers are identical, so we'll just drain one down and play - it also gives us a chance to upgrade DPM (from 1.6.5) to the latest 1.6.7-mumble.


... or so we thought.

disk032:~# rpm -qa | grep DPM
DPM-gridftp-server-1.6.7-1sec
DPM-rfio-server-1.6.7-2sec.slc3
DPM-client-1.6.7-2sec.slc3


"That's odd - Graeme, have you updated these?" Nope. Turns out that the yum.nightly cron was auto-updating on both the disk servers and some of the grid servers... Gaaah. Clickity click and we're all ready to play.

In the meantime, dpm-drain migrated most of the data off the server to the other stash of disks, but there were still 69 files that failed with 'Internal error'. I'm looking through the DB to try and see if I can pull any more info out.

Monday, January 28, 2008

Take Me To Your VO...

As noted previously, we have enabled mice, vo.scotgrid.ac.uk and vo.nanocmos.ac.uk at Glasgow. Mike worked his way through the documentation, which was a bit out of date because it hadn't incorporated the changes in VO pool account management (which has become much better) that Andrew implemented.

We also plan to set up VOs much more promptly now for local user groups. Now that we have a scotgrid VOMS server this will be much easier.

The wiki has all the updated details.

Resource Broker Blues

Our resource broker was down for the weekend as the network service stalled. The root cause turned out to be a bit of over-aggressive cleaning from cfengine. I had wanted to do a better job of cleaning up the /tmp area in the cluster - each worker node had hundreds of condor_g working directories lying around, with nothing in them. cfengine's "tidy" leaves directories alone by default and only cleans files, so I enabled the "rmdirs=sub" option - it works beautifully and gets rid of all the cruft in /tmp. So pleased was I that I disengaged my brain and set this option on /home as well - good idea to clean up those old gass cache areas, isn't it? Well, almost - unfortunately /home has subdirectories which are the pool account home areas, and unused pool accounts fall into the "clean me up" category. All the untouched pool areas then vanished.

This caused a number of people to start getting "unspecified grid manager errors" on globus-job-runs, as well as wiping out the edguser home area on the RB, which caused the network server to go into crisis.

It didn't take long to work out what had happened, but fixing it took a while as the resource broker seemed to be quite huffy afterwards.

The only plus side was that I enabled the mice, scotgrid and nanocmos VOs on the RB.

Saturday, January 05, 2008

Happy New Year ScotGrid - now with added ECDF...


Well, we didn't get it quite as a Christmas present, but the combined efforts of the scotgrid team have managed to get ECDF green for New Year.

In the week before Christmas, Greig and I went through a period of intensive investigation as to why normal jobs would run but SAM jobs would not. Finding that jobs which fork a lot, like SAM jobs, would fail was the first clue. However, it turned out that the problem was not a fork or process limit but a limit on virtual memory size. SGE can set a VSZ limit on jobs, and the ECDF team have set this to 2GB, which is the amount of memory they have per core. Alas, for jobs which fork, virtual memory is a huge overestimate of their actual memory usage (my 100-child python fork job registers ~2.4GB of virtual memory but uses only 60MB of resident memory). That's roughly a 40-fold overestimate of memory usage!
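
For the curious, a minimal sketch of the sort of fork test involved (not the exact job, and the numbers will differ from machine to machine):

#!/usr/bin/env python
# Fork N sleeping children, then total VSZ and RSS for the whole family
# from /proc. The gap between the two numbers is the point.
import os
import signal
import time

N = 100
children = []
for i in range(N):
    pid = os.fork()
    if pid == 0:
        time.sleep(60)          # child just sits there
        os._exit(0)
    children.append(pid)

def mem_kb(pid, field):
    # VmSize: / VmRSS: lines in /proc/<pid>/status are reported in kB
    for line in open("/proc/%d/status" % pid):
        if line.startswith(field):
            return int(line.split()[1])
    return 0

pids = [os.getpid()] + children
vsz = sum([mem_kb(p, "VmSize:") for p in pids])
rss = sum([mem_kb(p, "VmRSS:") for p in pids])
print "%d children: VSZ %.0f MB, RSS %.0f MB" % (N, vsz / 1024.0, rss / 1024.0)

for p in children:
    os.kill(p, signal.SIGTERM)
    os.waitpid(p, 0)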

As SAM jobs do fork a lot, they hit this 2GB limit and are killed by the batch system, leading to the failures that plagued us.

A workaround, suggested by the systems team, was to submit ops jobs to the ngs queue, a special short-running test queue (15 min wall time) with no VSZ limit on it.

Greig modified the information system to publish the ngs queue and ops jobs started to be submitted to this queue on the last day before the holidays.

Alas, this was not quite enough to get us running. We didn't find out until after New Year that we also needed to specify a run time limit of 15 minutes on the jobs and submit them under a non-standard project. The last step required me to hack the job manager in a frightful manner, as I really couldn't fathom how the perl (yuk!) job manager was supposed to set the project - in fact, even though project methods existed, they didn't seem to emit anything into the job script.

Finally, with that hack made this morning, ECDF started to pass SAM tests. A long time a-coming, that one.

The final question, however, is what to do about this VSZ limit. The various wrappers and accoutrements which grid jobs bring mean that before a line of user code runs there are about 10 processes running, and 600MB of VSZ has been grabbed. This is proving to be a real problem for local LHCb users, because ganga forks a lot and also gets killed off. Expert opinion is that VSZ limits are just wrong.

We have a meeting with the ECDF team, I hope, in a week, and this will be our hot topic.

Big thanks go to Greig for a lot of hard work on this, as well as Steve Traylen, for getting us on the right track, and Kostas Georgiou, for advice about the perils of VSZ in SGE.

Thursday, December 20, 2007

ECDF Progress

We finally seem to be homing in on the problems at ECDF. Any job which forks off too many processes seems to die in the batch system. Launching a simple fork python job works fine at 20 children, but dies at 50. The same task at Glasgow runs happily with 100 children.

I can see there is no ulimit issue, but something is unhappy. We must track down if it is SGE or some gatekeeper weirdness.

Thursday, December 13, 2007

Grid-Monitoring Nagios tests


Finally (after a long interval) I reinvestigated getting the LCG Grid Service Monitoring Working Group nagios tests installed at Glasgow.

I had tried once before, but it needed nagios on our UI. This time I added a UI to our nagios host (nice and simple - I just added the hostname into the relevant UI group in cfengine). It works fairly well - I've got it installed, polling the SAM tests via the SAM API and picking up the results. I still need to get certificate proxy renewals working, and to merge the records with our existing definitions for the hosts (we use non-qualified names, whereas the wlcg.cfg uses FQDNs).

As the screenshot shows, we've already got a lot of green, and if we can nail the cert problems I'll switch it over to use the normal notification system.

Monday, December 10, 2007

Glasgow joins NGS as partner site


Glasgow was approved as an NGS partner site at the last NGS MB meeting. It seems it took us a long time to get here, but with the addition of gsissh login we had a full service which would be useful to NGS members. Our test record for the NGS Inca test suite was good.

David is drafting detailed documentation for NGS members, so watch this space...

One bad apple, sitting in a rack...

Andrew did great work getting all the nodes back on line and dealing with the quirks of cfengine and reconfiguring everything.

Unfortunately we failed 2 SAM tests with the infamous "Cannot read JobWrapper output, both from Condor and from Maradona". I checked the torque logs and both of these tests ran on node016 - so it looked like this was the bad apple.

When I checked, it was clear that yaim had not run, so the PATH was bad (perhaps this was before Andrew fixed cfengine?).

Quick spin with cfengine and "-Drunyaim" and the node was good again.

The existence of links from /etc/profile.d/grid-env.{csh,sh} is an excellent proxy for YAIM having run correctly, so we should implement this as a cfengine test.
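
A trivial sketch of what that test could look like, with nagios-style exit codes (it could just as easily sit behind a cfengine shellcommand):

#!/usr/bin/env python
# YAIM is taken to have run correctly if the grid-env profile scripts exist.
import os
import sys

files = ["/etc/profile.d/grid-env.sh", "/etc/profile.d/grid-env.csh"]
missing = [f for f in files if not os.path.exists(f)]
if missing:
    print "YAIM CRITICAL - missing %s" % ", ".join(missing)
    sys.exit(2)
print "YAIM OK - grid-env profile scripts present"
sys.exit(0)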

140 worker nodes sitting in the rack...

140 worker nodes sitting in the rack, and if one small worker node should accidentally kernel panic, there'll be 139 worker nodes sitting in the rack. "WooHoo!" For the first time in AGES, pbsnodes -l isn't listing any nodes as offline or down. Admittedly I have brought back into service one node that kernel panicked - but $vendor is taking soooo long to get back to us about faults that we may as well see how long it takes before it dies again.

So in summary - Glasgow is batting at 560 for nought so far :-)

Friday, December 07, 2007

Java and glite-WN

After sorting out the hassle with cfengine earlier, I still couldn't get nodes to build from scratch without intervention. glite-WN (3.1.1.0) wasn't installing due to bouncycastle's failed dependency on java. Steve T to the rescue (again) with his notes on JPackage. Sun have updated the JDK to 1.5.0.14 (there are quite a few CVE vulns before -13), but the JPackage spec is only available for -13. Luckily, the CVS log of the spec file (available here) shows that only the version number is bumped between -13 and -14, so I patched it and compiled our own java-1.5.0-sun package for the cluster.

Name        : java-1.5.0-sun              Relocations: (not relocatable)
Version     : 1.5.0.14                    Vendor: (none)
Release     : 1jpp                        Build Date: Fri 07 Dec 2007 16:43:26 GMT
Install Date: (not installed)             Build Host: svr031.gla.scotgrid.ac.uk
Group       : Development/Interpreters    Source RPM: java-1.5.0-sun-1.5.0.14-1jpp.nosrc.rpm
Size        : 71332405                    License: Sun Binary Code License
Signature   : (none)
Packager    : Scotgrid Glasgow
URL         : http://java.sun.com/j2se/1.5.0
Summary     : Java Runtime Environment for java-1.5.0-sun
Description :
This package contains the Java Runtime Environment for java-1.5.0-sun

It seems OK on the worker nodes - the three I rebuilt went straight to glite-WN 3.1.1.0 without any hassle - just before the announcement that they've bumped the version number to 3.1.1.1. Typical.

cfengine hassle

We've had a strange problem with cfengine recently - it wasn't creating the symlink /usr/java/current -> /usr/java/java_version_whatever. The log file complained of:

Error while trying to link /usr/java/current -> /usr/java/current

which is most odd as I was trying to link it to $(javaver) which evaluates to jdk1.5.0_12.

After much faffing about, it turns out to be a bug in cfengine - it tries to create a circular link for the 1st item in the list but works fine for the 2nd one. As a result our cfengine config is now:

links:
    java::
        # BUG IN CFENGINE - it will fail the 1st line but do the 2nd OK
        /usr/java/current_ver ->! /usr/java/$(javaver) type=relative
        /usr/java/current ->! /usr/java/$(javaver) type=relative


which is ugly, hacky and just Wrong.

Tuesday, December 04, 2007

ECDF networking troubles

Just a quick post to say that there was a problem with one of ECDF's network switches on Sunday morning just before 2am that brought down GPFS and also the NFS mounts on the CE. Luckily we're not using GPFS for the Grid storage, but the CE did have problems. The systems team are looking into ways of making things more resilient to failures like this in the future.