Tuesday, May 27, 2008

So, how was it for you...?

I made some comments on Glasgow's site availability from October last year until now. It was Q1 this year that was very hard on us (strictly the last week of Feb until the end of March) - we lost 5 days to the CE crash and reinstall of core nodes, then were pestered by minor networking and user-level problems. All of these conspired to reduce our availability to 83% for that quarter.

However, we seem to be very much on top of things now, with 95% so far for Q2 and 100% in the last week (we were lucky that our servers had certificates from the old CA though).

I suppose this also accounts for the lack of blog entries this month - things are going well.

Full steam ahead, Captain. The weather is fine...

Thursday, May 08, 2008

liblcas_lcmaps fix

There is a fix for the segfaulting bug I reported against the globus gatekeeper and gridftp server, https://gus.fzk.de/pages/ticket_details.php?ticket=35694.

If you download the patched version of liblcas_lcmaps_gt4_mapping.so.0.0.0 and install it to /opt/glite/lib then the problem is resolved.
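Installing it is just a case of swapping the shared object in place - roughly the following, keeping a backup of the original (paths as on our gLite CE; long-running services may need a restart to pick up the new library):

cp -p /opt/glite/lib/liblcas_lcmaps_gt4_mapping.so.0.0.0 \
      /opt/glite/lib/liblcas_lcmaps_gt4_mapping.so.0.0.0.orig   # keep the old one
cp liblcas_lcmaps_gt4_mapping.so.0.0.0 /opt/glite/lib/          # drop in the patched build
ls -l /opt/glite/lib/liblcas_lcmaps_gt4_mapping.so.0*           # check the .so.0 symlink still points at it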

I applied the patch last night and since then we haven't suffered a single segfault.

Maarten said that the official fix should reach production in the next couple of weeks, but I'm happier having it on site now.

Wednesday, May 07, 2008

CRHell...

So, even after fixing our CE disk space problems we were still failing SAM tests. The error was one I'd never seen before: "7 an authentication operation failed".

The GOC wiki hinted at a few things which can cause this, but as it only seemed to affect SAM tests (Steve Lloyd's jobs and my ATLAS production were both running fine) there wasn't a lot that could really be debugged locally - we had even had a successful SAM test from Rafa's SAM Admin interface at 11am.

One of the classic X509 errors, though, is CRLs being out of date. When I checked the CRL files on the CE it was clear something was amiss. A few were dated today, but most were approaching 5 days old. When I ran the CRL updater by hand I got the error:

fetch-crl[19144]: 20080507T154251+0100 updating CRL 'CERN Trusted Certification Authority (1d879c6c)'
fetch-crl[19144]: 20080507T154251+0100 File /etc/grid-security/certificates//1d879c6c.r0 valid: no
fetch-crl[19144]: 20080507T154251+0100 Attempt to overwrite /etc/grid-security/certificates//1d879c6c.r0 failed since the original is not a valid CRL file

What's worse is that not only did this CRL fail to update, it caused the updater script to bomb, so no CRL after this point was even attempted!

In the end I had to delete all of the CRLs on the CE and re-run the script to get fresh copies.
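For the record, the blunt fix amounted to something like this (the fetch-crl script is normally run from cron and its exact path varies between installs, so treat this as a sketch):

rm /etc/grid-security/certificates/*.r0
fetch-crl    # re-run by hand, however your cron job normally invokes it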

What exactly happened I do not know, but the relevant command in fetch-crl is:
openssl crl -hash -in CERT_REVOKE_FILE -noout -inform PEM -text
This should produce, as its first line, the hash of the CRL, which is the same as the file name. However, testing one of the bad files I got this instead:
unable to load CRL
12697:error:0906D066:PEM routines:PEM_read_bio:bad end line:pem_lib.c:731:
So clearly they had gone evil in some way.
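With hindsight, a less drastic approach would have been to test each CRL individually and only delete the broken ones - something like this sketch, using the same openssl call that fetch-crl makes:

for crl in /etc/grid-security/certificates/*.r0; do
  if ! openssl crl -in "$crl" -inform PEM -noout >/dev/null 2>&1; then
    echo "Bad CRL: $crl"
    # rm "$crl"   # uncomment to remove it and let fetch-crl pull a fresh copy
  fi
done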

I think this is poor behaviour from the script, so I have submitted a GGUS ticket (https://gus.fzk.de/pages/ticket_details.php?ticket=36191).

X509 error messages just suck so badly.

It's all gone 'orribly wrong

OK - my bad. I spotted we failed a SAM test yesterday (got a mail from the automated alert) - didn't realise it doesn't send multiple ones if you keep failing....

I'm sure Graeme will post more, but we'd filled up / on the CE (as /var wasn't on a separate partition - it is now, and a nice healthy 30G). Puzzlingly, nagios hadn't bothered to alert that we'd gone warning at 8% free or critical at 4% free, and was reporting "OK - /0 free".

The culprit was case sensitivity in check_disk: -w is for disk space, -W is for inodes (see the sketch at the end of this post). Grr. Typo-tastic. I lowercased the offending config and let cfengine ripple it out.

While it did so I noticed that cfengine was restarting ntpd on the 3 NAT boxes (which also act as the timeservers for the cluster) - somehow it was copying first a standard and then a local /etc/ntp.conf into place each time, and restarting, as designed, on the new config file.

My bad again - we use the class 'natboxes' for that group and I'd specified any.!(master|nat)::. Changing it to any.!(master|natboxes) worked fine - no restarts since, and none of the workernodes are seeing their upstream timeservers on INIT or LOCAL.
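For anyone else bitten by this, the corrected check, as a nagios command line, looks something like this (thresholds are ours; in the standard nagios-plugins check_disk the lowercase -w/-c flags take free-space thresholds, while the uppercase -W/-K flags are the inode equivalents):

check_disk -w 8% -c 4% -p /    # warn below 8% free space, critical below 4%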

Holy lactating keyboards Batman, there goes the disk!

There's a pre-amble to this post which isn't really to do with ScotGrid directly, but has some bearing on subsequent events. If I tell you the actors were a glass of milk, a three year old and my laptop keyboard then you can doubtless assemble the plot yourself. So yesterday some important keys were completely non-functional on my laptop - one of these keys was a character in my password so I wasn't able to log in at all. Despite my best MacBook disassembling and cleaning efforts I couldn't recover the keyboard - there seems to be gunk etched onto the conducting keyboard sheet. Now that I am back in the office I can use a USB keyboard, but I need to get a new MacBook keyboard quickly.

Anyway, the upshot is that my online activities were squeezed onto an ancient IBM R32 laptop running XP with 256MB of RAM. So I was not watching (bookmarks were effectively unavailable) when the CE ran out of disk space and we went belly up for SAM tests yesterday lunchtime.

The problem was easily identified this morning and, thanks to the magic of LVM, corrected. When the machines were set up /var was not separate from /, so we've now added a separate /var logical volume with 30GB in it.
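For reference, the LVM side of the job is roughly the following (a sketch only - the volume group name and filesystem here are made up, and /var wants to be quiet while you copy it, so do this with the grid services stopped):

lvcreate -L 30G -n var vg0        # vg0 is a placeholder volume group name
mkfs.ext3 /dev/vg0/var
mkdir /mnt/newvar
mount /dev/vg0/var /mnt/newvar
cp -a /var/. /mnt/newvar/         # copy the existing contents across
umount /mnt/newvar
# add an entry for /dev/vg0/var to /etc/fstab, then
mount /var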

We looked into why this did not trigger an alarm and it seems there was a mistake in the nagios configuration, which was looking at free inodes rather than free space. Corrected now.

I also took the opportunity to re-run YAIM on the CE and fix the broken VOMS mappings. My ATLAS production role now maps correctly to a prdatlas account. This might mean the stack of jobs running under my "local" gla012 account fails to clear from globus, but the actual production system doesn't require any outputs from globus, so the real payload will run fine.
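Re-running YAIM is just the standard invocation; on a gLite 3.1 style install it's roughly this (the node type and site-info.def location depend on how your site is laid out, so treat it as a sketch):

/opt/glite/yaim/bin/yaim -c -s /path/to/site-info.def -n lcg-CE    # node type as appropriate for your CE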

Sadly, the segfaulting on the CE still happens; however, it seems that the problem was spotted earlier and a fix is on the way: https://savannah.cern.ch/bugs/?35981.

Monday, May 05, 2008

Farewell ScotGrid-Edinburgh!

Notwithstanding the problems this week at ECDF, in general the new Edinburgh resource has been working well. Consequently, maintaining the old site was just a drag on our people, with no real purpose anymore. Sam broadcast the intention to close the site a couple of weeks ago and we changed its status in the GOC to "closed".

Greig and Sam have ensured that the site's old storage is accessible through ECDF, though we've advised VOs to move their data off this SE as the hardware is ageing and unreliable.

Edinburgh is dead! Long live Edinburgh!

ECDF down for the moment

ECDF have been having real trouble with GPFS in the last week, which gave us some miserable results (23% pass rate on SAM, cf. a UK average of 75%). For the moment the systems team have suspended job submission and the site went into downtime on Friday.

This may or may not be related to the problems we see with the globus job wrapper code on ECDF, where the GPFS daemon consumes up to 300% CPU due to a strange file access pattern in the job home directory. Sam is working on installing an SL4 CE (based on the GT4 code) to see if this improves matters.

Who's watching the watcher?

We had a problem on the DPM headnode with the new VOMS certificate for Zeus not being installed. When I checked I found it was in the repository, but had not been copied to the server. What gives? It turned out that the cfengine version we had on svr018 (2.2.3-1.el4) was not defining the gridsrv class properly (through the HostRange expansion), so the grid class was not defined and consequently the vomsdir was not being checked.

I downgraded to 2.1.22 and this fixed the problem. But there is a mystery here - why does 2.2.3-1 work fine on the worker nodes?
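A useful sanity check here, for anyone seeing something similar, is to ask cfagent which classes it thinks are defined on the box - on cfengine 2 something like the line below should show them in the verbose parse output (the exact output format varies between versions):

cfagent -p -v | grep -i 'defined classes'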

I see 2.2.6 has been released. Maybe we should go back to rolling our own build of cfengine?