There's a preamble to this post which isn't really to do with ScotGrid directly, but has some bearing on subsequent events. If I tell you the actors were a glass of milk, a three year old and my laptop keyboard then you can doubtless assemble the plot yourself. So yesterday some important keys were completely non-functional on my laptop. One of these keys was a character in my password, so I wasn't able to log in at all. Despite my best MacBook disassembling and cleaning efforts I couldn't recover the keyboard: there seems to be gunk etched onto the conducting keyboard sheet. Now that I am back in the office I can use a USB keyboard, but I need to get a replacement MacBook keyboard quickly.
Anyway, the upshot is that my online activities were squeezed into an ancient IBM R32 laptop running XP with 256MB of RAM. So I was not watching (bookmarks were effectively unavailable) when the CE ran out of disk space and we went belly up for SAM tests yesterday lunchtime.
The problem was easily identified this morning and, thanks to the magic of LVM, corrected. When the machines were set up, /var was not separate from /, so we've now added a separate /var logical volume with 30GB in it.
We looked into why this did not trigger an alarm, and it seems that there was a mistake in the Nagios configuration, which was checking free inodes rather than free space. Corrected now.
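To make the inode versus space distinction concrete, here is a minimal Python sketch. It is not our Nagios check or its configuration; the function names and the 10% threshold are made up purely for illustration. It reads both numbers with `os.statvfs` and shows how a filesystem can look perfectly healthy on inodes while being almost out of space, which is the situation the CE was in.

```python
import os


def free_percentages(path):
    """Return (free_space_pct, free_inode_pct) for the filesystem holding path.

    Either value is None if the filesystem reports a total of zero,
    which some filesystems do for inodes.
    """
    st = os.statvfs(path)
    # f_bavail / f_blocks: blocks available to unprivileged users / total blocks
    space_pct = 100.0 * st.f_bavail / st.f_blocks if st.f_blocks else None
    # f_favail / f_files: inodes available to unprivileged users / total inodes
    inode_pct = 100.0 * st.f_favail / st.f_files if st.f_files else None
    return space_pct, inode_pct


def evaluate(space_pct, inode_pct, warn_pct=10.0):
    """Return a list of warnings; warn_pct is an illustrative threshold."""
    problems = []
    if space_pct is not None and space_pct < warn_pct:
        problems.append("low free space: %.1f%%" % space_pct)
    if inode_pct is not None and inode_pct < warn_pct:
        problems.append("low free inodes: %.1f%%" % inode_pct)
    return problems


if __name__ == "__main__":
    space, inodes = free_percentages("/var")
    for line in evaluate(space, inodes) or ["OK"]:
        print(line)
```

A check that only looks at the second number would have stayed quiet here: `evaluate(2.0, 95.0)` flags the space problem, but the inode figure on its own is well above any sensible threshold.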
I also took the opportunity to re-run YAIM on the CE and fix the broken VOMS mappings. My atlas production role now maps correctly to a prdatlas account. This might mean that the stack of jobs running under my "local" gla012 account fails to clear from globus, but the actual production system doesn't require any outputs from globus, so the real payload will run fine.
Sadly, the segfaulting on the CE still happens. However, it seems that the problem was spotted earlier and that a fix is on the way: https://savannah.cern.ch/bugs/?35981.