Thursday, December 20, 2007

ECDF Progress

We finally seem to be homing in on the problems at ECDF. Any job which forks off too many processes seems to die in the batch system. A simple Python fork test works fine with 20 children, but dies with 50. The same task at Glasgow runs happily with 100 children.
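For anyone wanting to try the same thing, a fork test of this kind can be sketched as below. This is not the actual script we ran at ECDF, just a minimal illustration: the function name fork_children and the hold parameter are made up for the example.

```python
import os
import sys
import time


def fork_children(n, hold=1.0):
    """Fork n children that each sleep for `hold` seconds, then exit.

    Returns the number of children that exited cleanly (status 0).
    Hypothetical helper for illustration, not the script used at ECDF.
    """
    pids = []
    for _ in range(n):
        try:
            pid = os.fork()
        except OSError as err:
            # Typically EAGAIN if a process limit has been hit
            sys.stderr.write("fork failed after %d children: %s\n" % (len(pids), err))
            break
        if pid == 0:
            # Child: stay alive briefly so all children exist at once
            time.sleep(hold)
            os._exit(0)
        pids.append(pid)

    ok = 0
    for pid in pids:
        # Parent: reap each child and check how it finished
        _, status = os.waitpid(pid, 0)
        if os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0:
            ok += 1
    return ok
```

Calling fork_children(20) and then fork_children(50) inside a batch job and comparing the returned counts should show whether children are being killed or whether the fork itself is failing.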

I can see there is no ulimit issue, but something is unhappy. We must track down whether it is SGE or some gatekeeper weirdness.

Thursday, December 13, 2007

Grid-Monitoring Nagios tests


Finally (after a long interval) I reinvestigated getting the LCG Grid Service Monitoring Working Group Nagios tests installed at Glasgow.

I had tried once before, but it needed Nagios on our UI. This time I added a UI to our Nagios host instead (nice and simple - I just added the hostname to the relevant UI group in cfengine). It works fairly well - I've got it installed, polling the SAM tests via the sam-API and picking up the results. I still need to get certificate proxy renewals working, and to merge the records with our existing definitions for the hosts (we use non-qualified names, whereas wlcg.cfg uses FQDNs).

As the screenshot shows, we've already got a lot of green - and if we can nail the cert problems I'll switch it over to use the normal notification system.

Monday, December 10, 2007

Glasgow joins NGS as partner site


Glasgow was approved as an NGS partner site at the last NGS MB meeting. It took us a long time to get here, but with the addition of gsissh login for NGS members we had a full service which would be useful to them. Our test record for the NGS Inca test suite was good.

David is drafting detailed documentation for NGS members, so watch this space...

One bad apple, sitting in a rack...

Andrew did great work getting all the nodes back on line and dealing with the quirks of cfengine and reconfiguring everything.

Unfortunately we failed 2 SAM tests, with the infamous "Cannot read JobWrapper output, both from Condor and from Maradona". I checked the torque logs and both of these tests ran on node016 - so it looked like this was the bad apple.

When I checked, it was clear that yaim had not run, so the PATH was bad (perhaps this was before Andrew fixed cfengine?).

Quick spin with cfengine and "-Drunyaim" and the node was good again.

The existence of links from /etc/profile.d/grid-env.{csh,sh} is an excellent proxy for YAIM having run correctly, so we should implement this as a cfengine test.
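As a rough idea of what such a test would check, here is a sketch in Python rather than cfengine syntax. The function name yaim_links_present is made up for the example, and it assumes that a dangling link should count as a failure too.

```python
import os


def yaim_links_present(profile_dir="/etc/profile.d",
                       names=("grid-env.sh", "grid-env.csh")):
    """Return True if every grid-env file is a symlink with an existing target.

    Hypothetical helper: the real check would live in cfengine.
    """
    for name in names:
        path = os.path.join(profile_dir, name)
        # islink() checks the link itself; exists() follows it,
        # so a dangling link is reported as missing.
        if not (os.path.islink(path) and os.path.exists(path)):
            return False
    return True
```

A node where this returns False would be a candidate for another run with "-Drunyaim".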

140 worker nodes sitting in the rack...

140 worker nodes sitting in the rack, and if one small worker node should accidentally kernel panic, there'll be 139 worker nodes sitting in the rack. "WooHoo" - for the first time in AGES pbsnodes -l isn't listing any nodes as offline or down. Admittedly I have brought back into service one node that kernel panicked - but $vendor is taking soooo long to get back to us about faults that we may as well see how long it takes before it dies again.

So in summary - Glasgow is batting at 560 for nought so far :-)

Friday, December 07, 2007

Java and glite-WN

After sorting out the hassle with cfengine earlier I still couldn't get nodes to build from scratch without intervention. glite-WN (3.1.1.0) wasn't installing because of failed dependencies on java from bouncycastle. Steve T to the rescue (again) with his notes on JPackage. Sun have updated the JDK to 1.5.0.14 (there are quite a few CVE vulns before -13) but the JPackage package is only available up to -13. Luckily the CVS log of the spec file (available here) shows that the only change between 13 and 14 is the version number bump. I therefore patched and compiled our own java-1.5.0-sun package for the cluster.

Name        : java-1.5.0-sun               Relocations: (not relocatable)
Version     : 1.5.0.14                     Vendor: (none)
Release     : 1jpp                         Build Date: Fri 07 Dec 2007 16:43:26 GMT
Install Date: (not installed)              Build Host: svr031.gla.scotgrid.ac.uk
Group       : Development/Interpreters     Source RPM: java-1.5.0-sun-1.5.0.14-1jpp.nosrc.rpm
Size        : 71332405                     License: Sun Binary Code License
Signature   : (none)
Packager    : Scotgrid Glasgow
URL         : http://java.sun.com/j2se/1.5.0
Summary     : Java Runtime Environment for java-1.5.0-sun
Description :
This package contains the Java Runtime Environment for java-1.5.0-sun

It seems OK on the worker nodes - the three I rebuilt went straight to glite-WN 3.1.1.0 without any hassle, just before the announcement that they've bumped the version number to 3.1.1.1. Typical.

cfengine hassle

We've had a strange problem with cfengine recently - it wasn't creating the symlink /usr/java/current -> /usr/java/java_version_whatever. The log file complained of:

Error while trying to link /usr/java/current -> /usr/java/current

which is most odd as I was trying to link it to $(javaver) which evaluates to jdk1.5.0_12.

After much faffing about it turns out to be a bug in cfengine - it tries to create a circular link for the 1st item in the list and works fine for the 2nd one. As a result our cfengine config is now:

links:
   java::
      # BUG IN CFENGINE - it will fail the 1st line but do the 2nd OK
      /usr/java/current_ver ->! /usr/java/$(javaver) type=relative
      /usr/java/current ->! /usr/java/$(javaver) type=relative


which is ugly, hacky and just Wrong.

Tuesday, December 04, 2007

ECDF networking troubles

Just a quick post to say that there was a problem with one of ECDF's network switches on Sunday morning just before 2am that brought down GPFS and also the NFS mounts on the CE. Luckily we're not using GPFS for the Grid storage, but the CE did have problems. The systems team are looking into ways of making things more resilient to failures like this in the future.