Monday, December 10, 2007

One bad apple, sitting in a rack...

Andrew did great work getting all the nodes back online, dealing with the quirks of cfengine and reconfiguring everything.

Unfortunately we failed two SAM tests with the infamous "Cannot read JobWrapper output, both from Condor and from Maradona". I checked the torque logs and both of these tests had run on node016, so it looked like this was the bad apple.

When I checked, it was clear that YAIM had not run on the node, so the PATH was bad (perhaps this was before Andrew fixed cfengine?).

A quick spin with cfengine and "-Drunyaim", and the node was good again.

The existence of links from /etc/profile.d/grid-env.{csh,sh} is an excellent proxy for YAIM having run correctly, so we should implement this as a cfengine test.
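Until that check lives in cfengine itself, the same idea can be sketched as a small standalone Python script. This is only an illustration: the function names are made up here, and it assumes that both grid-env files are symlinks whose targets exist once YAIM has finished.

```python
import os

# Assumption: a successful YAIM run leaves both of these behind as symlinks.
GRID_ENV_LINKS = (
    "/etc/profile.d/grid-env.sh",
    "/etc/profile.d/grid-env.csh",
)


def missing_grid_env_links(paths=GRID_ENV_LINKS):
    """Return the paths that are not symlinks pointing at an existing target."""
    missing = []
    for path in paths:
        # islink() is true even for a dangling link, so also require that
        # the link resolves to something (exists() follows symlinks).
        if not (os.path.islink(path) and os.path.exists(path)):
            missing.append(path)
    return missing


def yaim_has_run(paths=GRID_ENV_LINKS):
    """Hypothetical helper: True when every expected link is in place."""
    return not missing_grid_env_links(paths)
```

On a worker node, calling yaim_has_run() with no arguments checks the two default paths; a False result would be the cue to rerun cfengine with "-Drunyaim".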

