Friday, November 10, 2006

Why do we need cfengine? I got an email from Rod Walker (ATLAS) alerting us to the fact that two workers had become job blackholes. Investigation showed they didn't have CA certificates installed, so they could not globus-url-copy a job's output back to the gatekeeper.

Now, I did run a distributed shell install of lcg-CA on these nodes, but with 100 batch workers I didn't notice that it had somehow failed on 2 of them.

With cfengine we have built into the system:

packages:
   grid::
      lcg-CA action=install
      glite-yaim action=install elsedefine=runyaim

So it will check every hour if lcg-CA is properly installed and install it for us if it is not.

Note also how we can define the "runyaim" class when yaim is first installed: this runs yaim automatically after the metapackage is installed, which in turn triggers the switching on of the batch system.
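To make the "runyaim" step concrete, here is a rough sketch of how the class might be picked up later in the same run. This is not our actual configuration: the wrapper script path and the timeout value are made up for illustration, and the package manager settings that the packages section needs in control: are left out.

```
control:
   # packages must run before shellcommands so that the class is set in time
   actionsequence = ( packages shellcommands )

packages:
   grid::
      # runyaim is defined only when glite-yaim was found to be missing
      glite-yaim action=install elsedefine=runyaim

shellcommands:
   runyaim::
      # Hypothetical wrapper script; substitute the site's real yaim invocation
      "/usr/local/sbin/run-yaim-wn.sh" timeout=600
```

On nodes where glite-yaim is already installed the class is never defined, so the command is skipped on the regular hourly runs.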

More details in the wiki soon...
