Friday, November 17, 2006


We're back!

I had cfengine installing and configuring the CE and the batch system by early Wednesday, but neither the gatekeeper nor the information system was working at all. I then had to leave things for a while (teaching, writing exam questions and having teeth removed (ouch!)), so I didn't get a chance to look at the gatekeeper until Thursday afternoon.

I couldn't find any significant difference between the files on the old production site and the new system, so what was wrong? Eventually I attached an strace to the gatekeeper, which revealed it could not open the jobmanager-fork file. Checked the permissions and they were 0600!

Turns out that cfengine runs shell scripts with a very aggressive umask of 077, so it had created many configuration files only readable by root. Gatekeeper forks the process as the pool account and the job falls flat on its face...
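
To see why the umask matters, here is a minimal Python sketch (not anything cfengine or YAIM actually runs) that creates a file under a given umask and reports the resulting mode. The helper name mode_of_new_file and the requested mode of 0666 are my own assumptions for illustration; 0666 is what a plain shell redirection would typically ask for.

```python
import os
import stat
import tempfile


def mode_of_new_file(umask_value):
    """Create a file under the given umask and return its permission bits."""
    old_umask = os.umask(umask_value)
    try:
        with tempfile.TemporaryDirectory() as tmpdir:
            # The file name is only illustrative; any new file behaves the same.
            path = os.path.join(tmpdir, "jobmanager-fork")
            # Ask for 0666; the kernel clears whichever bits the umask names.
            fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
            os.close(fd)
            return stat.S_IMODE(os.stat(path).st_mode)
    finally:
        # Always restore the previous umask for the rest of the process.
        os.umask(old_umask)


if __name__ == "__main__":
    for mask in (0o022, 0o077):
        print("umask", oct(mask), "gives mode", oct(mode_of_new_file(mask)))
```

With a umask of 077 every group and other bit is stripped, which is exactly how a config file ends up readable by root alone.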

This is definitely a cfengine gotcha!

Judicious use of find for non-world readable files allowed me to fix things up.
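
For the record, the same kind of sweep can be sketched in Python. This is not the find command I actually ran, just an illustration of the idea; the function names find_not_world_readable and make_world_readable are hypothetical, and the policy of copying the owner's execute bit to group and other is an assumption of mine.

```python
import os
import stat


def find_not_world_readable(root):
    """Yield regular files under root that lack the world-read bit."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat so symlinks are not followed
            if stat.S_ISREG(st.st_mode) and not st.st_mode & stat.S_IROTH:
                yield path


def make_world_readable(path):
    """Add group/other read; mirror the owner's execute bit if it is set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    new_mode = mode | stat.S_IRGRP | stat.S_IROTH
    if mode & stat.S_IXUSR:
        # Scripts need to stay executable for the pool accounts too.
        new_mode |= stat.S_IXGRP | stat.S_IXOTH
    os.chmod(path, new_mode)
```

Do review the list from find_not_world_readable before changing anything: some files, private keys for example, are meant to be readable by their owner only and should be left alone.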

The information system was suffering from the same problem, as it was unable to execute the lcg-info-generic script.

It seems this was also causing job aborts for grid jobs on the WNs, as they too were running YAIM with the bad umask. I blasted them with kickstart and cfengine this morning and reinstalled the whole cluster in < 1 hour. Cool!
