Wednesday, September 02, 2009

who changed the job wrapper?

It was a long night yesterday as Graeme and I tried to fix our failing ops CE tests. It started on Monday night, when SAM tests mysteriously started failing across all CEs at Glasgow and then Durham. The jobs appeared to run but just stayed in the running state until the WMS presumably killed them, and the ops tests eventually failed.

After some investigation we noticed the 'cannot download .BrokerInfo from' error. A quick look on a node showed that a .BrokerInfo file already existed in /tmp and was owned by another user rather than by the ops user. Running strace -f -p NNNN against the globus-url-copy process showed that the ops job was getting a permission denied error when trying to create/copy the file. A look at past CE-sft-broker tests showed a very clear difference: there was a missing directory!

-rw-r--r-- 1 sgmops001 opssgm 3085 Aug 31 05:06 /tmp/https_3a_2f_2fwms208.cern.ch_3a9000_2fElSbIsNqd8SN69eCXPN1JA/.BrokerInfo


-rw-r--r-- 1 sgmops001 opssgm 2312 Sep 1 22:34 /tmp/.BrokerInfo
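
A check along the following lines can flag such a blocking file on a node. This is only a sketch: the function name foreign_brokerinfo is made up for illustration, and it reads the owner from the third field of ls -ld output rather than from anything the job wrapper provides.

```shell
# Print DIR/.BrokerInfo if it exists and is NOT owned by USER.
# foreign_brokerinfo is a hypothetical helper, not part of gLite.
foreign_brokerinfo() {
    dir=$1
    user=$2
    file="$dir/.BrokerInfo"
    # Nothing to report if the file is absent.
    [ -e "$file" ] || return 0
    # Third field of "ls -ld" is the owning user.
    owner=$(ls -ld "$file" | awk '{print $3}')
    if [ "$owner" != "$user" ]; then
        echo "$file (owned by $owner)"
    fi
}
```

Running foreign_brokerinfo /tmp "$(id -un)" as the job's user prints nothing when /tmp is clean and prints the path and owner when someone else's .BrokerInfo is in the way.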

Removing the /tmp/.BrokerInfo file allowed the ops test to run, but why this was happening was still a mystery. The workaround we have deployed is to create an additional directory in cp_1.sh, i.e.

# Workaround for gLite WMS jobs, which don't cd into EDG_WL_SCRATCH...
echo "In cp_1.sh"
echo "Making temporary work directory"
# Quote the paths so an unusual $TMPDIR cannot split into several words
templ="$TMPDIR/glite_run_XXXXXXXX"
temp=$(mktemp -d "$templ")
echo "Changing work directory to $temp"
cd "$temp"
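
This workaround leaves one glite_run_* directory behind per job, so something has to remove them afterwards. A minimal clean-up sketch for cp_3.sh might look like the following. It assumes that cp_1.sh and cp_3.sh are sourced by the same shell, so that $temp is still set, and the function name cleanup_glite_run_dir is hypothetical.

```shell
# Hypothetical clean-up for cp_3.sh.
# Assumption: $temp was set by cp_1.sh in the same shell.
cleanup_glite_run_dir() {
    case "$temp" in
        */glite_run_????????)
            if [ -d "$temp" ]; then
                # Step out of the directory before deleting it.
                cd "${TMPDIR:-/tmp}" || cd /
                rm -rf -- "$temp"
            fi
            ;;
        *)
            # Refuse to delete anything that does not look like our work dir.
            echo "cp_3.sh: not removing unexpected directory '$temp'" >&2
            return 1
            ;;
    esac
}
```

cp_3.sh would then simply call cleanup_glite_run_dir. The pattern check is there so that an empty or unexpected $temp never turns into an rm -rf of the wrong path.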

In the end we had to remove every blocking .BrokerInfo file from /tmp across the cluster and ops jobs started passing again. Further digging showed that the job wrapper has changed somewhere along the line. The old job wrapper had code like this in it.

#if [ ${__job_type} -eq 0 -o ${__job_type} -eq 3 ]; then # normal or interactive
 newdir="${__jobid_to_filename}"
 mkdir ${newdir}
 cd ${newdir}
#elif [ ${__job_type} -eq 1 -o ${__job_type} -eq 2 ]; then # MPI (LSF or PBS)
#fi

This has now been removed and could be causing issues for other sites. Torque and SGE have functionality to ring-fence every job; perhaps we would have been safer using it, but running jobs from /tmp worked for 3 years. Not any more, it would seem.

5 comments:

rhcl said...

who will clean all those temporary directories? why not use $tmpdir in mom_priv? or the equiv in sge?

dug mcnab said...

That sounds like a better idea. The documented workaround requires the site to use cp_3.sh in the job wrapper to do the clean up. I shall take a look at $tmpdir in mom_priv. Cheers, Dug

dug mcnab said...

I have gone for cleanup in cp_3.sh until I have time to test the impact of using torque to sandbox each job.

rhcl said...

i moved to $tmpdir yesterday around 8pm with dgc-grid-40. and it worked fine. today i moved dgc-grid-44 to use $tmpdir as well. scratch directory seems much cleaner now. and jobs are succeeding.

dug mcnab said...

excellent news. We have had a discussion about using $tmpdir. It should work at Glasgow, but we are not sure about the NGS users of our cluster, who assume their jobs run in $HOME. Switching this will make every job run in $tmpdir. I still think we will need to do some testing on our side before using it.