After investigation we noticed the 'cannot download .BrokerInfo from' error. A quick look on a node showed that the .BrokerInfo file in /tmp was owned by another user rather than ops. Running strace -f -p NNNN against the globus-url-copy process showed the ops job getting a permission denied error when trying to create/copy the file. A look at past CE-sft-broker tests showed a very clear difference: there was a missing directory!
-rw-r--r-- 1 sgmops001 opssgm 3085 Aug 31 05:06 /tmp/https_3a_2f_2fwms208.cern.ch_3a9000_2fElSbIsNqd8SN69eCXPN1JA/.BrokerInfo
-rw-r--r-- 1 sgmops001 opssgm 2312 Sep 1 22:34 /tmp/.BrokerInfo
Removing this file allowed the ops test to run, but why it was happening was still a mystery. A workaround we have deployed is to create an additional directory in cp_1.sh, i.e.
# Workaround for gLite WMS jobs, which don't cd into EDG_WL_SCRATCH...
echo In cp_1.sh
echo Making temporary work directory
templ="$TMPDIR/glite_run_XXXXXXXX"
temp=$(mktemp -d "$templ")
echo "Changing work directory to $temp"
cd "$temp"
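A pre-flight check along the following lines could flag the problem before a job tries to fetch its .BrokerInfo. This is only a sketch: the function name brokerinfo_blocked is made up for illustration, and it assumes bash and that "owned by someone other than the current user" is a good enough test for the permission denied we saw.

```shell
#!/bin/bash
# Hypothetical helper (not part of cp_1.sh or the job wrapper):
# returns 0 when DIR/.BrokerInfo exists but is not owned by the
# user running the check, i.e. the situation that blocked our ops jobs.
brokerinfo_blocked() {
    local f="${1:-/tmp}/.BrokerInfo"
    # -e: file exists; -O: file is owned by the effective user ID
    [ -e "$f" ] && [ ! -O "$f" ]
}

if brokerinfo_blocked /tmp; then
    echo "WARNING: /tmp/.BrokerInfo is owned by another user" >&2
    ls -l /tmp/.BrokerInfo >&2
fi
```

The check only reports; it does not try to remove the file, since a job running as an ordinary user would normally not be allowed to delete someone else's file from /tmp anyway.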
In the end we had to remove every blocking .BrokerInfo file from /tmp across the cluster, and ops jobs started passing again. Further digging showed that the job wrapper had changed somewhere along the line. The old job wrapper had code like this in it.
#if [ ${__job_type} -eq 0 -o ${__job_type} -eq 3 ]; then # normal or interactive
newdir="${__jobid_to_filename}"
mkdir ${newdir}
cd ${newdir}
#elif [ ${__job_type} -eq 1 -o ${__job_type} -eq 2 ]; then # MPI (LSF or PBS)
#fi
This has now been removed, which could be causing issues for other sites. Torque and SGE have functionality to ring-fence every job; perhaps we would have been safer using it, but running jobs from /tmp worked for 3 years. Not any more, it would seem.
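For other sites wanting to check whether they are affected, the hunt for blocking files on a node can be sketched roughly as below. The function name foreign_brokerinfo is hypothetical, and the sketch assumes bash and a find that supports -maxdepth (GNU find does). It only lists candidates and leaves the removal to the admin.

```shell
#!/bin/bash
# Hypothetical helper: print DIR/.BrokerInfo if it exists and is NOT
# owned by the numeric user ID given as the second argument.
foreign_brokerinfo() {
    local dir="$1" uid="$2"
    # -maxdepth 1 keeps the search to DIR itself, so per-job
    # subdirectories holding their own .BrokerInfo are left alone.
    find "$dir" -maxdepth 1 -type f -name .BrokerInfo ! -uid "$uid" -print
}

# Report only; review the output before removing anything.
foreign_brokerinfo /tmp "$(id -u)"
```

Run as the account the ops jobs map to, this prints /tmp/.BrokerInfo only when somebody else owns it. Actually deleting the file will usually need root, since /tmp normally has the sticky bit set.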