Unfortunately this is a known problem with the SAM CE JobWrapper tests and R-GMA. We use the R-GMA command-line utility to publish a small piece of data from worker nodes, but R-GMA sometimes hangs for a long time.
We are planning a new release of the JobWrapper tests without R-GMA publishing (replaced completely by our internal SAM/GridView transport mechanism). For the time being, the only solution is to disable the JobWrapper tests on your site if you observe this behaviour.
To do this you have to remove all the symlinks that appear in the following two directories on all WNs:
$LCG_LOCATION/etc/jobwrapper-start.d
$LCG_LOCATION/etc/jobwrapper-end.d
I have now put in the necessary cfengine stanza to delete the links and stop this nonsense:
disable:
    worker::
        # Disable the SAM wrapper which uses R-GMA
        /opt/lcg/etc/jobwrapper-start.d/01-same.start
        /opt/lcg/etc/jobwrapper-end.d/01-same.end
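For sites that don't run cfengine, the same cleanup could be sketched in Python along these lines. This is only a rough illustration: the function name remove_jobwrapper_symlinks and the dry_run flag are my own inventions, and the /opt/lcg fallback simply assumes LCG_LOCATION points where it does in the stanza above. It removes every symlink in the two jobwrapper directories and leaves regular files alone.

```python
import os
from pathlib import Path


def remove_jobwrapper_symlinks(lcg_location, dry_run=True):
    """Remove symlinks from the two jobwrapper directories.

    Returns the list of symlink paths found. With dry_run=True nothing
    is deleted, so the result can be inspected first.
    (Hypothetical helper, not part of any LCG or SAM tooling.)
    """
    handled = []
    for sub in ("jobwrapper-start.d", "jobwrapper-end.d"):
        directory = Path(lcg_location) / "etc" / sub
        if not directory.is_dir():
            # Directory missing on this node: nothing to do here.
            continue
        for entry in sorted(directory.iterdir()):
            # Only symlinks are touched; regular files are left alone.
            if entry.is_symlink():
                if not dry_run:
                    entry.unlink()
                handled.append(str(entry))
    return handled


if __name__ == "__main__":
    # Assumption: fall back to /opt/lcg if LCG_LOCATION is not set.
    location = os.environ.get("LCG_LOCATION", "/opt/lcg")
    for path in remove_jobwrapper_symlinks(location, dry_run=True):
        print("would remove", path)
```

Run as written it only reports what it would delete; pass dry_run=False once the list looks right. It would need to run on every WN, which is exactly the job cfengine is doing for me.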
Now, how big a difference does it make? Quite a lot for short jobs - the wallclock time for a simple globus-job-run has gone down from 5 minutes to 3 seconds!
The total time for the jobmanager to handle the job has remained quite high: 1m30s, compared with 16s for the pbs jobmanager. However, at least no one is going to be "charged" for the time that the job spends with the gatekeeper, unlike the time spent in the batch queue.