Tuesday, March 13, 2007

Torque/Maui Upgrade Lost Jobs?

Was the torque upgrade not as smooth as I'd hoped?

It should have been just a minor upgrade, and thus pretty transparent, but some issues have arisen.

Firstly, one of our local engineering users reported that the gatekeeper lost contact with all of his jobs - the jobs were still running in the batch queue, but globus-job-status reported them all as done, and he couldn't get any output back.

Then I noticed that on the ATLAS production monitor page our 24 hour efficiency had dropped to its lowest ever level, 24%. This makes me rather worried that, if the gatekeeper had a brain haemorrhage, we lost all of our ATLAS jobs.

On the other hand, efficiency in the UK seems generally very low right now (ce02.tier2.hep.manchester.ac.uk, 26%; ce1.pp.rhul.ac.uk, 24%; fal-pygrid-18.lancs.ac.uk, 17%; lcgce01.gridpp.rl.ac.uk, 14%), so perhaps this is just a coincidence?

Doesn't explain the globus issues seen by our local user though.

Does anyone know the magic for getting into the guts of the gatekeeper and seeing which torque jobs it's connected to?
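
Short of that magic, one rough cross-check is to compare what the gatekeeper says against what torque says. The sketch below assumes you have already collected, by hand or otherwise, the batch job IDs that globus-job-status claims are done; how to pull that mapping out of the gatekeeper is exactly the open question and is not solved here. It parses the default tabular output of `qstat` and lists any of those "done" jobs that torque still considers alive. The function names, hostnames, and sample job IDs are made up for illustration.

```python
# Hypothetical sample of default `qstat` output; ids and names are invented.
SAMPLE = """\
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
1201.svr.example    STDIN            eng001          01:12:40 R long
1202.svr.example    STDIN            eng001          00:00:00 Q long
1203.svr.example    STDIN            atlas002        02:30:11 C long
"""


def parse_qstat(text):
    """Map job id -> state letter from default `qstat` tabular output."""
    jobs = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row and the dashed separator row.
        if len(fields) < 6 or fields[0] == "Job" or set(fields[0]) == {"-"}:
            continue
        # State is the second-to-last column, the queue name is last.
        jobs[fields[0]] = fields[-2]
    return jobs


def still_alive(reported_done, qstat_jobs):
    """Return ids reported as done that torque still lists as not finished.

    `reported_done` is an assumption: a collection of batch job ids that
    the gatekeeper side claims are done. States C (completed) and
    E (exiting) are treated as finished; anything else counts as alive.
    """
    finished = {"C", "E"}
    return sorted(
        job_id
        for job_id in reported_done
        if job_id in qstat_jobs and qstat_jobs[job_id] not in finished
    )


if __name__ == "__main__":
    jobs = parse_qstat(SAMPLE)
    done = ["1201.svr.example", "1203.svr.example"]
    for job_id in still_alive(done, jobs):
        print(job_id, "reported done but still in state", jobs[job_id])
```

On a real CE you would feed `parse_qstat` the text captured from running `qstat` on the torque server rather than the sample string.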

1 comment:

Graeme Stewart said...

Chatted with Steve Traylen about this and he had never heard of the gatekeeper losing contact with jobs, nor could he see how it could happen.

So the low ATLAS efficiency was probably just coincidence. Most of the UK seems to have recovered to ~50% now (terribly low, really...).