Monday, March 26, 2007

Gatekeeper Troubles at Durham

Durham went on the blink at about 1am today - suddenly failing JL. The error message was the usual erudite globus effort: Got a job held event, reason: Globus error 3: an I/O operation failed.

Well, it looked straightforward enough - it's an i/o error, right? I found a lot of hints on Google that this was caused by errors in transferring in the sandbox. So check gridftp, home directory quotas, etc. Mark and I spent lots and lots of time on this, checking different things, becoming more and more confused (ok, so gridftp of a file works, can I make a directory using edg-gridftp-mkdir? have we restarted the gatekeeper properly? what's bound to ports 2811 and 2119? etc., etc.).
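For anyone repeating this sort of check from a remote machine, here is a minimal sketch of the "is anything listening on 2811 and 2119?" test. It only shows that a TCP connection is accepted, not that gridftp or the gatekeeper behind the port is healthy. The function names are made up for illustration, and the hostname in the usage note is a placeholder.

```python
import socket


def port_is_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port is accepted."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


def check_ce_ports(host, ports=(2811, 2119)):
    """Map each port to whether something accepted a connection on it.

    The defaults are the two ports mentioned above (gridftp and gatekeeper).
    """
    return {port: port_is_listening(host, port) for port in ports}
```

Calling check_ce_ports("ce.example.org") returns a dictionary keyed by port number. A False entry means nothing accepted the connection within the timeout, which could equally be a firewall in the way.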

In the end we just could not fathom what had gone wrong, so I suggested to Mark that he email LCG-ROLLOUT and TB-SUPPORT.

Maarten Litmaath pointed us to a GOC Wiki article which also said that this i/o error could occur when the CE was short of memory. I have found the culprit code in the l_check_memory function in Helper.pm - it produces a failure if the free memory (swap + physical) on the CE is less than 20% of the total. However, this error is not passed up the stack properly (in fact the code in queue_submit() returns undef), and so an entirely misleading error is passed back, which wasted hours of our time. Grrrr.
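To make the rule concrete, here is a rough sketch of the same kind of check in Python. This is not the Helper.pm code (which is Perl). I am assuming "free" means MemFree + SwapFree as reported in /proc/meminfo, and all the names below are invented for illustration. The point is the last step: fail with a message that says what is actually wrong, rather than returning nothing.

```python
class LowMemoryError(RuntimeError):
    """Raised when free memory falls below the configured fraction."""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def check_memory(meminfo_text, min_free_fraction=0.20):
    """Return the free fraction (swap + physical), or raise LowMemoryError.

    Assumption: free = MemFree + SwapFree, total = MemTotal + SwapTotal.
    """
    info = parse_meminfo(meminfo_text)
    total = info["MemTotal"] + info["SwapTotal"]
    free = info["MemFree"] + info["SwapFree"]
    if total <= 0:
        raise LowMemoryError("could not determine total memory")
    fraction = free / total
    if fraction < min_free_fraction:
        # Say what failed, so the caller has something useful to report.
        raise LowMemoryError(
            f"free memory (swap + physical) is {fraction:.0%} of total, "
            f"below the {min_free_fraction:.0%} threshold"
        )
    return fraction
```

On a Linux box this can be tried with check_memory(open("/proc/meminfo").read()). An exception carrying the real reason is the sort of thing that would have saved us the hours spent chasing gridftp.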

I was reminded of Alice in Wonderland...

`When I use an error message,' Humpty Dumpty said, in rather a scornful tone, `it means just what I choose it to mean -- neither more nor less.'

`The question is,' said Alice, `whether you can make error messages mean so many different things.'

`The question is,' said Humpty Dumpty, `which is to be master -- that's all.'

Alice was too much puzzled to say anything; so after a minute Humpty Dumpty began again. `They've a temper, some of them -- particularly Globus errors: they're the proudest - batch system errors you can do anything with, but not Globus errors - however, I can manage the whole lot of them! Impenetrability! That's what I say!'

`Would you tell me please,' said Alice, `what that means?'


I have submitted a GGUS ticket - these things won't improve unless they are complained about: https://savannah.cern.ch/bugs/index.php?25048.

1 comment:

Ravi said...

So do you think queue_submit() should return a better error that could be mapped to a rather decent globus error instead of sending an undef?