Durham went on the blink at about 1am today - suddenly failing JL. The error message was the usual erudite Globus effort: Got a job held event, reason: Globus error 3: an I/O operation failed.
Well, it looked straightforward enough - it's an i/o error, right? I found a lot of hints on Google that this was caused by errors transferring in the sandbox. So check gridftp, home directory quotas, etc. Mark and I spent lots and lots of time on this, checking different things and becoming more and more confused (ok, so gridftp of a file works, can I make a directory using edg-gridftp-mkdir? have we restarted the gatekeeper properly? what's bound to ports 2811 and 2119? etc., etc.).
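For anyone retracing these steps, the "what's bound to ports 2811 and 2119?" check can be scripted from another machine. This is only a rough sketch, not what we actually ran: the function names are made up for illustration, the hostname in the usage note is a placeholder, and a successful TCP connect only shows that something is listening, not that the gatekeeper or gridftp server behind it is healthy.

```python
import socket

# Standard ports for the two services mentioned above.
SERVICES = {2811: "GridFTP", 2119: "Globus gatekeeper"}


def port_is_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


def check_services(host, services=SERVICES):
    """Map each service name to whether its port accepts connections."""
    return {name: port_is_listening(host, port)
            for port, name in services.items()}
```

Calling check_services("ce.example.org") (a placeholder hostname) returns a dictionary such as {"GridFTP": True, "Globus gatekeeper": True} when both ports accept connections. In our case checks like this passed, which is exactly why the real cause was so hard to spot.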
In the end we just could not fathom what had gone wrong, so I suggested to Mark that he email LCG-ROLLOUT and TB-SUPPORT.
Maarten Litmaath pointed us to a GOC Wiki article which also said that this i/o error could occur when the CE was short of memory. I have found the culprit code in the l_check_memory function in Helper.pm - it produces a failure if the free memory (swap + physical) on the CE is less than 20% of the total. However, this error is not passed up the stack properly (in fact the code in queue_submit() returns undef), and so an entirely misleading error is passed back, which wasted hours of our time. Grrrr.
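To make the complaint concrete, here is roughly what I would have liked the check to do. This is a Python sketch of the idea, not the actual Perl in Helper.pm: the function and exception names are invented, and I am assuming the numbers come from /proc/meminfo-style text using the MemFree and SwapFree fields (the real code may count free memory differently). The point is that a failed check raises an error that says what is wrong, rather than returning an undefined value that later turns into "an I/O operation failed".

```python
class InsufficientMemoryError(RuntimeError):
    """Raised when too little memory is free to accept a job."""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in kB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def check_memory(meminfo_text, min_free_fraction=0.20):
    """Return the free fraction of (physical + swap) memory.

    Raises InsufficientMemoryError with a descriptive message when the
    free fraction is below min_free_fraction (20% by default).
    """
    info = parse_meminfo(meminfo_text)
    total = info["MemTotal"] + info["SwapTotal"]
    free = info["MemFree"] + info["SwapFree"]
    if total == 0:
        raise InsufficientMemoryError("could not determine total memory")
    fraction = free / total
    if fraction < min_free_fraction:
        raise InsufficientMemoryError(
            f"only {free} kB of {total} kB (physical + swap) is free; "
            f"at least {min_free_fraction:.0%} must be free to submit a job"
        )
    return fraction
```

On a Linux machine this could be fed with open("/proc/meminfo").read(). A caller that lets the exception propagate (or logs its message) would have told us within seconds that the CE was short of memory.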
I was reminded of Alice in Wonderland...
`When I use an error message,' Humpty Dumpty said, in rather a scornful tone, `it means just what I choose it to mean -- neither more nor less.'
`The question is,' said Alice, `whether you can make error messages mean so many different things.'
`The question is,' said Humpty Dumpty, `which is to be master -- that's all.'
Alice was too much puzzled to say anything; so after a minute Humpty Dumpty began again. `They've a temper, some of them -- particularly Globus errors: they're the proudest - batch system errors you can do anything with, but not Globus errors - however, I can manage the whole lot of them! Impenetrability! That's what I say!'
`Would you tell me please,' said Alice, `what that means?'
I have submitted a GGUS ticket - these things won't improve unless they are complained about: https://savannah.cern.ch/bugs/index.php?25048.