Friday, March 12, 2010

LHCb Production Failures

Over the last week we have been investigating why around 50% of our LHCb jobs are failing. They all seem to fail for the same reason: intermittently, a job cannot copy its results back to the Tier 0 or to the subsequent fail-over Tier 1 site. This is not just a Glasgow issue; it has also affected Sheffield and Brunel, although it appears to have gone away at Brunel.

We have tried pretty much everything. Simple lcg-ls and lcg-cp commands work from the worker nodes, so it's not a certificate issue. The failures are not particular to a CE. Nothing changed at our site prior to the failures, and LHCb say nothing changed at their end. In fact, they have other UK sites, such as Manchester, working fine.

The failures do not correspond to any particular set of worker nodes. Such a pattern might have indicated NAT issues on our side, as we split our odd and even nodes across separate NATs. However, it does look like network contention at some point in the process, as we see either broken pipes or timeouts in the logs direct from Globus.
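
For anyone wanting to repeat the odd/even NAT check at their own site, here is a rough sketch of how it could be done. It is not the script we used. It assumes worker node names end in a number (the names like "node013" are hypothetical) and that you already have a list of (hostname, succeeded) pairs pulled from your job records.

```python
import re
from collections import Counter


def nat_group(hostname):
    """Return 'odd' or 'even' from the last number in the short host name.

    Assumption: nodes are numbered and the odd/even split follows that number.
    """
    short_name = hostname.split(".")[0]
    numbers = re.findall(r"\d+", short_name)
    if not numbers:
        return "unknown"
    return "odd" if int(numbers[-1]) % 2 else "even"


def failure_rate_by_nat(jobs):
    """Compute the failure rate per NAT group.

    jobs is an iterable of (hostname, succeeded) pairs.
    """
    totals = Counter()
    failures = Counter()
    for hostname, succeeded in jobs:
        group = nat_group(hostname)
        totals[group] += 1
        if not succeeded:
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}
```

If the two groups come out with roughly the same failure rate, as they did for us, the NATs are probably not the culprit.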


2010-03-04 04:04:56 UTC dirac-jobexec.py ERROR: SRM2Storage.__putFile: Failed to put file to storage. file:/tmp/8230840/CREAM603030715/7472318/00005987_00009161_3.dst: globus_xio: System error in writev: Broken pipe
2010-03-04 04:04:56 UTC dirac-jobexec.py ERROR: globus_xio: A system call failed: Broken pipe
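
To get a feel for how the errors divide between broken pipes and timeouts, log lines in the format above can be tallied with something like the following sketch. The line layout is taken from the excerpt above. The strings used to spot a timeout are a guess on my part, since no timeout message is quoted here, so adjust them to whatever your logs actually say.

```python
import re
from collections import Counter

# Layout taken from the excerpt: "<date> <time> UTC <source> <LEVEL>: <message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC "
    r"(?P<source>\S+) (?P<level>[A-Z]+): (?P<message>.*)$"
)


def classify(message):
    """Put an error message into a coarse category."""
    lowered = message.lower()
    if "broken pipe" in lowered:
        return "broken pipe"
    # Assumption: timeout messages contain one of these substrings.
    if "timed out" in lowered or "timeout" in lowered:
        return "timeout"
    return "other"


def tally_errors(lines):
    """Count ERROR lines per category; lines in another layout are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "ERROR":
            counts[classify(match.group("message"))] += 1
    return counts
```

Passing an open log file to tally_errors works, since it only needs an iterable of lines.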


The only constant so far is that there appears to be a 50% failure rate, caused by failed uploads, which occurs consistently for jobs submitted through DIRAC.

It's certainly a puzzler and we are fast running out of ideas!
