We have tried pretty much everything. Simple lcg-ls and lcg-cp commands actually work from the worker nodes, so it's not a certificate issue. The failures are not particular to a CE. Nothing changed at our site prior to the failure, and LHCb say nothing changed at their end. In fact, they have sites in the UK, such as Manchester, working fine.
The failures do not correspond to a particular set of worker nodes, which might otherwise have pointed to a NAT issue on our side, as we split our odd and even nodes through separate NATs. However, it does look like network contention at some point in the process, as we see either broken pipes or timeouts in the logs direct from Globus:
2010-03-04 04:04:56 UTC dirac-jobexec.py ERROR: SRM2Storage.__putFile: Failed to put file to storage. file:/tmp/8230840/CREAM603030715/7472318/00005987_00009161_3.dst: globus_xio: System error in writev: Broken pipe
2010-03-04 04:04:56 UTC dirac-jobexec.py ERROR: globus_xio: A system call failed: Broken pipe
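For anyone wanting to repeat the two checks described above (which error types show up, and whether failures track the odd/even NAT split), a rough Python sketch follows. The log line layout is taken from the lines quoted above. Everything else is an assumption for illustration: the hostname pattern (a trailing number in the short name, e.g. "node017"), the (hostname, ok) record format, and all function names are hypothetical.

```python
import re
from collections import Counter

# Layout copied from the DIRAC log lines quoted above:
# "<date> <time> UTC <script> ERROR: <message>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC \S+ ERROR: (.*)$"
)


def classify(message):
    """Bucket an error message as broken_pipe, timeout, or other."""
    text = message.lower()
    if "broken pipe" in text:
        return "broken_pipe"
    if "timeout" in text or "timed out" in text:
        return "timeout"
    return "other"


def tally_errors(lines):
    """Count ERROR lines per bucket; lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[classify(match.group(2))] += 1
    return counts


def nat_group(hostname):
    """Return 'odd' or 'even' from the trailing number of the short hostname.

    Assumption: worker nodes are named like 'node017' or
    'node017.example.org'. Returns None if there is no trailing number.
    """
    short_name = hostname.split(".")[0]
    match = re.search(r"(\d+)$", short_name)
    if match is None:
        return None
    return "odd" if int(match.group(1)) % 2 else "even"


def failures_by_nat(records):
    """Map each NAT group to (failures, total).

    `records` is an iterable of (hostname, ok) pairs, where ok is True
    for a successful upload. This record format is hypothetical.
    """
    failures = Counter()
    totals = Counter()
    for hostname, ok in records:
        group = nat_group(hostname)
        totals[group] += 1
        if not ok:
            failures[group] += 1
    return {group: (failures[group], totals[group]) for group in totals}
```

If the failure counts come out roughly equal for the odd and even groups, that would be consistent with the NATs not being the cause.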
The only constant so far is that there appears to be a 50% failure rate, all from failed uploads, and it shows up consistently for submissions from DIRAC.
It's certainly a puzzler, and we are fast running out of ideas!