Since the last time we mentioned LHCb, we thought we had the problem licked.
Sadly, we were mistaken.
Like a Matryoshka doll, another problem lurked inside the first one we found. This one, however, was more widespread.
Although we'd fixed the problem of failing jobs, during the course of each job there was a noticeable number of transfer failures. That is, each job first attempted to send its data back to CERN, and if that failed, tried a number of other places until a transfer eventually worked. Notably, transfers to PIC always seemed to work fine.
During some other work involving ARC, I ended up tuning the TCP stack parameters on a service node, and noticed that we were using the default parameters on our worker nodes. This led down a rabbit hole, until we eventually found a solution.
The first idea was to tune the worker nodes for transfers to CERN, to see if making the transfers faster meant more would complete in time (and thus fewer would fail). Some tinkering suggested that the values YAIM puts on a DPM pool node were decent choices, so we slapped them in cfengine, and away we went.
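For reference, the sort of tuning in question is just a handful of sysctl settings on the worker nodes. The snippet below is a sketch only: the specific buffer sizes are illustrative assumptions, not the canonical values YAIM applies, which vary by YAIM version.

```shell
# /etc/sysctl.conf fragment - TCP tuning in the spirit of what YAIM
# applies to a DPM pool node. The numbers below are illustrative
# assumptions, not the exact YAIM values.
net.core.rmem_max = 1048576              # max socket receive buffer (bytes)
net.core.wmem_max = 1048576              # max socket send buffer (bytes)
net.ipv4.tcp_rmem = 4096 87380 1048576   # min / default / max TCP receive buffer
net.ipv4.tcp_wmem = 4096 65536 1048576   # min / default / max TCP send buffer
```

Applied with `sysctl -p` (or, as we did, pushed out and activated by cfengine).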
Working out what was happening took a bit longer, and was down to Rob Fay at Liverpool.
Part of the tuning that YAIM does is to turn off SACK and DSACK. The other parts, adjusting initial buffer sizes, turned out not to be relevant here. So why was SACK causing problems, and why was YAIM switching it off for the DPM pool nodes?
Well, there's a bug in the Linux conntrack module that treats SACK packets as invalid, and thus won't forward them. If the machine is the recipient of the packets, it's all fine, but the forwarding code was only fixed in 2.6.26 (two years ago!); before that, it would reject the SACK packets, which caused the connection to eventually revert to conventional ACKs. SL5.3 uses a 2.6.18 kernel.
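A quick way to see whether a given NAT box is exposed is to compare its kernel version against the 2.6.26 fix. A minimal sketch (using `sort -V` from GNU coreutils for the version comparison, which may not exist on very old systems):

```shell
#!/bin/sh
# Sketch: does this kernel predate the 2.6.26 conntrack fix that
# allows SACK packets to be forwarded correctly?
predates_fix() {
    running="$1"
    fixed="2.6.26"
    # A kernel at exactly the fixed version is fine
    [ "$running" = "$fixed" ] && return 1
    # sort -V orders version strings numerically; the first line is the older
    oldest=$(printf '%s\n%s\n' "$running" "$fixed" | sort -V | head -n1)
    [ "$oldest" = "$running" ]
}

# Strip any distro suffix, e.g. "2.6.18-194.el5" -> "2.6.18"
if predates_fix "$(uname -r | cut -d- -f1)"; then
    echo "kernel predates 2.6.26: NATed SACK traffic may be rejected"
fi
```

On an SL5.3 NAT box (2.6.18) this flags the problem; on anything 2.6.26 or later it stays quiet.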
As to why YAIM turns it off for DPM pool nodes: apparently because that's what YAIM did for CASTOR pool nodes at the time the YAIM module was written. (It doesn't today.) This also explains why the transfers to PIC always worked: SACK needs both sides to agree to use it, and PIC uses a DPM (hence no SACK).
So, the upshot of all that is that transfers from worker nodes behind a NAT to a storage element that isn't a DPM will be hit by this bug, crippling performance.
Solutions to this are, in rough order of preference:
1. Always transfer to local storage and stage on from there.
2. Don't use NATs.
3. If you have to transfer to remote storage, and have to use a NAT, turn off SACK and DSACK.
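If you do end up at option 3, the toggle itself is straightforward. The runtime change takes effect immediately; add the same settings to /etc/sysctl.conf so they survive a reboot:

```shell
# Disable SACK and DSACK at runtime (needs root)
sysctl -w net.ipv4.tcp_sack=0
sysctl -w net.ipv4.tcp_dsack=0

# Persist the settings across reboots
cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.tcp_sack = 0
net.ipv4.tcp_dsack = 0
EOF
```

Remember this has to happen on the worker nodes behind the NAT; disabling it on one side is enough, since SACK is only used when both ends negotiate it.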