Friday, May 14, 2010

The return of LHCb at Glasgow

After weeks of investigating and debugging our LHCb transfer issue at Glasgow, we have finally fixed it. So ..... spill the beans, I hear you cry.

Well, in short, we had an iptables rule in the INPUT chain on the NAT that was dropping strangely behaving gridftp connections. This was relaxed to allow inbound connections to be established. That has solved the issue, and we still have the protection of the campus firewall for security.
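As a rough sketch, the relaxation amounted to something like the rule below (the port range is a placeholder standing in for whatever GLOBUS_TCP_PORT_RANGE advertises; our actual ruleset differs):

  # Accept new inbound gridftp data-channel connections arriving on the
  # external interface instead of rejecting them, leaving perimeter
  # filtering to the campus firewall. 20000:25000 is illustrative only.
  iptables -A INPUT -i eth1 -p tcp --dport 20000:25000 -m state --state NEW -j ACCEPT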

Strangely behaving gridftp connections: what does that mean? Well, transfers that had failed to work first time seemed to get into a stuck state and transfer no bytes, with many TCP retransmissions and never a FIN packet. It appears these connections were trying to establish inbound connections back through the NAT, and these were then dropped by a REJECT rule within our iptables.
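If you want to see this happening on your own NAT box, one quick check is to watch the packet counters on the INPUT chain while a failing transfer runs; if the counter on a REJECT rule climbs, that is where the inbound packets are dying:

  # List INPUT rules with packet/byte counters and rule numbers,
  # refreshing every 2 seconds while transfers are in flight.
  watch -n 2 'iptables -L INPUT -v -n --line-numbers'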

The moral of the story is: if you can get external IPs for your worker nodes, use them. NAT just adds complexity, especially when dealing with Globus.

The full story, if you are interested ....

Problem: LHCb don't use FTS. They use direct outbound gridftp transfers of job outputs: jobs on the WNs transfer their results to CERN at the end of the job, using the lcg-utils tools, and fail over to various T1s if there is an issue with the CERN transfer. LHCb have seen a large failure rate at Glasgow, with around 50% of gridftp/lcg-cp transfers failing. Brunel, Sheffield and Lancaster have been affected by the same issue, although to a lesser extent; failure rates at those sites are much lower, at around 2-3%. We see the initial transfer timing out and failing over to a T1; this sometimes works and sometimes fails over to another T1, and so on. Why has this not been seen sooner? Well, it has actually been there since day dot, but DIRAC masked the return code of the failure. A new version of DIRAC catches the fail-overs and the jobs are killed by its watchdog, which brought the issue to the surface.
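To make the pattern concrete, here is a rough sketch of an upload-with-failover loop of the kind described above. This is not DIRAC's actual code; the endpoints, paths and lcg-cp options are placeholders.

  #!/bin/bash
  # Try to copy the job output to CERN first, then fail over to a T1 SURL.
  # All endpoints and paths below are illustrative placeholders only.
  LOCAL="file:$PWD/job_output.dst"
  CERN_SURL="srm://srm-lhcb.cern.ch/lhcb/some/path/job_output.dst"
  T1_SURLS="srm://srm-lhcb.gridpp.rl.ac.uk/lhcb/some/path/job_output.dst"

  for SURL in $CERN_SURL $T1_SURLS; do
      # -b skips the BDII lookup, so the SE type is given explicitly with -D
      if lcg-cp --vo lhcb -b -D srmv2 -v "$LOCAL" "$SURL"; then
          echo "Upload succeeded to $SURL"
          exit 0
      fi
      echo "Upload to $SURL failed, failing over ..." >&2
  done
  exit 1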

Investigation: Glasgow looks like this: WNs -> NAT -> campus firewall -> world. We managed to recreate the issue with a simple transfer test from varying numbers of WNs to test SRM endpoints. This recreated the issue and we saw a 50% failure rate across various SRM implementations, in particular CASTOR, dCache and StoRM; DPM transfers, however, were 100% successful. Failed transfers manifested themselves as "lcg-cp: timed out" or "lcg-cp: error on send". We repeated these tests using various VOs and got similar results, so we did not think it was VO-related. We monitored the connections through our NAT and asked the firewall team to check whether any outbound ports were blocked; they were not. The GLOBUS_TCP_PORT_RANGE at Glasgow was set to a specific, known-open port range for inbound connections, but that should not matter for outbound connections. To be on the safe side we also set GLOBUS_TCP_SOURCE_RANGE for outbound connections through our NAT; as we expected, this did not make a difference. After discussion with other sites we checked client libraries, OS and network. One thing that did crop up was the use of NAT.
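For reference, these knobs are just environment variables picked up by the Globus/lcg-utils client, along the lines of the following (the ranges shown are illustrative, not our actual values):

  # Ports the client will listen on for inbound (passive) data channels.
  export GLOBUS_TCP_PORT_RANGE=20000,25000
  # Source ports used for outbound data connections, useful if your
  # NAT/firewall only permits a known source range.
  export GLOBUS_TCP_SOURCE_RANGE=20000,25000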

The final test was 100 simultaneous transfers from one node via the NAT. We saw a 50% failure rate. We repeated this test, but this time with an external address and no NAT routing; this was 100% successful over 3 attempts. Quickly repeated tests did show some failures, but that was probably the firewall dropping connections. Therefore we were able to clearly identify the NAT as the issue. We tried tweaking TCP settings on the NAT, e.g. tcp_fin_timeout, tcp_tw_reuse, tcp_tw_recycle and tcp_keepalive_time, with no success. The iptables rules themselves seemed sensible, but we were still dropping 50% of the connections.
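For completeness, this is the sort of tuning we tried on the NAT box (the values are illustrative, not a recommendation, and none of it changed the failure rate for us):

  # Shorten FIN-WAIT, reuse/recycle TIME-WAIT sockets, shorten keepalive.
  sysctl -w net.ipv4.tcp_fin_timeout=30
  sysctl -w net.ipv4.tcp_tw_reuse=1
  sysctl -w net.ipv4.tcp_tw_recycle=1
  sysctl -w net.ipv4.tcp_keepalive_time=1200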

We then moved to tcpdumping the TCP SYN and FIN packets on the internal (eth0) interface and comparing them with a tcpdump on the external (eth1) interface. You could clearly see the control channels opening, the data channels opening, the transfers running, and then around 50% of the transfers retransmitting packets and never sending a FIN. It looked like something was being blocked.
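Something along these lines, with the interface names as on our NAT box (eth0 internal, eth1 external):

  # Capture only connection setup/teardown (SYN and FIN) on each side of
  # the NAT, then compare the two captures.
  tcpdump -i eth0 -n -w nat-internal.pcap 'tcp[tcpflags] & (tcp-syn|tcp-fin) != 0'
  tcpdump -i eth1 -n -w nat-external.pcap 'tcp[tcpflags] & (tcp-syn|tcp-fin) != 0'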

A closer look at the iptables rules identified an entry in the INPUT chain that could be the culprit. Further up the chain we were allowing RELATED,ESTABLISHED traffic, as you would expect. Then we had: -A INPUT -i eth1 -p tcp -m tcp -j REJECT --reject-with tcp-reset. It appears this entry caused attempts to re-establish the connection to fail (possibly by rejecting the initial packet from the destination once the connection was erroneously no longer counted as ESTABLISHED). Very strange behaviour indeed. On the plus side, we generally use the campus firewall to protect us from unwanted traffic rather than our own iptables rules, so we have relaxed the INPUT filter and, guess what, near 100% transfer success.
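For illustration, a reconstructed sketch of the relevant part of the chain before the change (our real ruleset has more entries than this):

  # Accept anything conntrack considers part of an existing connection ...
  iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
  # ... various site-specific ACCEPT rules ...
  # ... then reset any other TCP packet arriving on the external interface,
  # which is what killed the re-established gridftp data channels.
  iptables -A INPUT -i eth1 -p tcp -m tcp -j REJECT --reject-with tcp-reset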
