Tuesday, May 25, 2010

So long and thanks for all the fish

I would just like to say thanks to everyone I have worked with at ScotGrid, GridPP and EGEE. I couldn't have picked a better time to be working on the grid, LCG and WLCG. I have learned a lot, accomplished most of the things I set out to do and hopefully contributed to the project in some small way. I will always be on the other end of an email should you wish to get in touch. So long and thanks for all the fish.

Thursday, May 20, 2010

gLite VirtualBox Image takes off

A new user today was looking for the download of our pre-built gLite UI VirtualBox image, for use with two VOs: vo.iscpif.fr and vo.complex-systems.eu.

The ISC-PIF (Institut des Systèmes Complexes, Paris Île-de-France) is a multidisciplinary research and training center promoting the development of French, European and international strategic projects on complex adaptive systems, construed as large networks of elements interacting locally and creating macroscopic collective behaviour.

Hopefully we will get some feedback on the image and any improvements that could be made.

Wednesday, May 19, 2010

SGE and Lustre

On my list of things to do was to install (Sun/Oracle) Grid Engine and get a CREAM CE submitting to it on my development cluster. So far I have SGE installed and running qsub jobs. I am documenting the experience for those who are interested here. I have opted for Lustre rather than NFS 3, as NFS 3 is painfully ill-equipped for the task and we have a test Lustre instance to play with, so why not go the whole hog.
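For anyone following along, the sort of sanity check I am running looks like this; the script name and options are just an example rather than anything from our production setup:

cat > hello.sh <<'EOF'
#!/bin/bash
#$ -cwd
#$ -o hello.out
#$ -e hello.err
hostname
EOF
qsub hello.sh          # submit to the default queue
qstat -u $USER         # watch it run, then check hello.out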

the positives ...
1. The wealth of documentation on the Oracle page.
2. The interactive install is very easy to do.

and the negatives ...
1. The RPMs' default install location is /gridware and I can't seem to get yum to honour anything like a --prefix option, something you can do with rpm itself. Ideas welcome? (See the sketch after this list.)
2. The automatic install scripts have no debugging in them at all: when they fail, they fail silently with no output or logs. I have only managed an interactive install so far, but I will try running them under /bin/sh -x and see if that makes a difference. Hopefully I can get the automatic script working so that cfengine can handle installing the execution hosts rather than doing it by hand.
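Two things I plan to try on that front; the package filename, prefix and config file below are illustrative rather than exact:

rpm -qp --qf '%{PREFIXES}\n' sge-6.2u5-*.rpm    # empty output means the package is not relocatable
rpm -ivh --prefix /opt/sge sge-6.2u5-*.rpm      # relocated install, bypassing yum

sh -x ./install_execd -auto ./my_sge.conf 2>&1 | tee /tmp/sge_install_trace.log   # trace the silent auto-installer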

Friday, May 14, 2010

The return of LHCb at Glasgow

After weeks of investigating and debugging our LHCb transfer issue at Glasgow we have finally fixed it. So ..... spill the beans I hear you cry.

Well, in short, we had an iptables rule on the INPUT chain of the NAT box that was rejecting strangely behaving gridftp connections. Relaxing it allowed those inbound connections to be established, which solved the issue, and we still have the protection of the campus firewall for security.

Strangely behaving gridftp connections, what does that mean? Well, transfers that had failed to work first time seemed to get into an unknown state and transfer no bytes, with many retransmitted packets and no FIN packet. It appears that these connections were trying to establish new inbound connections, which were then dropped by a REJECT rule in our iptables.

The moral of the story is: if you can get external IPs for your worker nodes, use them. NATing just adds complexity, especially when dealing with Globus.

The full story if you are interested ....

Problem: LHCb don't use FTS. They use direct outbound gridftp transfers of job outputs. Jobs on the WNs transfer their results at the end of the job, using the lcg-utils tools, to CERN and fail over to various T1s if there is an issue with the CERN transfer. LHCb have seen a large failure rate at Glasgow, with around 50% of gridftp/lcg-cp transfers failing. Brunel, Sheffield and Lancaster have been affected by the same issue, although to a lesser extent, with failure rates of around 2-3%. We see the initial transfer timing out and failing over to a T1; this sometimes works and sometimes fails over to another T1, and so on. Why has this not been seen sooner? Well, this has actually been there since day dot, but DIRAC masked the return code of the failure. A new version of DIRAC catches the failovers, and the jobs are killed by its watchdog, bringing the issue to the surface.
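For context, the transfer each job attempts looks roughly like this; the local filename and destination SURL are made up for illustration (the real ones come from DIRAC):

lcg-cp --vo lhcb -v \
  file:$PWD/job_output.tar.gz \
  srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/user/j/jbloggs/job_output.tar.gz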

Investigation: Glasgow looks like this: WNs -> NAT -> campus firewall -> world. We managed to recreate the issue with a simple transfer test from varying numbers of WNs to test SRM endpoints. This reproduced the roughly 50% failure rate across various SRM implementations, in particular CASTOR, dCache and StoRM; DPM transfers, however, were 100% successful. Failed transfers manifested themselves as "lcg-cp: timed out" or "lcg-cp: error on send". We repeated these tests using various VOs and got similar results, so we did not think it was VO related. We monitored the connections through our NAT and asked the firewall team to check whether any outbound ports were blocked; they were not. The GLOBUS_TCP_PORT_RANGE at Glasgow was set to a specific known open port range for inbound connections, but this does not matter in this case of outbound connections. To be on the safe side we also set GLOBUS_TCP_SOURCE_RANGE for outbound connections through our NAT. As we expected, this did not make a difference. After discussion with other sites we checked client libraries, OS and network. One thing that did crop up was the use of NAT.
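For reference, both of those knobs are just environment variables on the WNs; the ranges shown here are placeholders rather than our actual firewall holes:

export GLOBUS_TCP_PORT_RANGE=20000,25000     # ports Globus may listen on for inbound connections
export GLOBUS_TCP_SOURCE_RANGE=20000,25000   # source ports Globus may use for outbound connections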

The final test was 100 simultaneous transfers from one node via the NAT. We saw a 50% failure rate. We repeated this test, but this time from an external address with no NAT routing, and it was 100% successful over 3 attempts. Quickly repeated tests did show some failures, but this was probably the firewall dropping connections. We were therefore able to clearly identify the NAT as the issue. We tried tweaking TCP settings on the NAT, i.e. tcp_fin_timeout, tcp_tw_reuse, tcp_tw_recycle and tcp_keepalive_time, with no success. The iptables rules themselves seemed sensible, but we were still dropping 50% of the connections.
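The tweaks were along these lines; the values are just ones we experimented with on the NAT box, not a recommendation:

sysctl -w net.ipv4.tcp_fin_timeout=30        # release FIN-WAIT sockets sooner
sysctl -w net.ipv4.tcp_tw_reuse=1            # allow reuse of TIME-WAIT sockets for new outbound connections
sysctl -w net.ipv4.tcp_tw_recycle=1          # aggressive TIME-WAIT recycling
sysctl -w net.ipv4.tcp_keepalive_time=600    # start sending keepalives after 10 minutes idle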

We then moved to tcpdumping the TCP packets (SYN and FIN) on the internal (eth0) device and compared them with a tcpdump on the external (eth1) device. You could clearly see the control channels opening, the data channels opening, the transfers, and then around 50% of the transfers sending retransmissions and never sending a FIN. It looked like something was being blocked.
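Something along these lines on each interface was enough to see the difference; the port range filter is an example and should match whatever GLOBUS range the site actually uses:

tcpdump -n -i eth0 -w eth0.pcap 'tcp[tcpflags] & (tcp-syn|tcp-fin) != 0 and portrange 20000-25000'   # inside the NAT
tcpdump -n -i eth1 -w eth1.pcap 'tcp[tcpflags] & (tcp-syn|tcp-fin) != 0 and portrange 20000-25000'   # outside the NAT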

A closer look at the iptables rules identified an entry on the INPUT chain that could be the culprit. Further up the chain we were allowing RELATED,ESTABLISHED traffic, as you would expect. Then we had a -A INPUT -i eth1 -p tcp -m tcp -j REJECT --reject-with tcp-reset. It appears this entry caused attempts to re-establish the connection to fail (possibly by rejecting the initial packet from the destination, erroneously considering it not to count as ESTABLISHED any more). Very strange behaviour indeed. On the plus side, we generally use the campus firewall to protect us from unwanted traffic rather than our own iptables rules, so we have relaxed the INPUT filter and, guess what, near 100% transfer success.
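In sketch form, the relevant part of the chain looked roughly like this (heavily simplified; the real ruleset has many more entries):

iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT            # further up the chain, as expected
iptables -A INPUT -i eth1 -p tcp -m tcp -j REJECT --reject-with tcp-reset   # the rule that bit us: anything else arriving on the external interface gets a TCP reset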

Thursday, May 13, 2010

A fistful of user jobs...

No ATLAS production to do in the UK, but we have a nice full cluster anyway, with more than 1000 user jobs running:


svr016:~# qstat -q

server: svr016.gla.scotgrid.ac.uk

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
q2d                --   48:00:00 48:00:00   --  134  12 --   E R
atlanaly           --   24:00:00 24:00:00   --  738 789 --   E R
atlprd             --   48:00:00 48:00:00   --    2  48 --   E R
q7d                --   168:00:0 168:00:0   --    0   0 --   E R
route2all          --      --       --      --    0   0 --   E R
q1d                --   24:00:00 24:00:00   --   94  16 --   E R
mpi                --      --    72:00:00   --    0   0 --   E R
atlas              --   24:00:00 24:00:00   --  470 312 --   E R
lhcb               --   48:00:00 48:00:00   --    0   0 --   E R
                                               ----- -----
                                                1438  1177

The atlanaly queue takes jobs from the panda backend and the atlas queue takes WMS backend jobs.

Today the particular job mix was kind to the storage, with no overloads being seen, but it's something we constantly have to monitor to pre-empt problems.

Postscript: I had another look and realised that most of the WMS backend jobs were from hammercloud (Sam testing SSDs!). Seems that genuine user WMS jobs were about 20-30, with more than 1000 in the panda backend.

Wednesday, May 05, 2010

CREAM thickens

CREAM at Glasgow has been upgraded to the latest gLite 3.2 release, 3.2.5-0.sl5 (or INFN version 1.6). This brings lots of enhancements, such as:

self-limiting behaviour à la WMS
a new proxy purger to clean expired proxies from the delegation DB and from the file system
a new way to customise the job wrapper
an improved proxy renewal mechanism
and one of my favourites, support for ISB/OSB transfers from/to gridftp servers run using user credentials rather than server host certificates. This will work well with users running gridftp servers on their own machines, for example (as long as they don't turn them off when they go home at night!). A rough sketch of what that looks like is below.
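As an illustration only (the hostname, paths and CE endpoint here are made up), a JDL pointing its sandboxes at a gridftp server running under the user's own credentials, submitted with the standard CREAM CLI, would look something like this:

cat > mytest.jdl <<'EOF'
[
  Executable = "myanalysis.sh";
  InputSandbox = { "gsiftp://mydesktop.physics.gla.ac.uk/home/user/myanalysis.sh" };
  OutputSandbox = { "results.tar.gz" };
  OutputSandboxBaseDestURI = "gsiftp://mydesktop.physics.gla.ac.uk/home/user/results/";
]
EOF
glite-ce-job-submit -a -r svr0XX.gla.scotgrid.ac.uk:8443/cream-pbs-q1d mytest.jdl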

you can find out more about them all here. Now back to draining the WMS.