Wednesday, January 10, 2007

Happy New Year ScotGrid!

Despite running unattended over the Christmas holidays, the ScotGrid sites were by and large available.

From 2006-12-20 to 2007-01-03 SAM status for each site was:

ScotGRID-Edinburgh: SAM test pass rate 241/338 = 71.3%
(Failures: JS 3, RM 94)
scotgrid-gla: SAM test pass rate 324/338 = 95.9%
(Failures: JS 4, RM 10)
UKI-SCOTGRID-DURHAM: SAM test pass rate 318/338 = 94.1%
(Failures: JS 6, RM 14)
UKI-SCOTGRID-GLA: SAM test pass rate 328/338 = 97.0%
(Failures: JS 1, RM 9)
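
For anyone wanting to check the arithmetic, here is a minimal sketch that recomputes the pass rates from the failure counts. It assumes every site ran the same 338 tests over the period and that each failed run is counted under exactly one of JS or RM, which is consistent with the figures quoted; the `pass_rate` helper is just a name made up for this illustration.

```python
# Recompute SAM pass rates from the per-site failure counts quoted above.
# Assumption: 338 test runs per site, each failure counted once (JS or RM).
TOTAL_TESTS = 338

failures = {
    "ScotGRID-Edinburgh": {"JS": 3, "RM": 94},
    "scotgrid-gla": {"JS": 4, "RM": 10},
    "UKI-SCOTGRID-DURHAM": {"JS": 6, "RM": 14},
    "UKI-SCOTGRID-GLA": {"JS": 1, "RM": 9},
}


def pass_rate(site_failures, total=TOTAL_TESTS):
    """Return (tests passed, pass percentage) for one site."""
    passed = total - sum(site_failures.values())
    return passed, 100.0 * passed / total


for site, counts in failures.items():
    passed, pct = pass_rate(counts)
    print(f"{site}: {passed}/{TOTAL_TESTS} = {pct:.1f}%")
```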

Setting Edinburgh aside, the JS and RM failures at the other sites look like ordinary grid flakiness, caused by higher-level services failing or timing out (e.g., the RM test times out, the BDII fails to respond, etc.). So our only significant problem was Edinburgh's dCache, which did seem to be unreliable.

Any comments, Greig? Did you have to intervene?


Greig A Cowan said...

Our dCache is still suffering from the TCP CLOSE_WAIT issue (in fact, it is slightly different to the problem I previously solved). If our dCache had been quiet over the period, this wouldn't have been a problem and we would have continued to pass the SFTs, but (and this is a good thing) ATLAS has been constantly transferring data into and out of the site over the past 3-4 weeks. Looking at the storage accounting pages, they have copied ~1.2TB into our dCache over the period. All this activity resulted in the TCP CLOSE_WAIT problem reappearing (see attachment). Once the number of CLOSE_WAITs gets to a certain level (~300), the GridFTP doors crash and no further transfers can take place, meaning that we start failing SFTs. The only way to resolve this is a manual restart of the GridFTP doors (which I did a few times during the holidays).
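
As a rough illustration of how the CLOSE_WAIT build-up could be watched, here is a minimal sketch that counts sockets in that state from `netstat -tan` output and compares the count against the ~300 level mentioned above. It assumes a Linux host with net-tools installed, where the connection state is the last column of each TCP row. The function names and the exact threshold are illustrative only, and nothing here is part of dCache itself.

```python
import subprocess

# Approximate level at which the doors were observed to crash (from the text).
CLOSE_WAIT_LIMIT = 300


def count_close_wait(netstat_output):
    """Count TCP sockets in CLOSE_WAIT in the text output of `netstat -tan`."""
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumption: Linux net-tools layout, state in the last column.
        if fields and fields[0].startswith("tcp") and fields[-1] == "CLOSE_WAIT":
            count += 1
    return count


def needs_restart(close_wait_count, limit=CLOSE_WAIT_LIMIT):
    """True once the CLOSE_WAIT count has reached the assumed danger level."""
    return close_wait_count >= limit


def current_close_wait():
    """Run netstat on this host and count CLOSE_WAIT sockets."""
    result = subprocess.run(
        ["netstat", "-tan"], capture_output=True, text=True, check=True
    )
    return count_close_wait(result.stdout)
```

Calling `needs_restart(current_close_wait())` from a cron job could give early warning before the doors fall over, although the restart itself would still be a manual step.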

The dCache developers know that we are suffering from this problem and I intend to raise the issue again at next week's dCache workshop. Hopefully we can get some answers to help us and Lancaster.

Graeme Stewart said...

Good to know that we have an identifiable problem with dCache which is being looked into. Much better than not knowing why things were going wrong.