Friday, March 23, 2007
GridView and Site Reliability
The reliability of Tier-2 sites was a major topic at GridPP 18. This is going to be measured in an automated way from the SAM tests. I was very surprised, though, when I compared the SAM CE tests (the tests I look at every day) against GridView. I measured our test pass rate over the last 3 weeks to be 93% (Durham), 96% (Edinburgh) and 97% (Glasgow). However, GridView had us as low as 80% (see slide 8 of my talk).
John Gordon gave the availability formula in his talk: it's CE & SE & Site BDII & SRM. So I had to check these tests as well. What I discovered was that failures in the SE and SRM tests were pulling us down (site BDII tests were 100%). However, investigating these further, I found that it was in fact BDII failures within the SRM and SE tests which were the problem. More than that, these were not failures in the site's designated BDII: the BDII used by both of those tests had been hardcoded to sam-bdii.cern.ch, and all the failures came from this component.
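To see why a good CE pass rate does not guarantee a good availability figure, here is a minimal sketch of the formula as a logical AND over one round of test results. The service labels and the sample data are made up for illustration, and I am assuming each test yields a simple pass/fail per sampling round; this is not GridView's actual code.

```python
# Service labels are illustrative, not GridView's internal identifiers.
REQUIRED_SERVICES = ("CE", "SE", "sBDII", "SRM")


def site_up(sample):
    """A site counts as up in one sample only if every required test passed."""
    return all(sample[service] for service in REQUIRED_SERVICES)


def availability(samples):
    """Fraction of samples in which the site counted as up."""
    if not samples:
        raise ValueError("no samples")
    return sum(site_up(s) for s in samples) / len(samples)


# Invented data: the CE test passes every time, SE and SRM each fail once.
samples = [
    {"CE": True, "SE": True, "sBDII": True, "SRM": True},
    {"CE": True, "SE": False, "sBDII": True, "SRM": True},
    {"CE": True, "SE": True, "sBDII": True, "SRM": False},
    {"CE": True, "SE": True, "sBDII": True, "SRM": True},
]
print(availability(samples))  # 0.5
```

With a 100% CE pass rate the site still scores only 50% here, because any single failing test takes the whole sample down.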
I have raised a GGUS ticket asking for this to be changed to the site-defined BDII. There is absolutely nothing a site can do about the failure of a central component at CERN.
If one estimates a failure rate of ~2% on information system lookups during replica management tests, then because the information system is used by the CE-RM, SE and SRM tests, i.e., 3 times, a site just cannot get any better than 0.98^3 ≈ 94%!
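The arithmetic behind that ceiling can be sketched as follows, assuming the three lookups fail independently of one another and that nothing else at the site ever goes wrong. The function name is my own, and the 2% figure is the estimate above rather than a measured value.

```python
def best_case_availability(lookup_failure_rate, lookups_per_cycle):
    """Upper bound on availability when every lookup must succeed.

    Assumes lookups fail independently of one another.
    """
    return (1.0 - lookup_failure_rate) ** lookups_per_cycle


ceiling = best_case_availability(0.02, 3)
print(f"{ceiling:.1%}")  # 94.1%
```

If failures of the central BDII are correlated in time (one outage hitting all three tests at once), the ceiling would be somewhat higher, so this is the pessimistic end of the estimate.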
Clearly this identifies the information system as a major problem component, one which needs to be addressed if we are to have any hope of reaching our 95% targets. See Laurence Field's GDB talk for some ideas in this area - retries and caching look to be essential.
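As a rough illustration of what retries plus caching might look like on the client side, here is a small sketch. It is not taken from the GDB talk or from any SAM code: `lookup` stands in for whatever function actually queries the BDII, the cache is a plain dict, and I am assuming a failed query surfaces as an `OSError`.

```python
import time


def resilient_lookup(key, lookup, cache, retries=2, delay=0.0):
    """Call lookup(key) up to retries+1 times, then fall back to the cache.

    `lookup` is a hypothetical callable that raises OSError on failure.
    `cache` is a dict holding the last good answer for each key.
    """
    for attempt in range(retries + 1):
        try:
            cache[key] = lookup(key)  # remember the last good answer
            return cache[key]
        except OSError:
            if attempt < retries:
                time.sleep(delay)  # brief pause before trying again
    if key in cache:
        return cache[key]  # stale, but better than failing the test
    raise LookupError(f"no live or cached answer for {key!r}")
```

A caller would keep one `cache` dict alive between test runs, so that a short outage of the information system returns the previous answer instead of a failure. Whether a stale answer is acceptable depends on what is being looked up.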
I have one other serious issue, and one quibble, with GridView. The serious thing is the 10% quantisation on the plots, which is a nonsense when our target is 95%. The quibble is the stupid mapping of the GOC sitename to a quite unguessable 5-letter abbreviation (the old scotgrid-gla site was "GLSGW", the new site is "SCOTG"). Clearly they should learn from the EGEE accounting pages and give us a hierarchical tree view with the correct site names.