Friday, January 12, 2007

End of the week - more transfer tests (and a quick Perl script to summarise logs). Colin's tweaks have vastly improved inbound speeds to UKI-SCOTGRID-GLASGOW from Edinburgh, to over 600Mb/s. "Nice!"

Having problems transferring to/from RAL-T1 though. I suspect typos in the SRM URLs are more likely than system failures. Symptoms are:

Reason: Failed on SRM put: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP: Error reading token data: Success; also failing to do 'advisoryDelete' on target.
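
If typos in the SRM URLs really are the culprit, a quick sanity check before submitting transfers might catch some of them. The following is only a rough sketch: the function name `srm_url_problems` and the host `se.example.org` are made up for illustration, and it checks nothing beyond the general `srm://host:port/path` shape, so a URL that passes can still point at the wrong endpoint.

```python
from urllib.parse import urlsplit


def srm_url_problems(url):
    """Return a list of obvious formatting problems in an SRM URL.

    Hypothetical helper: it only checks the general URL shape,
    not whether the endpoint exists or answers.
    """
    problems = []
    if url != url.strip():
        problems.append("leading or trailing whitespace")
    parts = urlsplit(url.strip())
    if parts.scheme != "srm":
        problems.append(f"scheme is {parts.scheme!r}, expected 'srm'")
    if not parts.hostname:
        problems.append("missing host name")
    try:
        parts.port  # raises ValueError if the port is not a valid number
    except ValueError:
        problems.append("port is not a valid number")
    if not parts.path.startswith("/"):
        problems.append("path does not start with '/'")
    return problems


if __name__ == "__main__":
    # Made-up URLs, purely for illustration.
    for candidate in (
        "srm://se.example.org:8443/dpm/example.org/home/dteam/test",
        "smr://se.example.org:8443/dpm/example.org/home/dteam/test",
        "srm:/se.example.org/dpm/example.org/home/dteam/test",
    ):
        print(candidate, "->", srm_url_problems(candidate) or "looks OK")
```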

Plan for next week - More of the same - check that size isn't an issue (hope not!)

Wednesday, January 10, 2007

Andrew has been pushing on with file transfer tests using the new site. The good news is that initial indicators are that we seem to be able to get much better rates: up to 650Mb/s incoming from Edinburgh.

However, he managed to provoke an MCE error twice on disk041 - so this server definitely has bad memory.

Fortunately dpm-drain decided to start working (last time I tried to get data off this machine it failed with various bizarre errors) so it was very simple to move the 14GB of files off and remove this partition from the DPM. (Most of the data was dteam test data, so fingers crossed there's been no corruption of the small number of pheno files.)

ClusterVision have been informed and David is working on a reliable way of triggering these errors so that we can test machines before they go back into production.

After updating to the latest gLite release (r11) on the old scotgrid site, the site started to fail JS with the much-maligned message "Unspecified job manager error".

It turned out to be the classic problem: the WN cannot ssh back to the CE to transfer job output. After a bit of poking around and playing I decided that, for a site which has just 48 hours of run time left before it gets shut down, it was just not worth the hassle of trying to debug this - so I have closed the queues and put the site into downtime, where it can languish until it is withdrawn on Friday.

Farewell, scotgrid-gla!

Happy New Year ScotGrid!

Despite running unattended over the Christmas holidays, ScotGrid sites were by-and-large available.

From 2006-12-20 to 2007-01-03 SAM status for each site was:

ScotGRID-Edinburgh: SAM test pass rate 241/338 = 71.3%
(Failures: JS 3, RM 94)
scotgrid-gla: SAM test pass rate 324/338 = 95.9%
(Failures: JS 4, RM 10)
UKI-SCOTGRID-DURHAM: SAM test pass rate 318/338 = 94.1%
(Failures: JS 6, RM 14)
UKI-SCOTGRID-GLA: SAM test pass rate 328/338 = 97.0%
(Failures: JS 1, RM 9)
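
For anyone wanting to re-derive these figures, here is a small sketch that recomputes each pass rate from the counts above and checks that passes plus failures add up to the number of tests. The dictionary layout and the `pass_rate` helper are just illustrative choices, and rounding to one decimal place is an assumption about how the percentages are meant to be quoted.

```python
# (passed, total, failures by test type), copied from the list above.
results = {
    "ScotGRID-Edinburgh": (241, 338, {"JS": 3, "RM": 94}),
    "scotgrid-gla": (324, 338, {"JS": 4, "RM": 10}),
    "UKI-SCOTGRID-DURHAM": (318, 338, {"JS": 6, "RM": 14}),
    "UKI-SCOTGRID-GLA": (328, 338, {"JS": 1, "RM": 9}),
}


def pass_rate(passed, total):
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100.0 * passed / total, 1)


for site, (passed, total, failures) in results.items():
    # Every test either passed or is counted as a JS or RM failure.
    assert passed + sum(failures.values()) == total, site
    print(f"{site}: {passed}/{total} = {pass_rate(passed, total):.1f}%")
```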


Looking at the JS and RM failures for everyone except Edinburgh, these seem to be at the level of grid flakiness caused by higher-level services failing or timing out (e.g., RM test times out, BDII fails to respond, etc.). So our only significant problem was Edinburgh's dCache, which did seem to be unreliable.

Any comments, Greig? Did you have to intervene?