Friday, June 13, 2008

Problems up north

We have two major problems in ScotGrid right now:

ECDF: Have been failing SAM tests for over a week now. The symptom is that the SAM test is submitted successfully and runs correctly on the worker node, but the job outputs never seem to get back to the WMS, so eventually the job is timed out as a JS failure. As usual we cannot reproduce the problem with dteam or ATLAS jobs (in fact ATLAS condor jobs are running fine), so we are hugely puzzled. Launching a manual SAM test through the CIC portal doesn't help because the test gets into the same state and hangs for 6 hours, so you cannot submit another one. Sam has asked for more network ports to be opened to give a larger globus port range, but the network people in Edinburgh seem to be really slow in doing this (and it seems it is not the root cause anyway).

Durham: Have suffered a pair of serious problems on their two SE hosts. The RAID filesystem on the headnode (gallows) was lost last week and all the data is gone. Then this week the large se01 disk server suffered an LVM problem and we can no longer mount grid home areas or access data on the SRM. Unfortunately Phil is on holiday, David is now off sick and I will be away on Monday, so hopefully we can cobble something together to get the site running on Tuesday.

Thankfully, dear old Glasgow T2 is running like a charm right now (minor info publishing and WMS problems aside). In fact our SAM status for the last month is 100%, neck and neck with the T1! Fingers crossed we keep it up.

