Wednesday, April 11, 2007

Edinburgh storage woes

Edinburgh has been suffering lately from a number of failures in the SAM replica management tests. Looking at the dCache webpages reveals the reason:

http://srm.epcc.ed.ac.uk:2288/usageInfo

Rather than displaying the usual pool usage information, the table contained entries like this:

pool1_04 pool1Domain [99] Repository got lost

The "Repository got lost" error can be explained as follows. dCache periodically runs a background process which writes a small test file onto each pool to check that the pool is still operational. If the write fails, the above error message is generated. According to the dCache developers, this happens if there is a filesystem problem or if a disk is not responding quickly enough.
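
To make the idea concrete, here is a rough sketch of that kind of write test as a shell function. This is not dCache's actual implementation: the function name probe_pool_dir and the test file name are my own inventions, and the sketch only covers the "write fails" case, not the "disk too slow" case.

```shell
#!/bin/sh
# Hypothetical write probe (not dCache code): try to create and remove
# a small test file in the given directory.
# Returns 0 if the write succeeds, 1 otherwise.
probe_pool_dir() {
    dir="$1"
    testfile="$dir/.probe_test.$$"   # assumed name, unique per shell process

    # A failed write hints at a filesystem or disk problem.
    if ! { printf 'probe\n' > "$testfile"; } 2>/dev/null; then
        echo "$dir: write test failed"
        return 1
    fi

    rm -f "$testfile"
    echo "$dir: OK"
    return 0
}
```

Running this by hand against a pool's data directory is a quick way to see whether the filesystem underneath accepts writes at all.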

What is strange, however, is that the recent problems have resulted in all of our dCache pools being marked with the above error. It seems unlikely that the same filesystem or disk issue would affect all of the pools simultaneously. I have submitted a ticket to dCache support to get more information.

The problem can easily be fixed by restarting the pool process on the affected dCache pool node. In the above case it is pool1.epcc.ed.ac.uk since the pool in question is named pool1_04.

service dcache-pool restart
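
When several pools are affected, it helps to work out the node name from the pool name. The sketch below assumes that every pool follows the same pattern as the example above (the part of the pool name before the underscore is the hostname within epcc.ed.ac.uk). The function name pool_to_host is made up for illustration.

```shell
#!/bin/sh
# Assumption: a pool named "pool1_04" lives on node "pool1.epcc.ed.ac.uk",
# i.e. everything before the first underscore is the short hostname.
pool_to_host() {
    pool="$1"
    # ${pool%%_*} strips the first underscore and everything after it.
    echo "${pool%%_*}.epcc.ed.ac.uk"
}

# Example (not run here; assumes ssh access to the pool node):
#   ssh "$(pool_to_host pool1_04)" service dcache-pool restart
```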

I have started to add material to this page

http://www.gridpp.ac.uk/wiki/Edinburgh_dCache_troubleshooting

to describe our dCache setup in more detail. This should give people a better idea of how a working system should be configured. I will add more information when I have time.