Edinburgh has recently suffered a number of failures in the SAM replica management tests. A look at the dCache web pages revealed the reason:
http://srm.epcc.ed.ac.uk:2288/usageInfo
Rather than displaying the usual pool usage information, the table contained entries like this:
pool1_04 pool1Domain [99] Repository got lost
The "Repository got lost" error can be explained as follows. dCache periodically runs a background process that writes a small test file to each pool to check that the pool is still operational. If the write fails, the above error message is generated. According to the dCache developers, this occurs if there is a filesystem problem or a disk is not responding quickly enough.
What is strange, however, is that the recent problems have resulted in all of our dCache pools being marked with the above error. It seems unlikely that the same filesystem or disk issue would affect all of the pools simultaneously. I have submitted a ticket to dCache support to get more information.
The problem can easily be fixed by restarting the pool process on the affected dCache pool node. In the above case the node is pool1.epcc.ed.ac.uk, since the pool in question is named pool1_04.
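To make the idea of the check concrete, here is a minimal sketch of a write probe that can be run by hand against a pool's data directory. It is not dCache's own implementation; the function name probe_pool_dir and the probe file name are made up for illustration, and the directory to test is whatever path the pool is configured to use.

```shell
# Hypothetical helper, NOT part of dCache: try to write a small file into
# the given directory, read it back, and clean up afterwards.
# Returns 0 if the directory is writable and readable, 1 otherwise.
probe_pool_dir() {
    dir=$1
    probe="$dir/.write_probe.$$"   # assumed name; $$ keeps it unique per shell

    # Braces let us silence the shell's own "cannot create" message as well.
    if { printf 'probe\n' > "$probe"; } 2>/dev/null &&
       [ "$(cat "$probe" 2>/dev/null)" = "probe" ]; then
        rm -f "$probe"
        return 0
    fi

    rm -f "$probe"
    return 1
}
```

Running something like probe_pool_dir /path/to/pool || echo "write test failed" on the pool node gives a quick indication of whether the filesystem accepts writes. It does not measure how long the write takes, so it will not catch a disk that is merely slow to respond.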
service dcache-pool restart
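When several pools are affected, it can help to work out the node name from the pool name. The sketch below assumes that every pool follows the same pattern as the example above (the node name is the part of the pool name before the underscore, in the epcc.ed.ac.uk domain). The function name pool_host is made up for illustration.

```shell
# Hypothetical helper: map a pool name such as pool1_04 to its node's
# hostname. ASSUMPTION: the node name is everything before the first
# underscore, and the domain defaults to epcc.ed.ac.uk.
pool_host() {
    pool=$1
    domain=${2:-epcc.ed.ac.uk}
    # ${pool%%_*} strips the underscore and everything after it.
    printf '%s.%s\n' "${pool%%_*}" "$domain"
}
```

For example, pool_host pool1_04 gives the node named above. Given suitable remote access (an assumption about the local setup), the restart could then be issued as ssh "$(pool_host pool1_04)" service dcache-pool restart.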
I have started to add material to this page
http://www.gridpp.ac.uk/wiki/Edinburgh_dCache_troubleshooting
to describe our dCache setup in more detail. This should give people a better idea of how a working system should be configured. I will add more information when I have time.
1 comment:
More information can be found here:
http://gridpp-storage.blogspot.com/2007/04/repository-got-lost.html