Wednesday, September 20, 2006

Edinburgh have been failing the replica management SFTs for over a week now. Initially, only one of the RM sub-tests (CE-sft-lcg-rm-rep) was failing, due to a server timeout problem (which I am sure was not a problem at our end, since the dCache has been used almost continuously for the ongoing inter-T2 transfer tests). However, on Monday afternoon we started failing four of the sub-tests, with an error that indicated a permissions problem with the /pnfs/epcc.ed.ac.uk/data/ops directory on our dCache (the SFTs are now running as ops). At the same time, the ops SAM tests started to fail with a similar error. Meanwhile, the dteam SAM tests were all green.

To try to work out what the problem was, I used /opt/edg/etc/grid-mapfile-local to map my DN to the ops VO. I could then use srmcp to copy files into and out of the ops directory of the dCache. There were some problems using the lcg-cr command, but it was unclear whether this was to do with me trying to interact with the ops file catalog when my DN would map me to dteam. I also changed the dCache configuration to something more basic, just to check that this was not causing some problem, but this had no impact on the SFT results. Note that it can be very tricky to debug a problem with a VO that you are not a member of.
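When debugging this sort of thing, it helps to check which account a given DN actually ends up mapped to. The following is only a rough sketch: it assumes the usual grid-mapfile layout of one quoted DN followed by an account name per line, and it assumes that the first matching entry wins. The function names and the DNs in SAMPLE are made up for illustration; they are not taken from our real mapfile.

```python
import shlex


def parse_gridmap(text):
    """Return a list of (dn, account) pairs from grid-mapfile text.

    Assumes each non-comment line looks like:  "<DN>" <account>
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = shlex.split(line)  # handles the double-quoted DN
        if len(parts) >= 2:
            entries.append((parts[0], parts[1]))
    return entries


def lookup(entries, dn):
    """Return the account of the first entry matching dn, or None.

    First-match-wins is an assumption about how the mapfile is consulted.
    """
    for entry_dn, account in entries:
        if entry_dn == dn:
            return account
    return None


# Hypothetical entries, for illustration only.
SAMPLE = '''
# local overrides
"/C=UK/O=eScience/OU=Edinburgh/CN=example user" .ops
"/C=UK/O=eScience/OU=Edinburgh/CN=another user" .dteam
"/C=UK/O=eScience/OU=Edinburgh/CN=example user" .dteam
'''

if __name__ == "__main__":
    entries = parse_gridmap(SAMPLE)
    print(lookup(entries, "/C=UK/O=eScience/OU=Edinburgh/CN=example user"))
```

If the account that comes back is not the one for the VO you think you are testing as, that is a strong hint that the mapping, rather than the storage itself, is at fault.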

However, at about 2230 last night, the SFTs switched back to failing only on the single RM sub-test, and at about 0100 the ops SAM tests started passing again. Strange, I know. When I checked LCG-ROLLOUT this morning, there had already been a few postings about the RM tests failing with dCaches at other sites. It appears that the cause was that Judit Novak (who helps run the SFTs) had joined the ops VO, but her DN was still being mapped to dteam within the grid-mapfile. She has now unregistered from ops and has stopped (I think) the SFTs for today to ensure that the grid-mapfiles are up to date.

I'll update tomorrow if I see that things have changed.
