We, and a large fraction of the rest of the grid, started to fail replica management tests late on Friday night. At first I thought it must be a catalog problem at CERN, so I raised a ticket. However, it turned out that the VOMS role used to submit the SAM tests had changed. This caused DPM to map the SAM tester DN into a different group, which did not have permission to write into the default generated directory for lcg-cr.
This change was made completely unannounced, and I suspect without any real thought as to the implications for sites using DPM 1.6.3 and earlier.
Maarten Litmaath helpfully posted a fix-up script on LCG-ROLLOUT, which uses ACLs to grant suitable privileges to the lcgadmin and production roles for each supported VO. I applied it to Glasgow and Durham at about midnight last night (I was surely violating cardinal rules of sysadmining, but I couldn't see how it would cause harm, and this time I got away with it). This did fix the problem.
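For anyone who wants the failure mode spelled out, here is a toy model in Python of what I understand happened. It is only a sketch: the `Directory` class, the `map_to_group` function and the VO name `examplevo` are all made up for illustration, and none of this is DPM's real API or the actual fix-up script. It just shows a DN with a new role landing in a different group, losing write access, and getting it back once the lcgadmin and production groups are added to the ACL.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Toy model only: every name here is hypothetical, not DPM's real API.


@dataclass
class Directory:
    owner_group: str
    group_writable: bool = True
    # Extra groups granted write access via ACL entries.
    acl_write_groups: Set[str] = field(default_factory=set)

    def can_write(self, group: str) -> bool:
        """Owner group may write if the group bit allows it; others need an ACL entry."""
        if group == self.owner_group and self.group_writable:
            return True
        return group in self.acl_write_groups


def map_to_group(vo: str, role: Optional[str]) -> str:
    """Map a VO plus an optional VOMS role to a group name (illustrative scheme)."""
    return vo if role is None else f"{vo}/Role={role}"


# Directory created while the tests ran without a role.
generated = Directory(owner_group=map_to_group("examplevo", None))

before = map_to_group("examplevo", None)        # group before the role change
after = map_to_group("examplevo", "lcgadmin")   # group after the role change

print(generated.can_write(before))  # True
print(generated.can_write(after))   # False: different group, no permission

# The fix: grant write access to the role-based groups via the ACL.
for role in ("lcgadmin", "production"):
    generated.acl_write_groups.add(map_to_group("examplevo", role))

print(generated.can_write(after))   # True
```

The real permission logic in DPM is of course more involved than this; the point is only that nothing about the directory changed, just the group the tester was mapped to.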
I'm really annoyed about this, though. Changes like this should never, ever be made on a Friday! (It seems the change actually came through at ~10am, but nothing broke until midnight, when the next YYYY-MM-DD directory needed to be created.) In addition, several people have commented that the fix is to upgrade to DPM 1.6.4, despite the fact that this is broken in gLite 3.0r25 in two significant ways!
Grrrr. I just hope they don't ask us to explain these SFT failures - they shall have a piece of my mind... (I sound just like my Mum, when she was annoyed - see what the grid's doing to me!).