Monday, October 22, 2007

EGEE: It's broken but it works...

I had meant to blog this a week ago: I did my first shift for ATLAS EGEE production. This was really being thrown in at the deep end, as the twiki instructions were, well, spartan, to say the least. (And definitely misleading in places, actually.)

I was on Data Management, which meant concentrating on problems of data stage-in and stage-out, as well as trying to pick up sites which had broken tool sets.

It felt like a bit of a wild ride - there are almost always problems of some kind, and part of the art is clearly sorting the completely urgent "must be dealt with now" from the simply urgent, down to the "deal with this in a quiet moment".

I found problems at T1s (stage-in, overloaded dCaches, flaky LFCs) and T2s (SEs down, quite a few broken lcg-utils, some sites just generically not working but giving very strange errors). I raised a large number of GGUS tickets, but sometimes it's very difficult to know what the underlying problem is, and it's very time-consuming batting the ticket back and forth with the site.

It's a very different experience from being on the site side. Instead of a "deep" view of a few sites you have a "shallow" view of almost all of them. If you want to read my round-up of issues through the week, it's on indico (it's the DDM shifter report).
