Monday, July 23, 2007

Job submission with DIRAC

In order to get some "real user" experience of performing physics analysis on the Grid, I have been doing a lot of reading and playing with the LHCb computing software. There is a lot of it, so it takes a while to understand what each component does, how the components are linked together, how they are configured and built, and how the applications can be run locally or on the Grid to do some real physics.

I was particularly interested in getting some basic jobs running on the Grid, so I quickly started playing with Ganga, the user interface for job configuration and submission. At first I was quite impressed. It was very simple to use Ganga to submit small jobs to the local system, CERN batch or the Grid (via the LHCb DIRAC workload management system). However, a few problems quickly appeared:

1. Jobs were continually failing on the Grid due to poorly configured software installations on the sites. Missing libraries were the main source of problems. It also seems that the latest version of Gauss v30r3 (LHCb MC generation) is a bit broken due to a mis-configured path. These things weren't a problem with Ganga as such, but using it meant that another layer potentially had to be debugged.

2. I found bulk job submission very difficult in Ganga. Writing the Python code to loop over the jobs is easy, but the client just couldn't handle hundreds of jobs going through it. It became very slow and eventually hung. Even starting up the client is slow. Maybe running on a non-lxplus machine would be better. There were also inconsistencies between the job status shown by Ganga's monitoring and that reported by DIRAC.

As an alternative, I decided to bypass Ganga and use the DIRAC API directly. This proved to be quite successful, being much faster for bulk submission. I put together some notes on this, which can be found here:

Using DIRAC didn't help with the site mis-configurations (although it is easy to get the job output and check the log files for problems), but I found it a more efficient way of working. I'll try again with Ganga once I understand better the problems that keep on appearing on the Grid.
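Checking the retrieved log files for missing libraries is easy to automate. The snippet below is only a rough sketch, not part of DIRAC or Ganga: it assumes the job's standard output has already been fetched as text, and that the failure shows up as the usual dynamic loader message ("error while loading shared libraries"). The function name and the sample log are made up for illustration.

```python
import re

# The dynamic loader's message when a shared library cannot be found.
MISSING_LIB = re.compile(
    r"error while loading shared libraries: (\S+): cannot open shared object file"
)


def missing_libraries(log_text):
    """Return the sorted, de-duplicated names of libraries the loader could not find."""
    return sorted(set(MISSING_LIB.findall(log_text)))


# Hypothetical log excerpt, for illustration only.
sample_log = """\
Starting application
Gauss.exe: error while loading shared libraries: libExample.so: cannot open shared object file: No such file or directory
Job finished with status 127
"""

print(missing_libraries(sample_log))
```

Running this over the output of every failed job and grouping the results by site would give a quick picture of which installations are broken.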

From my brief foray into running jobs on the Grid, it appears that Ganga/DIRAC do insulate users from malfunctioning middleware. However, there are still real problems when it comes to poorly installed software on the sites. From a deployment point of view, maybe this should be taken as encouragement, as the problem is at the application level and not so much with the middleware. I think we would need to do a more systematic study to find this out (much like Steve's ATLAS jobs).

What is needed is better testing of the sites through VO-specific SAM tests. This information then has to be fed back into DIRAC (or whatever) so that mis-configured sites can be ignored until their problems are resolved. Users will then find running jobs on the Grid a much easier and more pleasant experience.
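To make the feedback idea concrete, here is a minimal sketch of the filtering step. It does not use any real SAM or DIRAC interface: the test names, site names and the dictionary layout are all assumptions, standing in for whatever format the test results would actually arrive in. A site is kept only if it passed every required test, and a test with no recorded result counts as a failure.

```python
def usable_sites(test_results, required_tests):
    """Return the sites that passed every required test.

    test_results maps site name -> {test name: True/False}.
    A test with no recorded result is treated as a failure.
    """
    good = []
    for site, results in sorted(test_results.items()):
        if all(results.get(test, False) for test in required_tests):
            good.append(site)
    return good


# Hypothetical test results, for illustration only.
results = {
    "SITE-A": {"software-installed": True, "libraries-present": True},
    "SITE-B": {"software-installed": True, "libraries-present": False},
    "SITE-C": {"software-installed": True},  # no result for the library test
}
required = ["software-installed", "libraries-present"]

print(usable_sites(results, required))
```

Treating a missing result as a failure is the cautious choice here: a site that has not been tested is not sent user jobs until it has been.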
