Monday, June 22, 2009

Bright and creamy MPI

So, as of the last time MPI was mentioned here, it was working. Well, it looks like it wasn't getting much use, because over the past year or so it seems to have fallen into disrepair.

We'd ended up with MPIexec not being installed on the worker nodes, which was blocking the setup of processes on the nodes. This even prevented a single-process MPI job from running, because that still used MPIexec. In the end, this particular problem was resolved by installing it again (after some careful ramp-up to make sure it didn't knock anything else off).

The phrasing of that last sentence is deliberately precise: it turned out that there was another problem lurking in the swamp water that is middleware. In order to test the install of MPIexec, I grabbed a worker node that was out of production for the HEP-SPEC benchmarking. This had, of course, got a new install of the worker node packages, in order to give a consistent platform with other sites.

Experienced Grid hands might just be able to predict what comes next...

After installing MPIexec on that node, and then restricting that node to just our test VO (Maui is awesome for this sort of tweak), we noticed that it wasn't accepting any jobs. Specifically, jobs were arriving, but failing immediately. Cue finger pointing at MPIexec, and removal of it.
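For the record, one way to fence a node off for a single VO in Maui is a standing reservation; the snippet below is just a sketch of that approach, with made-up node and group names, rather than necessarily the exact tweak we used:

    # maui.cfg - names are hypothetical
    # Reserve the test worker node so only the test VO's unix group can run there
    SRCFG[mpitest] HOSTLIST=node099
    SRCFG[mpitest] GROUPLIST=dteam
    SRCFG[mpitest] PERIOD=INFINITY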

Didn't help.

In the end, Mike resolved this one: an incompatibility between the Torque server and the Torque clients in the new worker node package. Once that was resolved, MPIexec went back on, and it was all working fine. Roll out across the cluster, fingers crossed, and no problems: MPI back in business.

The next step was to actually run MPI jobs - it took a couple of attempts with mpi-start, but we got there. One problem we have is that the WMS will not send MPI jobs to a site that declares its LRMS as 'torque'; it will only send them to sites that declare the LRMS to be 'pbs' or 'lsf'. Given that Torque is identical to PBS (and more common!), that's a bit silly. It's a known bug that's been open for 4 years, with a patch available, which is a bit ridiculous.
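For anyone trying the same thing, the usual recipe is a small wrapper script that sets the mpi-start environment variables and then hands over to it. This is just a minimal sketch along those lines (placeholder filenames, not our actual script):

    #!/bin/bash
    # Minimal mpi-start wrapper: $1 = application binary, $2 = MPI flavour (e.g. openmpi)
    export I2G_MPI_APPLICATION=$1                # what mpi-start should launch
    export I2G_MPI_APPLICATION_ARGS=""           # arguments for the application, if any
    export I2G_MPI_TYPE=$2                       # which MPI flavour mpi-start should drive
    export I2G_MPI_PRE_RUN_HOOK=mpi-hooks.sh     # hook that would normally build the source
    # I2G_MPI_START is set in the site's environment and points at the mpi-start script itself
    $I2G_MPI_START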

There is a workaround, where you can tell the WMS to use a specific LRMS, but you also have to specify the target CE - which kind of defeats much of the point of the WMS...
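I won't reproduce the exact LRMS-forcing incantation here, but for reference the general shape of an mpi-start JDL with the target CE pinned via the Requirements looks something like this (the CE name, executable and file names are all invented):

    JobType       = "Normal";
    CpuNumber     = 8;
    Executable    = "mpi-start-wrapper.sh";
    Arguments     = "my-app openmpi";
    InputSandbox  = {"mpi-start-wrapper.sh", "mpi-hooks.sh", "my-app"};
    StdOutput     = "std.out";
    StdError      = "std.err";
    OutputSandbox = {"std.out", "std.err"};
    Requirements  = Member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
                    && RegExp("ce.example.ac.uk", other.GlueCEUniqueID);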

Fortunately, using the CREAM CE sidesteps most of these issues. Alas, the latest WMS package doesn't work properly with CREAM CEs, so we had to mark our CREAM CE as 'Special', not 'Production' (effectively disabling WMS submission to CREAM). Not too big a problem, as we can do direct submission to the CREAM CE for our specific use case, but it's not great in the long term.
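Direct submission is just the CREAM CLI; something along these lines, with a made-up CE endpoint and queue name:

    # -a requests automatic proxy delegation; -r names the CREAM CE endpoint and queue directly
    glite-ce-job-submit -a -r cream.example.ac.uk:8443/cream-pbs-mpi mpi-job.jdl
    # ...and the returned job ID can then be polled with glite-ce-job-status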

Our specific use case is the Lumerical FDTD package, which is installed and working at Glasgow, and has been used by end users. There's some trickiness involved in this, as we're not passing in source code, as mpi-start expects, so I'll write up a bit more about how it all fits together at some point.

There might be some Maui fiddling in the imminent future, to help it pack MPI jobs onto as few physical machines as possible. The key point is that MPI has been used by end users at Glasgow, which bodes well.
