Friday, July 06, 2007

Success with MPI

Cracked it! I can now get MPI jobs running on the Glasgow cluster.

First thing to note is that the gatekeeper does not invoke mpirun for the job - this is very good, because it would be almost impossible to get this to work if it did.

The key file is the NODELIST file, which the CE generates and adds to the executable's argument list. When this file is given as the argument to the -p4pg option, mpirun will ssh to all of the "slave" nodes and start the binary named in the NODELIST file.
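
To make this concrete, a p4 procgroup file has one line per host of the form "hostname nprocs program" (the master host's line conventionally shows 0 extra processes). So the NODELIST the CE writes looks roughly like this - hostnames and paths invented for illustration - with the program column pointing back at the job's own executable:

node001.beowulf.cluster 0 /tmp/globus-job-xyz/wrapper.sh
node002.beowulf.cluster 1 /tmp/globus-job-xyz/wrapper.sh
node003.beowulf.cluster 1 /tmp/globus-job-xyz/wrapper.sh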

By default this breaks for 2 reasons:
  1. The gatekeeper only copies the job's sandbox into the working directory of the "master" worker node. So on the "slave" nodes the executable isn't present. (N.B. Even though we have a shared data area for our glaNNN accounts, the working directory is always in /tmp and local to the worker node.)
  2. The executable listed really needs to be a wrapper script, so it's the wrong thing for mpirun to be starting anyway.
So, the wrapper script really has to do the following:
  1. Change to a more sensible shared directory (like $CLUSTER_SHARED).
  2. Rewrite the NODELIST file so that the name of the correct mpi binary to run is given, instead of the wrapper script itself.
  3. Invoke mpirun, giving the new NODELIST file.
Here's an example (with a lot of debugging hooks) which works for running the code:
#! /bin/sh
#
# Argument list is: BINARY -p4pg NODELIST -p4wd PATH
# What's really important for us is the NODELIST file, i.e., $3
cd $CLUSTER_SHARED/mpi
export MYBIN=$1
PGFILE=`pwd`/pgfile.`hostname -s`.$$
echo My Args: $@
echo "----"
echo "Original NODELIST file:
cat $3
echo "----"
cat $3 | perl -ne 'print "$1 $2 /cluster/share/gla012/mpi/$ENV{\"MYBIN\"}\n" if /^([\w\.]+)\s+(\d+)/;' > $PGFILE
echo "----"
echo "New NODELIST file:
cat $PGFILE
echo "----"
/opt/mpich-1.2.7p1/bin/mpirun $MYBIN -p4pg $PGFILE
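
Putting that together: the wrapper is what the CE actually runs, with the real binary's name as its first argument followed by the options described in the comment at the top of the script, so the invocation on the master node ends up looking something like this (the wrapper name, binary name and job directory here are invented for illustration):

./mpi-wrapper.sh cpi_test -p4pg /tmp/globus-job-12345/NODELIST -p4wd /tmp/globus-job-12345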

There are, however, two problems which I can see.
  1. Accounting. Looking at the Torque logs it's clear that only the master node's process is being accounted for; the slave nodes' MPI processes are not. Do we multiply the master node's CPU and wall time by the number of nodes as an interim measure?
  2. Orphaned and stray processes. As ssh is used to start the binary on the slave nodes, what happens if the code leaves them behind or they run away?
I wonder if there's a way we can modify mpirun to do things in a Torque-friendly way? I shall enquire of the MPI gurus.

(For more formal documentation, watch this wiki page....)
