First thing to note is that the gatekeeper does not invoke mpirun for the job - this is very good, because it would be almost impossible to get this to work if it did.
The key file is the NODELIST file, which the CE generates and adds to the executable's argument list. When this file is given as the argument to the -p4pg option, mpirun will ssh to each of the "slave" nodes and start the binary named in the NODELIST file.
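For reference, the NODELIST (p4 procgroup) file is just one line per node of the form "hostname  number-of-processes  path-to-binary". The hostnames and path below are invented, but for a four-node job it looks something like this:

node001.example.org 0 /tmp/jobdir/my_mpi_prog
node002.example.org 1 /tmp/jobdir/my_mpi_prog
node003.example.org 1 /tmp/jobdir/my_mpi_prog
node004.example.org 1 /tmp/jobdir/my_mpi_prog

(The first line is the master node; its count of 0 means "no extra processes here", since mpirun starts the master process directly and ssh-es out for the rest.)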
By default this breaks for 2 reasons:
- The gatekeeper only copies the job's sandbox into the working directory of the "master" worker node. So on the "slave" nodes the executable isn't present. (N.B. Even though we have a shared data area for our glaNNN accounts, the working directory is always in /tmp and local to the worker node.)
- The executable listed really needs to be a wrapper script, so it's the wrong thing for mpirun to be starting anyway.
The fix is a wrapper script which does the following:
- Change to a more sensible shared directory (like $CLUSTER_SHARED).
- Rewrite the NODELIST file so that the name of the correct mpi binary to run is given, instead of the wrapper script itself.
- Invoke mpirun, giving the new NODELIST file.
The wrapper script currently looks like this:

#! /bin/sh
#
# Argument list is: BINARY -p4pg NODELIST -p4wd PATH
# What's really important for us is the NODELIST file, i.e., $3
# Work from a shared directory that all of the worker nodes can see
cd $CLUSTER_SHARED/mpi
export MYBIN=$1
# Unique name for the rewritten NODELIST (procgroup) file
PGFILE=`pwd`/pgfile.`hostname -s`.$$
echo My Args: $@
echo "----"
echo "Original NODELIST file:
cat $3
echo "----"
# Rewrite each "hostname nprocs ..." line so that it points at the real MPI
# binary in the shared area, rather than at this wrapper script
cat $3 | perl -ne 'print "$1 $2 /cluster/share/gla012/mpi/$ENV{MYBIN}\n" if /^([\w\.]+)\s+(\d+)/;' > $PGFILE
echo "----"
echo "New NODELIST file:
cat $PGFILE
echo "----"
# Finally, invoke mpirun ourselves, handing it the rewritten NODELIST file
/opt/mpich-1.2.7p1/bin/mpirun $MYBIN -p4pg $PGFILE
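To make the rewrite concrete (hostnames and paths here are made up; only the structure matters): suppose the CE hands us a NODELIST file like

node001.beowulf 0 /tmp/globus-job-dir/wrapper.sh
node002.beowulf 1 /tmp/globus-job-dir/wrapper.sh
node003.beowulf 1 /tmp/globus-job-dir/wrapper.sh

then the perl one-liner keeps the hostnames and process counts but swaps the third field, so the new file reads

node001.beowulf 0 /cluster/share/gla012/mpi/my_mpi_prog
node002.beowulf 1 /cluster/share/gla012/mpi/my_mpi_prog
node003.beowulf 1 /cluster/share/gla012/mpi/my_mpi_prog

where my_mpi_prog stands for whatever was passed as $1. The slave nodes can then ssh in and start the real binary from the shared area, instead of looking for a wrapper script in a /tmp directory that only exists on the master.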
There are, however, two problems which I can see.
- Accounting. Looking at the torque logs it's clear that only the master node's process is being accounted for; the slave nodes' MPI processes are not. Do we multiply the master node's CPU and wall time by the node count as an interim measure? (A rough sketch of that scaling follows this list.)
- Orphaned and stray processes. As ssh is used to start the binary on the slave nodes, what happens if the code leaves them behind or they run away?
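On the accounting question, a very rough interim scaling could be done offline against the torque accounting records. The sketch below is illustrative only, not something we run: the accounting file location depends on the install, and the field names used (resources_used.cput, Resource_List.nodect) should be checked against our torque version.

#!/bin/sh
# Sketch only: multiply the master node's reported cput by the job's node
# count, taken from the "E" (job end) records in a torque accounting file.
ACCT=$1   # path to a torque accounting file (location depends on the install)

grep ';E;' $ACCT | while read rec; do
    job=`echo "$rec"   | cut -d';' -f3`
    nodes=`echo "$rec" | sed -n 's/.*Resource_List\.nodect=\([0-9]*\).*/\1/p'`
    cput=`echo "$rec"  | sed -n 's/.*resources_used\.cput=\([0-9:]*\).*/\1/p'`
    [ -z "$nodes" -o -z "$cput" ] && continue
    # HH:MM:SS -> seconds, then scale by the node count
    secs=`echo "$cput" | awk -F: '{print $1*3600 + $2*60 + $3}'`
    scaled=`expr $secs \* $nodes`
    echo "$job: master cput ${secs}s, scaled to ${scaled}s over $nodes nodes"
done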
(For more formal documentation, watch this wiki page....)