Wednesday, February 17, 2010

more openmpi tweaking

Whilst testing MPI on our cluster to get it into a usable state, I uncovered a rather nasty bug in openmpi-1.3.4. It manifested itself as jobs never being able to run on a node with more than 4 cores. It was a weird one: Open MPI communication over two nodes worked fine with 8 cores on each node, but when a job requested more than 4 cores on the same node, the job just hung. An strace of the mpiexec process suggested some sort of timeout/wait issue.
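
Roughly what the testing boiled down to was something like this (mpi_hello and the hostfile name are just placeholders, not our actual test code):

    mpiexec -np 4 ./mpi_hello                        # single node, 4 cores: fine
    mpiexec -np 8 ./mpi_hello                        # single node, >4 cores: hangs with 1.3.4
    mpiexec -np 16 -hostfile two_nodes ./mpi_hello   # 8 cores on each of two nodes: fine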

In the release notes for openmpi-1.4.1 it appears they had found this bug and provided a fix:
- Fix a shared memory "hang" problem that occurred on x86/x86_64
platforms when used with the GNU >=4.4.x compiler series.

This sounded plausible, and in fact an upgrade has fixed the issue.

So now, with all 8 cores running on the same node, the next issue to arise was one related to Maui. Sometimes when you requested nodes=8, Maui scheduled the job on 3 cores; a qdel and a resubmission later, Maui rescheduled the job onto 5 cores. On one test I even qrun'd the job and it appeared to start on the correct number of nodes, but there seemed to be no reason for Maui not to get this right. So it was time to get out the Maui docs.
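
If you want to see the mismatch for yourself, something along these lines shows it (the script name and job id here are made up):

    qsub -l nodes=8 job.sh    # ask for 8
    qstat -n 123              # Torque's view of the cores the job actually got
    checkjob 123              # Maui's view of the allocation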

From the docs:
Maui is by default very liberal in its interpretation of <NODECOUNT>:PPN=<X>. In its standard configuration, Maui interprets this as 'give the job <NODECOUNT>*<X> tasks with AT LEAST <X> tasks per node'. Set the JOBNODEMATCHPOLICY parameter to EXACTNODE to have Maui support PBS's default allocation behavior of <NODECOUNT> nodes with exactly <X> tasks per node.

This seemed to suggest that Maui's default behaviour is to pack a job onto as few nodes as possible. So I tried setting JOBNODEMATCHPOLICY to EXACTNODE, and this seems to have done the trick.
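
For the record, this is a one-line change in maui.cfg (the path below is just where an install typically keeps it) followed by a restart of Maui:

    # e.g. /usr/local/maui/maui.cfg
    JOBNODEMATCHPOLICY EXACTNODE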

nodes=24 means 24 nodes, not 8, not 6, but 24

This does have a drawback in that it will be 24 separate nodes. The setting relies upon being able to specify :ppn (processes per node), so that nodes=3:ppn=8 gives 24 cores, which is really what you want to say: you probably have fast machines with loads of memory and cores, so it makes more sense to target all the cores on a few nodes than to spread across 24 nodes. However, it is a start.
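
In other words, what you really want to be able to submit is something like this (the application name and walltime are just a sketch):

    #PBS -l nodes=3:ppn=8
    #PBS -l walltime=01:00:00
    cd $PBS_O_WORKDIR
    mpiexec -np 24 ./my_mpi_app   # 3 nodes x 8 cores = 24 processes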

Wouldn't it be nice if you could specify :ppn in JDL? The only way round this I can see for now is to manually change the job manager, or to use the local batch attributes of CREAM to allow a custom CERequirements expression to be specified. Possible, but not nice.
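
Just to sketch the shape of that CREAM workaround (the expression below is hypothetical, not something I have tested, and BLAH would still have to translate it for the local batch system):

    Executable = "job.sh";
    // hypothetical: push the node/ppn requirement through CERequirements
    CERequirements = "nodes == 3 && ppn == 8";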

2 comments:

Jeff Squyres said...

Glad that we fixed the problem for you in Open MPI! :-)

Be sure to let us know on the users list if you have any suggestions, tweaks to push upstream, etc.

Dennis van Dok said...

We've come a little closer to getting ppn into the JDL. At least once this bug is fixed, the last step should be to make the translation in BLAH to the local batch system. It may take a while but we're getting there...