In the release notes for openmpi-1.4.1 it appears they had discovered this bug and provided a fix:
- Fix a shared memory "hang" problem that occurred on x86/x86_64
platforms when used with the GNU >=4.4.x compiler series.
This sounded plausible, and indeed an upgrade fixed the issue.
With all 8 cores now running on the same node, the next issue to arise was one related to Maui. Sometimes when you requested nodes=8, Maui scheduled the job onto 3 cores; after a qdel and a resubmission, Maui rescheduled the job onto 5 cores. In one test I even qrun'd the job and it appeared to start on the correct number of nodes, yet there seemed to be no reason why Maui wasn't getting this right on its own. So it was time to get out the Maui docs.
From the docs:
Maui is by default very liberal in its interpretation of
This seemed to suggest that Maui's default behaviour is to pack a job into as few nodes as possible. So I tried setting JOBNODEMATCHPOLICY to EXACTNODE, and this seems to have done the trick.
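For reference, this is a one-line change in Maui's configuration file (the file path varies by install; /usr/local/maui/maui.cfg is just a common default):

```
# maui.cfg -- make Maui interpret nodes=X as X distinct physical nodes
# rather than packing the requested task count onto fewer nodes
JOBNODEMATCHPOLICY EXACTNODE
```

Maui needs a restart (or a schedctl recycle) before the new policy takes effect.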
nodes=24 means 24 nodes: not 8, not 6, but 24.
This does have a drawback: the job will land on 24 separate nodes. The setting therefore relies on being able to add :ppn (processes per node), so that nodes=3:ppn=8 gives 24 cores, which is really what you want to say. Since you probably have fast machines with plenty of memory and cores, you would rather fill all the cores on a few nodes than spread across 24 of them. However, it is a start.
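As a sketch, a Torque/PBS submission script asking for 24 cores as 3 full 8-core nodes would look something like this (the job name and application binary are made up for illustration):

```
#!/bin/bash
#PBS -N mpi_test
# 3 distinct nodes, 8 processes per node = 24 cores in total.
# With JOBNODEMATCHPOLICY EXACTNODE, Maui honours the node count exactly.
#PBS -l nodes=3:ppn=8

cd $PBS_O_WORKDIR
mpirun -np 24 ./my_mpi_app
```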
Wouldn't it be nice if you could specify :ppn in JDL? The only way round this I can see for now is to manually change the job manager, or to use the local batch attributes of CREAM to allow a custom CERequirements value to be specified. Possible, but not nice.
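For what it's worth, the CREAM workaround might look roughly like the fragment below. This is only a sketch: the CERequirements attribute does exist in CREAM JDL, but exactly which expressions your BLAH/batch-system scripts will translate into a nodes=3:ppn=8 request is site-specific and would need local customisation.

```
# JDL fragment (illustrative only; the WholeNodes/HostNumber-style
# expression assumes your CREAM/BLAH scripts know how to map it)
Executable     = "my_mpi_app";
CpuNumber      = 24;
CERequirements = "other.WholeNodes == true && other.HostNumber == 3";
```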