Wednesday, February 17, 2010

more openmpi tweaking

Whilst testing MPI on our cluster and getting it into a usable state, I uncovered a rather nasty bug with openmpi-1.3.4. It manifested itself as jobs never being able to run on a single node with more than 4 cores. It was a weird one: OpenMPI communication over two nodes worked fine with 8 cores on each node, but as soon as a job requested more than 4 cores on the same node, the job just hung. An strace of the mpiexec process suggested some sort of TIMEOUT/WAIT issue.
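
For reference, the hang was easy to reproduce by hand; something along these lines (assuming a trivial MPI test program such as ./mpi_hello) shows the behaviour:

  # on the affected node; hangs under openmpi-1.3.4 built with gcc >= 4.4
  mpiexec -np 8 ./mpi_hello

  # attach to the stuck launcher; the output shows repeated timeout/wait calls
  strace -f -p <pid of mpiexec>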

In the release notes for openmpi-1.4.1 it appears they discovered this bug and provided a fix:
- Fix a shared memory "hang" problem that occurred on x86/x86_64
platforms when used with the GNU >=4.4.x compiler series.

This sounded plausible, and indeed an upgrade to 1.4.1 fixed the issue.

So now, with all 8 cores running on the same node, the next issue to arise was one related to Maui. Sometimes when you requested nodes=8, Maui scheduled the job on 3 cores; a qdel and a resubmission later, Maui rescheduled the job onto 5 cores. On one test I even qrun'd the job and it appeared to start on the correct number of nodes, but there seemed to be no reason why Maui wasn't getting this right. So it was time to get out the Maui docs.

From the docs:
Maui is by default very liberal in its interpretation of NODES=<X>:PPN=<Y>. In its standard configuration, Maui interprets this as 'give the job <X>*<Y> tasks with AT LEAST <Y> tasks per node'. Set the JOBNODEMATCHPOLICY parameter to EXACTNODE to have Maui support PBS's default allocation behaviour of <X> nodes with exactly <Y> tasks per node.

This seemed to suggest that Maui's default behaviour is to pack a job into as few nodes as possible. So I tried out setting the JOBNODEMATCHPOLICY to EXACTNODE and this seems to have done the trick.
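
For anyone wanting to do the same, it is a one-line change in maui.cfg (the path varies by install; restart command is an assumption for your init setup):

  # maui.cfg
  JOBNODEMATCHPOLICY EXACTNODE

  service maui restart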

nodes=24 means 24 nodes, not 8, not 6 but 24

This does have a drawback in that it really will be 24 separate nodes. The setting relies on being able to set :ppn (processes per node), so that nodes=3:ppn=8 gives 24 cores, which is usually what you actually want to say: you probably have fast machines with plenty of memory and cores, so you would rather fill all the cores on a few nodes than spread across 24 of them. However, it is a start.
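
To make the difference concrete (the job script name is just a placeholder):

  qsub -l nodes=24 myjob.sh        # with EXACTNODE: 24 separate nodes, one task each
  qsub -l nodes=3:ppn=8 myjob.sh   # 3 nodes with 8 tasks per node: the same 24 cores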

Wouldn't it be nice if you could specify :ppn in JDL? The only way round this I can see for now is to manually change the job manager, or to use the local batch attributes of CREAM to allow a custom CERequirements expression to be specified. Possible, but not nice.
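
Roughly what I have in mind is a sketch like this; CERequirements is the CREAM JDL hook for passing extra requirements towards the batch system, but the expression shown and its mapping onto nodes/ppn are pure assumption and would have to be wired up in the local job manager:

  [
    Executable = "myjob.sh";
    // hypothetical expression: the local BLAH/job manager config would have to
    // translate this into -l nodes=3:ppn=8 for Torque/Maui
    CERequirements = "nodes == 3 && ppn == 8";
  ]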

Tuesday, February 16, 2010

cream sours

Well, we now know how much it takes to kill our CREAM instance. Yesterday it stopped working completely and appeared to be caught in a tailspin with the Lease and Proxy Renew processes within CREAM. Grepping the logs indicated that most of the Renewal and Lease Manager entries were related to Condor submission from ATLAS.

From speaking to Massimo at INFN, it turns out that Proxy and Lease renewals are operations executed with higher priority than other commands. One hypothesis is that the CREAM CE was so overloaded with these commands that it was unable to deal with basic job submission, since all the test jobs I submitted never made it out of the REGISTERED state.

It looked bad on Ganglia:


The first course of action was to disable job submission using the command line tool glite-ce-disable-submission and try to deal with the renewals. This worked for a time, but they recurred later that evening.
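
For the record, disabling and later re-enabling submission is done against the CE endpoint, something like this (the hostname is a placeholder):

  glite-ce-disable-submission ce01.example.ac.uk:8443
  glite-ce-enable-submission ce01.example.ac.uk:8443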

The timestamps on these ATLAS CREAM jobs seemed to be very old and hinted at stale jobs, so the next course of action was to manually purge the database using the tool provided by the CREAM developers: here. The easiest way I could see to do this was to connect to creamdb, select out the job ids and create a script that called the purger for each id. Note: you need JDK 1.6 in order to run the purger!
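
In outline the purge looked something like the sketch below; the table and column names and the purger script name are placeholders, as the real ones depend on the CREAM version, so check the purger's documentation before copying this:

  # dump candidate stale job ids from the CREAM database (schema names are an assumption)
  mysql -N -u cream -p creamdb \
    -e "SELECT creamJobId FROM job WHERE ...;" > stale_ids.txt

  # call the purger (purger.sh stands in for the real script) once per id
  while read id; do
    ./purger.sh "$id"
  done < stale_ids.txt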

This ended up removing around 3000 CREAM entries.

Ganglia looked much happier:


So I think you have to be careful when getting submissions from Condor at the moment, as it looks to be quite easy to denial-of-service your CREAM CE.

Roll on CREAM 1.6

- The proxy renewal is not very efficient in the release now in production (already addressed in the coming CREAM CE release: see here)
- When there are too many pending commands, new job submissions will be disabled by the limiter: see here