I previously blogged about our Torque 2.3.6 moms on SL5 continually seg-faulting. At first we thought it was a 32/64-bit mismatch between our SL4 and SL5 moms running through the same pbs_server. However, a quick test with the SL4 nodes removed showed that this was not the case. A trawl through the source proved unproductive.
Therefore, it was time to go to Plan B. To that end I built the latest Torque release, 2.4.2, and tested it on our pre-prod staging cluster. It worked well with a 2.3.6 server and 2.4.2 moms. The next test was on a single node in production. This was successful: the node was running jobs fine when all the other moms seg-faulted again, and the 2.4.2 mom survived and continued to run. So a full roll-out is under way. We will think about upgrading the server at a later date. The only point to note is that we have to fully drain a node before upgrading it, which is a pain. The upgrade does attempt a job conversion, but as far as we can tell these conversions are unsuccessful and you end up with dead jobs holding onto job slots.
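For anyone scripting the drain step, here is a rough sketch of how it could be checked from Python. It is not what we run in production. It assumes `pbsnodes -o <node>` marks a node offline and that the per-node output of `pbsnodes <node>` carries a `jobs = ...` line while jobs are still present; the helper names and the sample output layout are made up for illustration, so check them against your own Torque version.

```python
import subprocess
from typing import List


def running_jobs(pbsnodes_output: str) -> List[str]:
    """Return the entries on the 'jobs = ...' attribute line, if there is one.

    Assumes the usual 'key = value' layout of pbsnodes output; the
    'jobs=' field buried inside the 'status = ...' line is ignored.
    """
    for line in pbsnodes_output.splitlines():
        key, sep, value = line.strip().partition(" = ")
        if sep and key == "jobs":
            return [job.strip() for job in value.split(",") if job.strip()]
    return []


def is_drained(pbsnodes_output: str) -> bool:
    """A node counts as drained when no jobs are listed against it."""
    return not running_jobs(pbsnodes_output)


def mark_offline(node: str) -> None:
    """Stop new jobs landing on the node (hypothetical wrapper around pbsnodes -o)."""
    subprocess.run(["pbsnodes", "-o", node], check=True)


def node_status(node: str) -> str:
    """Fetch the raw pbsnodes report for one node (hypothetical wrapper)."""
    result = subprocess.run(
        ["pbsnodes", node], capture_output=True, text=True, check=True
    )
    return result.stdout
```

In use, you would call `mark_offline("node01")`, then poll `is_drained(node_status("node01"))` until it returns true before touching the mom on that node. The node name here is just a placeholder.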
So the moral of the story is stay away from 2.3.6 and go to 2.4.2 instead.
It is pretty easy to build, but I have hosted our builds here for anyone who wants them.