Tuesday, January 19, 2010

pick a torque, any torque

Since hitting a seg-faulting mom during our SL5 upgrade (2.3.6 server & mom), I have compiled a variety of Torque versions of late and trialled them out. I have now come to a conclusion and am sticking with the 2.3.* series: 2.3.9 at the moment, well, until another bug is found!

2.3 Series

2.3.6 - seg-faulting mom during some unidentifiable race condition
2.3.7 - untested
2.3.8 - Operators/Managers Lists Bug
2.3.9 - Seems stable

2.4 Series - Beta

2.4.2 - OSC MPIEXEC Bug
2.4.3 - OSC MPIEXEC Bug Fixed & Operators/Managers Lists Bug
2.4.4 - OSC MPIEXEC Bug Back in

11 comments:

Christos Triantafyllidis said...

Hi,
it looks like torque is failing (seg faults) when you mix versions between client and server.

We had the same issue when using the 2.3.0 version of torque server and 2.3.6 version of clients.

The other way around seemed to work (having updated server and downgraded client).

Christos

dug mcnab said...

Hey,

Well, we thought that this was the case too. But when we had the segfaults we had 2.3.6 on both client and server. We fixed the issue by upgrading the mom to the various versions stated in the post.

So I would be wary of just blaming the different version of mom and server. Although I would rather have them the same!

Dug

Steve Traylen said...

On a related note, 2.3.9 is now available for some testing as an EPEL release.

https://admin.fedoraproject.org/updates/torque

packages available from the epel-testing repos.

The significant difference is /var/spool/pbs becomes /var/torque.

I've tested it with the current EGEE maui release, and at small scales at least it works for trivial items.

chris said...

Hi Dug,

Have you reported the crash at all? The Torque community is very active and usually you can get a quick turnaround. It'd help strengthen the product as well.

http://www.supercluster.org/pipermail/torquedev

http://www.supercluster.org/pipermail/torqueusers

Tim Dyce said...

Hey,

We saw the same issues with the 2.3.6 torque release in Melbourne. We upgraded to Steve's 2.3.9 packages for server and mom, and no problems since.

Tim

Steve Traylen said...

Hi Tim,

Could you describe what was needed to use the EPEL packages, in particular with respect to the /var/spool/pbs -> /var/torque migration?

Steve

Tim Dyce said...

Hi Steve,

The /var/spool/pbs -> /var/torque migration (as in the EPEL packages) was pretty straightforward; the big gotchas lie in adding the stuff that YAIM normally does for you, since the YAIM setup will auto-configure the worker nodes but only populate the /var/spool/pbs directory.

For the worker nodes this meant adding:
- /var/torque/server_name
- /var/torque/mom_priv/config
We distributed these via cfengine, but you could just as easily copy the existing YAIM generated versions:
- /var/spool/pbs/server_name
- /var/spool/pbs/mom_priv/config

For the server side, since we upgraded both, we needed to alter cfengine to point at the following (we never let YAIM generate these anyway):
- /var/torque/server_name
- /var/torque/server_priv/nodes
Which you can get from:
- /var/spool/pbs/server_name
- /var/spool/pbs/server_priv/nodes
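
If you are copying the YAIM-generated files rather than distributing them with cfengine, the copy described above could be sketched roughly as follows. This is only an illustration, not what we actually ran: the function name and the two file lists are made up for the example, and it assumes the old tree is /var/spool/pbs and the new one /var/torque as discussed here. Ownership and directory permissions are not handled.

```python
import shutil
from pathlib import Path

# Relative paths taken from the lists above; the list names are illustrative.
WORKER_FILES = ["server_name", "mom_priv/config"]
SERVER_FILES = ["server_name", "server_priv/nodes"]


def copy_torque_config(relative_paths, old_root="/var/spool/pbs", new_root="/var/torque"):
    """Copy each listed file from old_root to new_root, keeping the relative layout.

    Returns the destination paths that were written.
    """
    copied = []
    for rel in relative_paths:
        src = Path(old_root) / rel
        dst = Path(new_root) / rel
        if not src.is_file():
            # Nothing was generated for this file in the old tree; skip it.
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 keeps timestamps and permission bits
        copied.append(str(dst))
    return copied
```

On a worker node you would call copy_torque_config(WORKER_FILES), and on the server copy_torque_config(SERVER_FILES), before restarting the relevant daemon.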

Updating APEL
The APEL parser on our LCG-CE uses the pbs logs, NFS exported from the pbs server, to reconcile against the gatekeeper logs and generate the accounting information. Our LCG-CE and PBS server are separate, so I just updated the NFS export on the PBS server and the autofs import on the LCG-CE.
If the LCG-CE and the PBS server were on just one host, you would need to update /opt/glite/etc/glite-apel-pbs/parser-config-yaim.xml (take a look at /etc/cron.d/edg-apel-pbs-parser). Update the line containing:
/var/spool/pbs/server_priv/accounting
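
For the single-host case, that edit could be scripted along these lines. It is a rough sketch only: it assumes the new accounting directory is /var/torque/server_priv/accounting (inferred from the /var/spool/pbs -> /var/torque move, so check it on your own install), the function name is made up, and it does a plain text substitution without parsing the XML.

```python
from pathlib import Path

OLD_ACCOUNTING = "/var/spool/pbs/server_priv/accounting"
# Assumed new location, derived from the /var/spool/pbs -> /var/torque move.
NEW_ACCOUNTING = "/var/torque/server_priv/accounting"


def repoint_accounting_dir(config_path, old=OLD_ACCOUNTING, new=NEW_ACCOUNTING):
    """Replace the old accounting path in a config file.

    Writes a .bak copy of the original next to the file when a change is made
    and returns the number of occurrences replaced.
    """
    path = Path(config_path)
    text = path.read_text()
    count = text.count(old)
    if count:
        backup = path.with_name(path.name + ".bak")
        backup.write_text(text)  # keep the original for rollback
        path.write_text(text.replace(old, new))
    return count
```

Pass it the path of the parser config mentioned above; a return value of 0 means the old path was not found and the file was left untouched.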

CREAM-CE
Depending on which version of the blahd parser you are using, you may need to update the parser on the cream CE.

Our PBS server and moms have been nice and stable since the upgrade. We have seen some occasional PBS crankiness when the pbs_server service is restarted, but that's pretty normal.

The only thing to be wary of is that any YAIM reconfigures will no longer have any effect on the PBS config since they will operate on now defunct files.
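
One way to notice when a YAIM rerun has rewritten the old files without the live ones following is to compare the two trees. A minimal sketch, assuming the same two directory roots as above; the function name is made up for the example and the list of files to check is whatever you migrated.

```python
import filecmp
from pathlib import Path


def find_config_drift(relative_paths, old_root="/var/spool/pbs", new_root="/var/torque"):
    """Return the relative paths whose old-tree copy differs from the new-tree copy.

    Files that do not exist in the old tree are ignored; files that exist in the
    old tree but are missing from the new tree are reported as drifted.
    """
    drifted = []
    for rel in relative_paths:
        old_file = Path(old_root) / rel
        new_file = Path(new_root) / rel
        if not old_file.is_file():
            continue
        # shallow=False forces a byte-by-byte comparison rather than a stat check.
        if not new_file.is_file() or not filecmp.cmp(old_file, new_file, shallow=False):
            drifted.append(rel)
    return drifted
```

Running this after any YAIM reconfigure, for example with ["server_name", "mom_priv/config"] on a worker node, gives a list of files that may need to be carried over to /var/torque by hand.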

Tim

Tim Dyce said...

Also, you will want to look at the YAIM variables....
BATCH_LOG_DIR

Steve Traylen said...

I was about to release 2.3.9, but given 2.3.10 is out I have added that to EPEL testing instead.

https://admin.fedoraproject.org/updates/torque

It will take around three weeks before I can release this new one.

Steve Traylen said...

Hi,

They're released: the full monty of i386/x86_64 for RHEL4 and 5, and in fact ppc and s390 if you have one lying around.

libtorque.x86_64 2.3.10-1.el4 epel
libtorque.i386 2.3.10-1.el4 epel
libtorque-devel.i386 2.3.10-1.el4 epel
libtorque-devel.x86_64 2.3.10-1.el4 epel
torque.x86_64 2.3.10-1.el4 epel
torque-client.x86_64 2.3.10-1.el4 epel
torque-docs.x86_64 2.3.10-1.el4 epel
torque-gui.x86_64 2.3.10-1.el4 epel
torque-mom.x86_64 2.3.10-1.el4 epel
torque-pam.x86_64 2.3.10-1.el4 epel
torque-pam.i386 2.3.10-1.el4 epel
torque-scheduler.x86_64 2.3.10-1.el4 epel
torque-server.x86_64 2.3.10-1.el4 epel

Apologies for the Debian-style package naming. Not my choice.