Thursday, November 19, 2009

Segfaulting PbsMoms

We have an issue with segfaulting moms that seems correlated with the server trying to ping its moms. The server version is torque-2.3.6-2cri.x86_64.
We are currently supporting two OSes through the same batch system using a submit filter and node properties. Therefore, we have two different versions of moms.
Nodes 1->295 run moms torque-2.3.6-2cri.x86_64 and nodes 296->309 run moms torque-2.1.9-4cri.slc4.i386.

When the segfaults happen, the torque-2.1.9 moms stay up and all of the torque-2.3.6 moms die. I ran one of the 2.3.6 moms through GDB and got the following call stack:

(gdb) where
#0 mom_server_find_by_ip (search_ipaddr=177078032) at mom_server.c:450
#1 0x000000000041965e in mom_server_valid_message_source (stream=0) at mom_server.c:2022
#2 0x0000000000419870 in is_request (stream=0, version=1, cmdp=0x7fffff542ae8) at mom_server.c:2125
#3 0x0000000000416997 in do_rpp (stream=0) at mom_main.c:5351
#4 0x0000000000416a52 in rpp_request (fd=) at mom_main.c:5408
#5 0x00002ae8ae9f3bc8 in wait_request (waittime=, SState=0x0) at ../Libnet/net_server.c:469
#6 0x0000000000416c1d in main_loop () at mom_main.c:8046
#7 0x0000000000416ee1 in main (argc=1, argv=0x7fffff5431d8) at mom_main.c:8148
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) n
Program not restarted.
(gdb) bt full
#0 mom_server_find_by_ip (search_ipaddr=177078032) at mom_server.c:450
__v =
pms = (mom_server *) 0x6cbb80
addr =
#1 0x000000000041965e in mom_server_valid_message_source (stream=0) at mom_server.c:2022
addr = (struct sockaddr_in *) 0x187ef434
pms = (mom_server *) 0x0
id = 0x43be08 "mom_server_valid_message_source"
#2 0x0000000000419870 in is_request (stream=0, version=1, cmdp=0x7fffff542ae8) at mom_server.c:2125
command =
ret = 0
pms =
ipaddr =
id = "is_request"


So it looks like it's time to dive into the source for mom_server_find_by_ip (search_ipaddr=177078032) at mom_server.c:450, or install torque-2.4!
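Before reading the source, it may help to know which host the search_ipaddr value in frame #0 refers to. The following is a minimal sketch, assuming the value is an IPv4 address held as a 32-bit integer. The trace does not show which byte order the mom uses, so both readings are printed. The helper name decode_ipaddr is made up for illustration and is not part of torque.

```python
import ipaddress


def decode_ipaddr(value: int) -> dict:
    """Render a 32-bit integer as a dotted quad in both byte orders.

    Hypothetical helper: the byte order used by the mom is an
    assumption, so both interpretations are returned.
    """
    # Reading 1: the integer is already in host (big-endian numeric) order.
    host_order = ipaddress.IPv4Address(value)

    # Reading 2: the four bytes are reversed (e.g. a network-order value
    # read on a little-endian machine).
    swapped_value = int.from_bytes(value.to_bytes(4, "big"), "little")
    byte_swapped = ipaddress.IPv4Address(swapped_value)

    return {"host_order": str(host_order), "byte_swapped": str(byte_swapped)}


if __name__ == "__main__":
    # Value taken from frame #0 of the backtrace above.
    result = decode_ipaddr(177078032)
    print(result["host_order"])    # 10.141.255.16
    print(result["byte_swapped"])  # 16.255.141.10
```

Whichever of the two addresses matches a machine on the cluster network is presumably the sender of the message the mom was processing when it crashed.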

1 comment:

Graeme Stewart said...

This is getting rather urgent to resolve. We're leaking about 10% of ATLAS production work with jobs killed by torque. It definitely happens in bursts.

If we have to change torque versions (and it seems to me we have to), we should make the move as soon as we can.