We currently support two OSes through the same batch system using a submit filter and node properties, so we run two different versions of the moms.
Nodes 1-295 run torque-2.3.6-2cri.x86_64 moms and nodes 296-309 run torque-2.1.9-4cri.slc4.i386 moms.
When the moms segfault, the torque-2.1.9 moms stay up and only the torque-2.3.6 moms die, all of them at once. I ran one of them through GDB and can see the call stack:
(gdb) where
#0 mom_server_find_by_ip (search_ipaddr=177078032) at mom_server.c:450
#1 0x000000000041965e in mom_server_valid_message_source (stream=0) at mom_server.c:2022
#2 0x0000000000419870 in is_request (stream=0, version=1, cmdp=0x7fffff542ae8) at mom_server.c:2125
#3 0x0000000000416997 in do_rpp (stream=0) at mom_main.c:5351
#4 0x0000000000416a52 in rpp_request (fd=) at mom_main.c:5408
#5 0x00002ae8ae9f3bc8 in wait_request (waittime=, SState=0x0) at ../Libnet/net_server.c:469
#6 0x0000000000416c1d in main_loop () at mom_main.c:8046
#7 0x0000000000416ee1 in main (argc=1, argv=0x7fffff5431d8) at mom_main.c:8148
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) n
Program not restarted.
(gdb) bt full
#0 mom_server_find_by_ip (search_ipaddr=177078032) at mom_server.c:450
__v =
pms = (mom_server *) 0x6cbb80
addr =
#1 0x000000000041965e in mom_server_valid_message_source (stream=0) at mom_server.c:2022
addr = (struct sockaddr_in *) 0x187ef434
pms = (mom_server *) 0x0
id = 0x43be08 "mom_server_valid_message_source"
#2 0x0000000000419870 in is_request (stream=0, version=1, cmdp=0x7fffff542ae8) at mom_server.c:2125
command =
ret = 0
pms =
ipaddr =
id = "is_request"
So it looks like it is time to dive through the source for
mom_server_find_by_ip (search_ipaddr=177078032) at mom_server.c:450
or install torque-2.4!
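Before reading the source, it may help to turn the search_ipaddr value from frame #0 into a dotted-quad address, to see which host the offending message appears to come from. Below is a minimal Python sketch, assuming the value is a plain 32-bit IPv4 address. The backtrace does not say which byte order the integer is in, so the helper decode_ipaddr (a hypothetical name, not part of torque) returns both readings.

```python
import ipaddress
import struct


def decode_ipaddr(value):
    """Return both plausible dotted-quad readings of a 32-bit address.

    Hypothetical helper for illustration only; not part of torque.
    """
    # Reading 1: treat the integer as-is (most significant byte first).
    as_is = str(ipaddress.IPv4Address(value))
    # Reading 2: same four bytes in reverse order, in case the value
    # was printed in the opposite byte order.
    swapped_value = struct.unpack("<I", struct.pack(">I", value))[0]
    swapped = str(ipaddress.IPv4Address(swapped_value))
    return as_is, swapped


as_is, swapped = decode_ipaddr(177078032)
print(as_is)    # 10.141.255.16
print(swapped)  # 16.255.141.10
```

The first reading, 10.141.255.16, looks like a private-range address, which would fit a cluster-internal host. Which machine that is, and whether the moms should be accepting messages from it, still needs to be checked against the site configuration.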
1 comment:
This is getting rather urgent to resolve. We're leaking about 10% of ATLAS production work with jobs killed by torque. It definitely happens in bursts.
If we have to change torque versions (and it seems we do), we should make the move as soon as we can.