Failed SAM tests all day. When I checked the logs they'd all run on
node006. Logged in and...
Oct 11 16:58:40 node006 smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Oct 11 17:28:40 node006 smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Oct 11 17:58:41 node006 smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Oct 11 18:00:14 node006 pbs_mom: Invalid argument (22) in mem_sum, 5754: get_proc_stat
Oct 11 18:13:23 node006 pbs_mom: Invalid argument (22) in resi_sum, 8121: get_proc_stat
Oct 11 18:28:40 node006 smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Oct 11 18:58:40 node006 smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Oct 11 19:28:40 node006 smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Oct 11 19:44:32 node006 pbs_mom: Invalid argument (22) in resi_sum, 9482: get_proc_stat
Took it offline and we were immediately back in business.
It's just amazing that one bad node in 142 can kill off a whole site for SAM... it took out 3626 jobs in less than 12 hours.
This is really Torque's fault - it should have a bad node sensor at the batch system level.
(As an aside, it didn't affect ATLAS production at all: if a node is so bad that the pilot doesn't start, it never pulls in a real job.)
1 comment:
Torque supports node health checks. Not sure how you would detect a failing disk from a shell script, though.
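For what it's worth, here's a rough sketch of how that might be wired up. pbs_mom can be told to run a periodic health check via $node_check_script and $node_check_interval in mom_priv/config; as I understand it, if the script prints a line starting with "ERROR" the message gets passed up to the server (and the server's down_on_error setting can then take the node out automatically). Something like the following (the device name, script path and check interval are just placeholders) could catch a disk accumulating pending sectors by asking smartctl:

#!/usr/bin/env python
# Hypothetical Torque node health check: complain if any drive reports
# pending (unreadable) sectors.  Wire it in via mom_priv/config, e.g.:
#   $node_check_script /usr/local/sbin/check_disk.py
#   $node_check_interval 10
import subprocess
import sys

DEVICES = ["/dev/hda"]                # drives to watch; adjust per node
ATTRIBUTE = "Current_Pending_Sector"  # SMART attribute 197

def pending_sectors(device):
    """Return the raw Current_Pending_Sector count from 'smartctl -A', or None."""
    try:
        proc = subprocess.Popen(["smartctl", "-A", device],
                                stdout=subprocess.PIPE,
                                universal_newlines=True)
        output = proc.communicate()[0]
    except OSError:
        return None                   # smartctl missing or not executable
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == ATTRIBUTE:
            return int(fields[9])     # RAW_VALUE is the last column
    return None

def main():
    problems = []
    for dev in DEVICES:
        count = pending_sectors(dev)
        if count is None:
            problems.append("%s: could not read SMART attributes" % dev)
        elif count > 0:
            problems.append("%s: %d pending sectors" % (dev, count))
    if problems:
        # pbs_mom treats output beginning with "ERROR" as a health check failure
        print("ERROR " + "; ".join(problems))
        sys.exit(1)
    print("OK")

if __name__ == "__main__":
    main()

Whether you let the node take itself offline automatically or just flag it for a human is a matter of taste; either way it beats losing 3626 jobs overnight.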