Saturday, October 11, 2008

chew 'em up, spit 'em out...

Failed SAM tests all day. When I checked the logs they'd all run on
node006. Logged in and...

Oct 11 16:58:40 node006 smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Oct 11 17:28:40 node006 smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Oct 11 17:58:41 node006 smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Oct 11 18:00:14 node006 pbs_mom: Invalid argument (22) in mem_sum, 5754: get_proc_stat
Oct 11 18:13:23 node006 pbs_mom: Invalid argument (22) in resi_sum, 8121: get_proc_stat
Oct 11 18:28:40 node006 smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Oct 11 18:58:40 node006 smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Oct 11 19:28:40 node006 smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Oct 11 19:44:32 node006 pbs_mom: Invalid argument (22) in resi_sum, 9482: get_proc_stat

Took it offline and immediately we're back.

It's just amazing that one bad node in 142 can kill off a whole site for SAM... it took out 3626 jobs in less than 12 hours.

This is really Torque's fault - it should have a bad node sensor at the batch system level.

(As an aside it didn't affect ATLAS production at all, because if a node is so bad that the pilot doesn't start then it never pulls in a real job.)

1 comment:

Craig Macdonald said...

Torque supports node health checks. Not sure how you would detect a failing disk from a shell script, though.
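
One possible approach, as a rough sketch rather than a tested recipe: since smartd already writes the pending-sector warnings to syslog (the lines in the excerpt above), a health check could simply scan the log for them. The version below is in Python rather than shell. It assumes the log lives at /var/log/messages, that pbs_mom is pointed at the script via $node_check_script (with $node_check_interval) in its config file, and that pbs_mom treats output starting with ERROR as a failed check; verify all of this against the documentation for your Torque version. The function names are made up for illustration.

```python
import re

# Matches smartd syslog lines such as:
#   ... smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
PENDING_RE = re.compile(
    r"smartd\[\d+\]: Device: ([^,\s]+), (\d+) "
    r"Currently unreadable \(pending\) sectors"
)


def pending_sectors(log_lines):
    """Return {device: most recently logged pending-sector count}."""
    latest = {}
    for line in log_lines:
        match = PENDING_RE.search(line)
        if match:
            latest[match.group(1)] = int(match.group(2))
    return latest


def health_message(log_lines):
    """Return an 'ERROR ...' line if any device has pending sectors, else ''."""
    bad = {dev: n for dev, n in pending_sectors(log_lines).items() if n > 0}
    if not bad:
        return ""
    details = ", ".join(f"{dev}: {n}" for dev, n in sorted(bad.items()))
    return f"ERROR smartd reports unreadable sectors ({details})"


if __name__ == "__main__":
    LOG_PATH = "/var/log/messages"  # assumption: adjust to your syslog setup
    try:
        with open(LOG_PATH, errors="replace") as fh:
            message = health_message(fh)
    except OSError:
        message = ""  # an unreadable log is not treated as a node failure here
    if message:
        print(message)
```

A caveat: old warnings stay in the log after a disk is swapped, so in practice you would want to look only at recent entries (or at the rotated log) before trusting the result.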