Wednesday, June 13, 2007

Bad Worker Node, Bad, Bad...

node040 had the wrong default gateway, routing via svr031 instead of nat005. Once I pointed it at nat005, external networking started to work. When svr031 was reinstalled we turned off packet routing on it, but node040 was down at the time we switched the rest of the cluster over to nat005 (when svr031 lost its brain). It never got the change, and the default gateway is set at install time - it's not controlled by cfengine.

In theory this should never happen again - all the install-time network files are now correct, and nodes should be re-installed when they are brought back into service. However, we should also implement a nagios monitor that tests external networking from the worker nodes.
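
That monitor doesn't need to be anything fancy. Here's a minimal sketch of what such a check might look like - the target host and port below are placeholders, and in practice it would run on each worker node (via NRPE or similar) against something outside the site that the nodes genuinely need to reach:

#!/usr/bin/env python
"""Minimal sketch of a Nagios-style check for outbound connectivity
from a worker node.  TARGET_HOST/TARGET_PORT are placeholders, not
anything from our setup."""

import socket
import sys

TARGET_HOST = "www.example.org"  # placeholder: some external host
TARGET_PORT = 80                 # placeholder: port it should answer on
TIMEOUT = 10                     # seconds before we give up

def main():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(TIMEOUT)
    try:
        s.connect((TARGET_HOST, TARGET_PORT))
        s.close()
    except socket.error as e:
        # Nagios convention: exit 2 = CRITICAL
        print("CRITICAL: cannot reach %s:%d - %s" % (TARGET_HOST, TARGET_PORT, e))
        sys.exit(2)
    # Nagios convention: exit 0 = OK
    print("OK: external networking reachable via %s:%d" % (TARGET_HOST, TARGET_PORT))
    sys.exit(0)

if __name__ == "__main__":
    main()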

The reason the problem kicks in when the cluster gets full is that node040 can't run jobs (it can't get the input sandbox from the RB), so the RB job wrapper gives up after ~5 minutes, frees up the job slot, and another job gets sucked into its doom. When the cluster is less full this is much more sporadic, as a resubmitted job is likely to land on a different (functioning) node. When I looked at the logs I saw that node040 killed a remarkable number of jobs during its 8-day reign of terror: 1835 out of 11885 were sent to their doom (15.4%).

For interest, I wrote a little python log parser which prints, per node, the number of jobs and the average cpu and wall time (in minutes) - a rough sketch of something similar is included below the listing. Even just running it over Thursday's pbs log shows up node040 as a bad place to be:

svr016:/var/spool/pbs/server_priv/accounting# nodestat.py 20070607 | sort -k 2 -n
node043: 4 1069.9 1165.9
node045: 4 1080.3 1178.0
node048: 4 1185.8 1195.9
node013: 5 1171.3 1345.5
node032: 5 959.9 1030.7
[...]
node088: 12 438.0 445.3
node019: 13 433.7 437.4
node023: 13 667.8 719.0
node103: 13 397.1 430.7
node139: 62 70.4 80.3
node040: 646 0.5 4.7
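
The actual script is nothing special, but for the curious, here's a rough sketch of how such a parser might look. It assumes the usual torque accounting record format ('MM/DD/YYYY HH:MM:SS;TYPE;JOBID;key=value key=value ...'), pulls the node name out of exec_host on the 'E' (job end) records, and averages resources_used.cput and resources_used.walltime per node:

#!/usr/bin/env python
"""Rough sketch of a nodestat.py-style parser for PBS/torque accounting
logs (not the original script).  Usage: nodestat.py YYYYMMDD-logfile,
then pipe through 'sort -k 2 -n' to rank nodes by job count."""

import sys

def hms_to_minutes(hms):
    """Convert an HH:MM:SS string to minutes."""
    h, m, s = [int(x) for x in hms.split(":")]
    return h * 60 + m + s / 60.0

def main(logfile):
    stats = {}  # node -> [job count, total cpu minutes, total wall minutes]
    for line in open(logfile):
        fields = line.rstrip().split(";", 3)
        if len(fields) < 4 or fields[1] != "E":
            continue  # only job-end records carry the resource usage
        attrs = dict(kv.split("=", 1) for kv in fields[3].split() if "=" in kv)
        if "exec_host" not in attrs:
            continue
        # exec_host looks like 'node040/0' (or 'node040/0+node041/0' for
        # multi-slot jobs); just credit the first node here.
        node = attrs["exec_host"].split("+")[0].split("/")[0]
        cpu = hms_to_minutes(attrs.get("resources_used.cput", "0:0:0"))
        wall = hms_to_minutes(attrs.get("resources_used.walltime", "0:0:0"))
        n = stats.setdefault(node, [0, 0.0, 0.0])
        n[0] += 1
        n[1] += cpu
        n[2] += wall
    for node, (count, cpu, wall) in sorted(stats.items()):
        print("%s: %d %.1f %.1f" % (node, count, cpu / count, wall / count))

if __name__ == "__main__":
    main(sys.argv[1])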

Note that node139 is the node most of the ops tests run on, because they have a reserved job slot - even though the reservation isn't strictly tied to a node, they rarely run anywhere else. This is really a pain: without that reservation I'm sure we'd have picked up the problem earlier, since ops jobs never get resubmitted, so a failure on node040 would have shown up straight away. Perhaps we should remove the reservation. I'm fairly confident that with 500+ job slots something will come free in under an hour (6 days / 500 ~ 20 min).
