Wednesday, March 23, 2011

Spanning Tree, oh Spanning Tree

Following last week's power outages we were encountering issues with Spanning Tree reconvergence on our older switching equipment. Since the second power outage, the Nortel 5510 and 5530 switches, which have been stalwarts of the Glasgow cluster install, had been transmitting a much larger number of BPDUs and dropping more packets across all interfaces. The causes of these two issues are partially inter-related. The switches suffered a partial loss of configuration during the second power outage, which caused several services, including their NTP client and Spanning Tree, to behave erratically. To resolve the Spanning Tree issue, the protocol configuration on the Nortel switches was returned to its defaults, shown below:

Hello Time:                 2 seconds
Maximum Age Time:           20 seconds
Forward Delay:              15 seconds
Bridge Hello Time:          2 seconds
Bridge Maximum Age Time:    20 seconds
Bridge Forward Delay:       15 seconds

This stabilised the switches within the older cluster and reduced the volume of BPDUs that were being sent to the core switch.
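
As a sanity check on the default timers listed above, the short Python sketch below tests them against the timer relationship from IEEE 802.1D, 2 x (Forward Delay - 1) >= Maximum Age >= 2 x (Hello Time + 1), and estimates the worst-case reconvergence time as Maximum Age plus two Forward Delay intervals. The class and function names are hypothetical and chosen for illustration only; this is not how the Nortel switches validate their configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StpTimers:
    """Classic 802.1D Spanning Tree timers, all in seconds (illustrative type)."""
    hello_time: int
    max_age: int
    forward_delay: int

    def is_consistent(self) -> bool:
        # 802.1D relationship:
        #   2 * (forward_delay - 1) >= max_age >= 2 * (hello_time + 1)
        upper_ok = 2 * (self.forward_delay - 1) >= self.max_age
        lower_ok = self.max_age >= 2 * (self.hello_time + 1)
        return upper_ok and lower_ok

    def worst_case_convergence(self) -> int:
        # Stale information ages out after max_age, then a port spends one
        # forward_delay in listening and one in learning before forwarding.
        return self.max_age + 2 * self.forward_delay


# The default values restored on the switches, as listed in this post.
DEFAULTS = StpTimers(hello_time=2, max_age=20, forward_delay=15)

if __name__ == "__main__":
    print("consistent:", DEFAULTS.is_consistent())
    print("worst-case convergence (s):", DEFAULTS.worst_case_convergence())
```

With the defaults this gives a consistent set of timers and a worst-case reconvergence of 50 seconds, which is the commonly quoted figure for classic Spanning Tree.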

An overview of the Spanning Tree Protocol is available here:

The second issue, dropped packets and pause frames, was also related to the power outage, which appears to have left several dozen worker nodes with problems communicating across the switch environment. The situation improved once the nodes were taken offline and rebooted after the network reset.

We are still monitoring the situation and will report on any further action that is required.

