Thursday, July 28, 2011

Circuits, Circuits everywhere but not a drop to switch

Since the late afternoon of the 26th of July we have been working to resume service on the Cluster at Glasgow.
We were put into unexpected downtime by our old friend, the power cut.

The root cause of this appears to be that the local mains supply into the site failed and was subsequently reinstated. Rather than restart immediately, we decided to wait until Wednesday morning to bring the cluster back up, to ensure that there was a clean and stable supply into the site. So off to the GOCDB to announce the unscheduled downtime, and then we proceeded.

While normally we would have started on getting the cluster back online straight away, as it turned out we couldn't have got ourselves back into production any sooner anyway, due to the residual issues caused by the power outage. As we have had several power interruptions at the site over the last 10 months, we now have a reasonably robust restart procedure, and we started working through it on Wednesday morning.

Initially the reset of both rooms went without issue, bar the loss of a rather expensive 10 Gig Ethernet interface on one of the new Dell switches and the loss of that switch's configuration, the latter caused by yours truly not running a "copy run start" on the switch after configuring a LAG group and QoS. We reconfigured the switch and confirmed that connectivity across the cluster was good.

We then proceeded to rebuild one of our internal stacks to free up the 10 Gig interfaces on a Nortel 5530, which we had planned to move to our lower server room to build out the second 10 Gig link mentioned in a previous post. This too went surprisingly well, although Dave and I had pretested building the stack, adding and removing devices, and inserting new base units on older test equipment.

We then retested: stacking and LAGs were working fine, spanning tree was happy and the cluster's network was in good shape. We then moved to phase 2 of the upgrade, which was to insert the 5530 into the switch stack in the downstairs server room. After we inserted it, the switch came up, the entire stack stabilised and it started to forward traffic.

However, about 3 minutes later we started to see latency in the network rise and hosts failing to contact one another. Ping, SSH and normal cluster traffic such as NFS, NTP and DNS also started to experience issues. We reduced the load on the network by detaching hosts from it, but to no avail. We then removed the 5530 from the stack, but the problem remained. Over the next 4 hours we tried a variety of tests, all of which ended with either the dreaded Host Unreachable or 142 millisecond response times. To make matters more confusing, the switches were reporting an inter-room response time of 0.50 milliseconds via ping, yet telnet and SSH between devices were timing out.

As we were unable to ascertain the exact root cause, we called a break and went and got some air.

20 minutes and one pizza slice later, it occurred to me that if no device on the network was generating traffic at the volume required to produce a 94% packet loss scenario across multiple 10 Gig connections, then it had to be the network itself. Or rather, what was attached to it.

The cooked 10 Gig interface wasn't the cause, as it was already dead at this point, but the power cut had left another present:

Damaged Ethernet Cables.

As the cluster is too large to go round and manually check every cable individually with a line tester, we did something that I, as a former telco engineer, don't like doing: we rebooted the switches in numbered sequence, starting with Stack01.

The purpose of this test is to isolate the damaged cable, device or interface as quickly as possible, by pinging across the cluster from one room to the other and intra-switch if need be.

So Ping from Svr001 (upstairs) to Node141 (downstairs).
Destination Host Unreachable.
Leave the ping running.
Reboot Stack01.
Ping response time of 0.056 milliseconds.
Stack01 reloads.
Destination Host Unreachable.

We repeated this test twice and got the same result.
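We weren't scripting any of this at the time, but the continuous ping-and-log loop we left running amounts to something like the sketch below (Python, with hypothetical hostnames); lining the timestamps up against the times each stack was rebooted points straight at the offending kit.

#!/usr/bin/env python
# Rough sketch: keep pinging a host on the far side of the cluster and log
# each result with a timestamp, so the output can be correlated with the
# times at which each switch stack was rebooted. Hostname is illustrative.
import subprocess
import time
from datetime import datetime

TARGET = "node141"          # host in the downstairs room (hypothetical name)
INTERVAL = 1.0              # seconds between probes

def probe(host):
    """Send a single ping and return the rtt field, or None if unreachable."""
    try:
        out = subprocess.check_output(
            ["ping", "-c", "1", "-W", "1", host],
            stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError:
        return None
    for line in out.decode().splitlines():
        if "time=" in line:
            return line.split("time=")[1].strip()
    return None

while True:
    rtt = probe(TARGET)
    stamp = datetime.now().strftime("%H:%M:%S")
    print("%s %s" % (stamp, rtt if rtt else "Destination Host Unreachable"))
    time.sleep(INTERVAL)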

So onto Stack01. The partner switch that trunks into this stack to effect an uplink onto the core of our network did not report any errors on the multi-link trunk, but it also showed very little traffic. Neither did Stack01, until I tried to ping its loopback address from the partner switch: the error rate on the interfaces increased and CRC errors were recorded. So we systematically disabled the multi-link trunk, link by link, until the stack interconnect stabilised.
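We did this by eye against the switch counters, but a small poller along these lines (a sketch only; the management address, community string and ifIndex values are made up and would need to match the real kit) would have shown which MLT members were still accumulating input/CRC errors as each link was disabled:

#!/usr/bin/env python
# Rough sketch: poll the input error counters on the MLT member ports via
# SNMP (using the net-snmp command line tools) and report any counter that
# is still climbing. Address, community and ifIndex values are illustrative.
import subprocess
import time

SWITCH = "stack01-mgmt"                 # hypothetical management address
COMMUNITY = "public"                    # read-only community (assumed)
MLT_PORTS = {25: "1/25", 26: "1/26", 49: "2/25", 50: "2/26"}  # ifIndex -> port

def if_in_errors(ifindex):
    """Read IF-MIB::ifInErrors for one interface and return it as an int."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", SWITCH,
         "IF-MIB::ifInErrors.%d" % ifindex])
    return int(out.strip())

last = {idx: if_in_errors(idx) for idx in MLT_PORTS}
while True:
    time.sleep(10)
    for idx, name in MLT_PORTS.items():
        now = if_in_errors(idx)
        if now > last[idx]:
            print("port %s: +%d input errors" % (name, now - last[idx]))
        last[idx] = now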

Disabling those links reduced the trunk's capacity substantially, but it also stabilised the network. So we added the 5530 back into the stack downstairs, turned the partner ports upstairs back on and were rewarded with a 20 Gig backbone, which is now operational at the Glasgow site.

As for the old LAG connection, it was stripped out completely this morning, and by early afternoon we had reinstated a 6 Gig connection to Stack01, which is working happily. From there we brought the site out of downtime and we are back on the Grid.

We are putting in place an internal TFTP process for backing up the switch configurations each night.
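The plan is a small script driven from cron, in the spirit of the sketch below. Treat the hostnames, prompts and the copy command as placeholders: the exact syntax differs between our Nortel and Dell kit, and a Cisco-style command is used here purely as a stand-in.

#!/usr/bin/env python
# Rough sketch of a nightly config backup, run from cron on an internal
# host alongside a TFTP server. Login prompts and the "copy" syntax are
# placeholders and would need adapting per vendor.
import pexpect
from datetime import date

TFTP_SERVER = "10.0.0.5"                      # internal TFTP host (assumed)
SWITCHES = ["stack01-mgmt", "stack02-mgmt"]   # hypothetical management names
PASSWORD = "REDACTED"

for switch in SWITCHES:
    dest = "tftp://%s/%s-%s.cfg" % (TFTP_SERVER, switch, date.today())
    session = pexpect.spawn("telnet %s" % switch, timeout=30)
    session.expect("[Pp]assword:")
    session.sendline(PASSWORD)
    session.expect("[>#]")
    # Cisco-style syntax used as a stand-in for the vendor-specific command.
    session.sendline("copy running-config %s" % dest)
    session.expect("[>#]")
    session.sendline("logout")
    session.close()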

The main lesson from this is that in a large layer 2 environment, the smallest issue can become a major one. Plans are well advanced for the next set of configuration changes to the network at Glasgow, to get around this and other potential issues in the future.





