Since the last post on the blog we have implemented a series of measures on the network which were originally planned for deployment during the next cluster refresh.
Primarily, we have migrated several of our core servers, such as svr020, svr001 and svr008, to the new Dell switch infrastructure and have introduced a series of Link Aggregation Groups (LAGs) across the Dell estate to raise its backbone to a full 20 Gigabits per second between switches. This has led to the decommissioning of the core 10 Gigabit interconnect into our old Nortel gateway, stack01, which has been replaced with another LAG between the Dells and stack01. The reason behind this will become clear in the next post.
The main upshot of this part of the network upgrade is that we now have greater control over the network services and monitoring running on these servers, such as SNTP and Ganglia respectively. These can be tuned more finely in the Dell environment to minimise the broadcast and Layer 2 multicast impact of these services.
However, that is not to say that the Nortels are on the way out quite yet. Our Torque and Maui server, svr016, still resides on older Nortel equipment in stack02, which is currently connected to the new Dell infrastructure by a 10 Gig fibre. Because this link occasionally saturates, we have decided to upgrade it to 20 Gigabits by running a new multimode fibre between the two computer rooms, 141 and 243d. We also decided to implement Layer 2 QOS for svr016 to ensure that it gets priority over all other cluster traffic within the stack and through the core network switches.
Therefore, we embarked on the reconfiguration of the QOS parameters on stack02. The complexity behind this lies not in the actual end configuration: effectively, the MAC address of svr016 is tracked across VLANs 1 and 2 to ensure that a Gold Quality of Service is met for any device wishing to speak to, or be spoken to by, svr016. The real complexity is implementing this without disabling the entire cluster attached to the network stack.
Earlier implementations of the Nortel OS had a nasty tendency to drop all non-specified traffic within the network. The QOS policy generation, while incredibly granular in its ability to tag and filter traffic, involves six different stages to ensure that traffic is correctly tagged and forwarded.
Add to this the fact that, if the MAC address does not have the correct MAC address mask, all traffic generated by svr016 will be dropped, effectively disabling the cluster for a period of time, and you get a general picture of the care required on our part to implement this feature.
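To illustrate why the mask matters, here is a minimal Python sketch of mask-based MAC matching. It is not the Nortel policy syntax: the MAC addresses, the mask values and the function names are all hypothetical, and it assumes the common convention that a set bit in the mask means "compare this bit". The switches' own convention may differ.

```python
# Toy model of a MAC + mask classifier. All addresses and names are
# hypothetical; this is not how the Nortel stack is actually configured.

def mac_to_int(mac: str) -> int:
    """Convert 'aa:bb:cc:dd:ee:ff' into a 48-bit integer."""
    octets = mac.split(":")
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets in {mac!r}")
    value = 0
    for octet in octets:
        byte = int(octet, 16)
        if not 0 <= byte <= 0xFF:
            raise ValueError(f"octet out of range in {mac!r}")
        value = (value << 8) | byte
    return value


def mask_matches(frame_mac: str, rule_mac: str, rule_mask: str) -> bool:
    """Compare only the bits selected by the mask (assumed: 1 = compare)."""
    mask = mac_to_int(rule_mask)
    return (mac_to_int(frame_mac) & mask) == (mac_to_int(rule_mac) & mask)


def classify(src: str, dst: str, rule_mac: str, rule_mask: str) -> str:
    """Return 'gold' if either end of the frame matches the rule."""
    if mask_matches(src, rule_mac, rule_mask) or mask_matches(dst, rule_mac, rule_mask):
        return "gold"
    return "default"


SVR016_MAC = "00:11:22:33:44:55"   # hypothetical address for svr016
FULL_MASK = "ff:ff:ff:ff:ff:ff"    # compare every bit: exactly one host
PREFIX_MASK = "ff:ff:ff:00:00:00"  # compare first three octets only

worker = "aa:bb:cc:dd:ee:01"       # hypothetical worker node
neighbour = "00:11:22:99:99:99"    # different host sharing the prefix

print(classify(worker, SVR016_MAC, SVR016_MAC, FULL_MASK))
print(classify(worker, neighbour, SVR016_MAC, FULL_MASK))
print(classify(worker, neighbour, SVR016_MAC, PREFIX_MASK))
```

With the full mask only frames to or from the one address are classified as gold. With the shorter mask the rule also catches an unrelated host, which shows how a single wrong mask changes which traffic a policy applies to.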
Sam and I rechecked the configurations twice before attempting to implement them. However, when we attempted to commit, we discovered that the Nortel GUI is a lot more thorough in its checks than we could ever have imagined. Due to a misconfiguration of the MAC address mask, the system refused to commit the configuration to the switches. It even supplied an error message identifying that the mask was wrong.
Once the mask had been corrected the configuration was loaded onto stack02 and immediately started to work. The image below shows the packet matching since the 30th of June 2011.
Now for the real test. How would it cope under increased DPM traffic loads?
Surprisingly well, it turns out, as all traffic to and from svr016 now has a low drop status and a high precedence value across the network.
The images below show the system performance during one of these recent events.
As can be seen, there is no real increase in activity, as the QOS mappings for svr016 mean that, while it is still part of the production and external VLANs, its traffic always travels 1st class.
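As a rough illustration of what "1st class" means on a congested link, the following Python sketch models a single port that can only forward a fixed number of frames per interval. It is a toy model under stated assumptions, not the scheduler the switches actually use: the frame labels, the two class names and the `drain_port` function are hypothetical.

```python
# Toy strict-priority model: gold frames are sent first, and whatever
# does not fit in the interval is dropped. Names are hypothetical.

def drain_port(frames, capacity):
    """Split frames into (sent, dropped) for one interval.

    frames   -- list of (label, service_class) tuples in arrival order
    capacity -- number of frames the link can forward this interval
    """
    gold = [f for f in frames if f[1] == "gold"]
    other = [f for f in frames if f[1] != "gold"]
    ordered = gold + other          # gold jumps the queue, order kept within class
    return ordered[:capacity], ordered[capacity:]


arrivals = [
    ("dpm-1", "default"),
    ("svr016-1", "gold"),
    ("dpm-2", "default"),
    ("svr016-2", "gold"),
]

sent, dropped = drain_port(arrivals, capacity=3)
print("sent:   ", [label for label, _ in sent])
print("dropped:", [label for label, _ in dropped])
```

In this model the svr016 frames are always forwarded while the link has any capacity at all, and only default-class frames are lost when it saturates.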
The next phase of QOS development is to start to investigate the corralling of network broadcasts for services such as NFS to see if we can reduce the background chatter on the network without impacting service.