Wednesday, March 23, 2011

ScotGrid Reloaded

As it is spring, we have decided to revamp the blog.
We will be updating the blog over the next couple of weeks and tinkering with the layout.
Please Stand By.

Spanning Tree, oh Spanning Tree

Following last week's power outages we encountered issues with Spanning Tree reconvergence on our older switching equipment. The Nortel 5510 and 5530 switches, which have been stalwarts of the Glasgow cluster installation, had shown a sharp rise in the number of BPDUs being transmitted since the second power outage, as well as an increase in dropped packets across all interfaces. The two issues are partially inter-related: the switches suffered a partial loss of configuration during the second power outage, which caused several services, including the NTP client and Spanning Tree, to behave erratically. To resolve the Spanning Tree issue, the protocol configuration on the Nortel switches was returned to its defaults, shown below:

Hello Time:                 2 seconds
Maximum Age Time:           20 seconds
Forward Delay:              15 seconds
Bridge Hello Time:          2 seconds
Bridge Maximum Age Time:    20 seconds
Bridge Forward Delay:       15 seconds
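These defaults are not arbitrary: the IEEE 802.1D standard requires the three timers to satisfy 2 × (Forward Delay − 1) ≥ Max Age ≥ 2 × (Hello Time + 1). The sketch below (our own check script, not Nortel CLI syntax; the variable names are ours) verifies that the values above are consistent:

```shell
#!/bin/sh
# Sanity-check the 802.1D spanning tree timers restored above.
# Variable names are illustrative; values are the Nortel defaults.
HELLO=2        # Hello Time, seconds
MAX_AGE=20     # Maximum Age Time, seconds
FWD_DELAY=15   # Forward Delay, seconds

# 802.1D constraint: 2*(FwdDelay - 1) >= MaxAge >= 2*(Hello + 1)
if [ $((2 * (FWD_DELAY - 1))) -ge "$MAX_AGE" ] && \
   [ "$MAX_AGE" -ge $((2 * (HELLO + 1))) ]; then
    echo "timers consistent"
else
    echo "timers inconsistent"
fi
```

With the default values this prints "timers consistent"; timers that violate the relation can cause exactly the kind of unstable reconvergence and BPDU churn we saw.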


This stabilised the switches within the older cluster and reduced the volume of BPDUs being sent to the core switch.

An overview of the Spanning Tree Protocol is available here: http://en.wikipedia.org/wiki/Spanning_Tree_Protocol

The second issue, the dropped packets and pause frames, was again related to the power outage, which appears to have left several dozen worker nodes unable to communicate properly across the switch environment. This was improved by off-lining the affected nodes and rebooting them after the network reset.

We are still monitoring the situation and will report on any other action taken if required.

Monday, March 21, 2011

Power Issues Redux

On the 15th of March we encountered two power outages in the campus supply at Glasgow University. We had to put ourselves into downtime and remove ourselves from ATLAS production to effect a recovery from these power cuts. Although the UPS infrastructure held up, we thought it prudent not to expose our user community to potential disruption.
The root cause of these outages has now been repaired and we came out of downtime on Thursday the 17th of March.

Wednesday, March 02, 2011

Wide Area Wonder

After several months of investigating asymmetric traffic flows between Glasgow and RAL, we finally appear to have resolved the issue. Working with Computing Services staff at the University of Glasgow and GridPP staff at RAL, we are now seeing sustained simultaneous transfer speeds of around 2.3 Gb/s both inbound and outbound.

The commands run for the tests are shown below (note that in iperf the -s instance is the server, which receives the data, and the -c instance is the client, which sends it; -d runs the test in both directions at once):

iperf -s -u -p 5001 -w 2M (server command, receives data)
iperf -d -u -p 5001 -t 600 -w 1M -c hostname -b 700M -i 30 (client command, sends data)
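As a rough sanity check on the scale of these tests, the sketch below (our own arithmetic, not iperf output) computes how much data a 600-second run at the 700 Mbit/s target rate moves in each direction:

```shell
#!/bin/sh
# Volume of data moved per direction by one test run at the target rate.
RATE_MBIT=700    # -b 700M: offered UDP rate, Mbit/s
DURATION_S=600   # -t 600: test duration, seconds

# 700 Mbit/s * 600 s = 420000 Mbit; divide by 8 for megabytes
TOTAL_MBIT=$((RATE_MBIT * DURATION_S))
TOTAL_MB=$((TOTAL_MBIT / 8))
echo "${TOTAL_MB} MB per direction"
```

That is roughly 52.5 GB per direction per run, which is why we schedule these tests carefully and watch the interface counters while they run.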

[Graph: associated network interface card and CPU loads on the device where one of the tests was run]
Effectively, the Glasgow site is now an extension of the Clydenet to JANET infrastructure in the west of Scotland, and we will be monitoring the services over the next month to ensure that this network solution is as stable and reliable as the previous interconnection.

In addition to this work, over the next three months we will be investigating optimisation of the Layer 2 and Layer 3 network infrastructure between Glasgow, the University and the rest of GridPP.