Tuesday, April 17, 2012

One XOS, Great Big Purple Packet Eater. Sure looks good to me.

So we haven't been blogging a great deal since December, and for good reason. We found ourselves in the exciting position of being given additional funding to enhance our network capability, and we also had additional equipment to install into the cluster.

First things first, however: as you may have read, we have had no end of issues with the older network equipment. We had a multi-vendor environment which, while adequate for 800 analysis jobs and 1200 production jobs, didn't quite cut the mustard as we couldn't expand from there.

The main reason was the 20 Gig link between the two computing rooms, which was having real capacity issues. Add in problems with the LAG between the Dell and Nortel kit and the associated back-flow issues, sprinkle in a buffer memory problem on the 5510s, and you get the picture. On top of this we were running out of 10 Gig ports and therefore couldn't get much bigger without some investment.

The grant award was therefore a welcome opportunity to fix this. After going to tender we decided upon equipment from Extreme Networks. The proposed solution allowed for a vast 160 Gigabit interconnect between the rooms, broken into two resilient link bundles in the Core, and an 80 Gigabit Edge layer. Alongside this we also installed a 32 core OM4 grade fiber optic network for the cluster, which will carry us into the realm of 100 Gigabit connections when they become available and cheap enough to deploy sensibly.

We now have 40 x 40 Gigabit ports, 208 x 10 Gigabit ports and 576 x 1 Gigabit ports available for the Cluster.

It's quick and clever, and here it is

The new deployment utilises X670s in the Core and X460s at the Edge.

The magic of the new Extreme network is that it uses EAPS (so bye bye Spanning Tree, and good riddance) as well as MLAG, which allows us to load-share traffic across the two rooms, so having 10 Gigabit connections for disk servers in only one room is no longer an issue.

Then it got a bit better. Thanks to the Extreme OS we can now write scripts to handle events within the network, which ties in with the longer-term plan for a Cluster Expert System (ARCTURUS) that we are currently designing for test deployment. More on this after August, but a rough sketch of the idea follows below.
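To give a flavour of what that event handling might look like, here is a minimal, illustrative Python sketch of the sort of thing we have in mind: a small listener on a monitoring host that picks up syslog messages forwarded by the switches and reacts to link-down events. To be clear, this is not ARCTURUS and not Extreme's own on-switch scripting; the switch addresses, port-to-node mapping and the "offline" action are all made up for illustration.

```python
# Illustrative sketch only: react to switch syslog messages on a monitoring host.
import re
import socketserver

# Hypothetical mapping of "switch-ip:port" to the worker node hanging off that port.
PORT_TO_NODE = {
    "10.0.1.1:12": "node042",
    "10.0.1.1:13": "node043",
}

# Hypothetical pattern for a link-down message from a switch.
LINK_DOWN = re.compile(r"Port (\S+) link down")

def offline_node(node):
    # Placeholder action: in a real system this might drain the node
    # from the batch system or raise a ticket.
    print(f"Would mark {node} offline in the batch system")

class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For a UDP server, self.request is (data, socket).
        message = self.request[0].decode(errors="replace")
        match = LINK_DOWN.search(message)
        if match:
            switch_ip = self.client_address[0]
            port = match.group(1)
            node = PORT_TO_NODE.get(f"{switch_ip}:{port}")
            if node:
                offline_node(node)

if __name__ == "__main__":
    # Listen on UDP 514 for syslog forwarded by the switches (needs privileges).
    with socketserver.UDPServer(("0.0.0.0", 514), SyslogHandler) as server:
        server.serve_forever()
```

The real system will obviously be rather more involved, but the principle is the same: the network tells us what happened and the cluster reacts automatically instead of waiting for a human.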

Finally, it even comes with its own event monitoring software, Ridgeline, which gives a GUI interface to the whole deployment.

We stripped out the old network, installed the new one and, after some initial configuration problems which were fixed in a most awesome fashion by Extreme, got it up and running. What we can say is that the network isn't a problem anymore, at all.

This has allowed us to start concentrating on other issues within the Cluster and to look at the finalised deployment of the IPv6 test cluster, which has benefited in terms of hardware from the new network install. Again, more on this soon.

Right, so now to the rest of the upgrade: we have also extended our cold aisle enclosure to 12 racks, have a secondary 10 Gig link onto the Campus being installed, and have a UPS. In addition to this we refreshed our storage using Dell R510s and M1200s, as well as buying 5 Interlagos boxes to augment the worker node deployment.

 The TARDIS just keeps growing

We also invested in an experimental user access system with wi-fi and will be trying this out in the test cluster to see if a wi-fi mesh environment can support a limited number of grid jobs.  As you do.

In addition to this we improved connectivity for the research community in PPE at Glasgow and across the Campus as a whole, with part of the award being used to deliver the resilient second link and associated switching fabrics.

It hasn't been the most straightforward process: the decommissioning and deployment work was complex and very time-consuming, as we tried to keep the cluster up and running for as long as possible and to minimise downtime.

We didn't quite manage this as well as expected, due to the configuration issues on the new network, but we have now upgraded the entire network and removed multiple older servers from the cluster, allowing us to enhance the entire batch system for the next 24 to 48 months.

As we continue to implement additional upgrades to the cluster we will keep you informed.
For now it is back to the computer rooms.
