Thursday, July 28, 2011

Circuits, Circuits everywhere but not a drop to switch

Since the late afternoon of the 26th of July we have been working to resume service on the Cluster at Glasgow.
We were put into unexpected downtime by our old friend, the power cut.

The root cause of this appears to be that the local mains supply into the site failed and was subsequently reinstated. However, we decided to hold off restarting the cluster until Wednesday morning, to ensure that there was a clean and stable supply into the site. So it was off to the GOCDB to announce the unscheduled downtime, and on we went.

While normally we would have immediately started on getting the cluster back online, as it turned out we couldn't have got ourselves back into production any sooner due to the residual issues caused by the power outage. As we have had several power interruptions at the site over the last 10 months, we now have a reasonably robust restart procedure, and we started on it on Wednesday morning.

Initially, we had absolutely no issues surrounding the reset of both rooms, bar the loss of a rather expensive 10 Gig Ethernet interface on one of the new Dell switches and the loss of that switch's configuration files, the latter caused by yours truly not running a copy run start on the switch after configuring a LAG group and QoS. We reconfigured the switch and all connectivity across the cluster was confirmed as good.
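(For the record, "copy run start" is just shorthand for committing the running configuration to the startup configuration so that it survives a reboot; on a Cisco-style CLI like the Dell's it expands to something along the lines of

copy running-config startup-config

run from privileged mode.)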

We then proceeded to rebuild one of our internal stacks to free up the 10 Gig interfaces on a Nortel 5530, which we had planned to move to our lower server room to build out the second 10 Gig link mentioned in a previous post. This too went surprisingly well, although Dave and I had pretested building the stack, adding and removing devices, and inserting new base units on older test equipment.

We then retested everything again: stacking and LAGs were working fine, spanning tree was happy and the cluster's network was in good shape. We then moved on to phase 2 of the upgrade, which was to insert the 5530 into the switch stack in the downstairs server room. After we inserted the switch into the stack, it came up, the entire stack stabilised and it then started to forward traffic.

However, about 3 minutes later we started to see the latency on the network rise and hosts fail to contact one another. Ping, SSH and normal cluster network traffic such as NFS, NTP and DNS also started to experience issues. We reduced the load on the network by detaching hosts from it, but to no avail. We then removed the 5530 from the stack, but the problem remained. Over the next 4 hours we tried a variety of tests, all of which ended with either the dreaded Host Unreachable or 142 millisecond response times. To make matters more confusing, the switches were reporting an internal response time between rooms of 0.50 milliseconds via ping, yet telnet and SSH between devices were also timing out.

As we were unable to ascertain the exact root cause, we called a break and went and got some air.

20 minutes and one pizza slice later, it occurred to me that if no device on the network was generating traffic at the volume required to produce a 94% packet loss scenario across multiple 10 Gig connections, then it had to be the network itself. Or rather, what was attached to it.

The cooked 10 Gig interface wasn't the cause, as it was dead at this point, but the power cut had left another present:

Damaged Ethernet Cables.

As the cluster is too large to go round and manually check every cable individually with a line tester, we did something that I, as a former telco engineer, don't like doing: we rebooted the switches in numbered sequence, starting with Stack01.

The purpose of this test is to isolate the damaged cable, device or interface as quickly as possible, by pinging across the cluster from one room to another, and intra-switch if need be.

So Ping from Svr001 (upstairs) to Node141 (downstairs).
Destination Host Unreachable.
Leave the ping running.
Reboot Stack01.
Ping response time of 0.056 milliseconds.
Stack01 reloads.
Destination Host Unreachable.

We repeated this test twice. And got the same result.
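For anyone wanting to repeat this kind of test, the shape of it in shell form is roughly the following (hostnames as above; the timestamping is my own addition so that the brief windows of connectivity stand out in the scrollback):

# run from svr001 (upstairs); node141 is a downstairs worker node
# prefix each ping reply with a timestamp so reload windows are easy to spot
ping node141 | while read line; do echo "$(date +%T) $line"; done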

So on to Stack01. The partner switch, which trunks into this stack to effect an uplink onto the core of our network, did not report any errors on the multi-link trunk, but also very little traffic. Neither did Stack01, until I tried to ping its loopback address from the partner switch: the error rate on the interfaces increased and CRC errors were recorded on the counters. So we systematically disabled the multi-link trunk, link by link, until the stack interconnect stabilised.

This reduced the trunk's capacity substantially, but it also stabilised the network. So we added the 5530 back into the stack downstairs, turned the partner ports upstairs back on and were rewarded with a 20 Gig backbone, which is now operational at the Glasgow site.

As for the old LAG connection, it was stripped out completely this morning, and by early afternoon we had reinstated a 6 Gig connection to Stack01, which is working happily. From there we brought the site out of downtime and we are back on the Grid.

We are also putting in place an internal TFTP process for backing up the switch configurations each night.
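The shape of that process will be something like the sketch below; the hostnames are placeholders and it assumes the switches will take a command over SSH, so treat it as an outline rather than the finished job:

#!/bin/bash
# nightly cron job: copy each switch's running config to our internal TFTP host
# "tftp-host" and the switch names below are placeholders
TFTP_HOST=tftp-host
for sw in stack01 stack02 dell01; do
    ssh admin@$sw "copy running-config tftp://$TFTP_HOST/$sw-$(date +%F).cfg"
done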

The main lesson from this is that in a large layer 2 environment, the smallest issue can become a major one. Plans are well advanced on the next set of configuration changes to the network at Glasgow to get around this and other potential issues in the future.






Thursday, July 14, 2011

We make knowledge possible

Just a quick blog post regarding the WLCG workshop held at DESY in Hamburg from the 11th to the 13th of July.
The various presentations covered aspects of all the experiments  and the future requirements for systems, storage, monitoring and networks.
Links to the workshop agenda and content can be found here:
https://indico.cern.ch/conferenceDisplay.py?confId=124407

Monday, July 11, 2011

Everyone's doing a brand new filesystem now: Come on, baby, do the cvmfs now.

Ever since I heard about it at CHEP 2010, I've been itching to get CVMFS set up at Glasgow, because it was so clearly a better solution for software provision than the old sgm-role / NFS-mounted area approach.
Concerns about the reliability of the hardware that the service was running on (it may still not be on production hardware at CERN as I write this) always held the more sensible minds here back, but now that it's all up and working at RAL, and RAL is providing a stratum-1 cache as a backup, there's nothing stopping us.

So, following a combination of Ian Collier's description of the set-up at RAL and the official CernVMFS technical report (pdf), with some adjustments for our Cfengine config, I spent some of last week getting cvmfs working on the cluster.

For your edification, this is what I did:

1) First, set up the new repository you need. In our case, yum repositories (and gpg keys) are managed by cfengine, so, in our cfengine skel directory for the worker nodes, I added:

wget http://cvmrepo.web.cern.ch/cvmrepo/yum/cernvm.repo -P ./skel/workers/etc/yum.repos.d/
wget http://cvmrepo.web.cern.ch/cvmrepo/yum/RPM-GPG-KEY-CernVM -P ./skel/workers/etc/pki/rpm-gpg/


2) Fuse and cvmfs both want to have user and group entries created for them. We manage users and groups with cfengine, so I added a fuse group to /etc/group and a cvmfs user and group. The cvmfs user also needs to be added as a member of the fuse group.
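If you were doing this by hand rather than via a configuration management system, the equivalent is roughly:

groupadd fuse
groupadd cvmfs
useradd -g cvmfs -G fuse -s /sbin/nologin cvmfs

(the nologin shell being a preference rather than a cvmfs requirement).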

3) Now that the initial set-up bits are done, the new packages can be installed, again, using cfengine. I added the packages
fuse ; fuse-libs ; cvmfs ; cvmfs-keys ; cvmfs-init-scripts

to the default packages for our worker node class in cfengine.
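By hand, with the cernvm repository from step 1 in place, that is just:

yum install fuse fuse-libs cvmfs cvmfs-keys cvmfs-init-scripts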

4) Editing configuration files.
You need to edit auto.master to get autofs to support cvmfs. Just add a line like

/cvmfs /etc/auto.cvmfs

as the auto.cvmfs map is added by the cvmfs rpm. Remember to issue a

service autofs reload

afterwards, or get your configuration management system to do so automagically for you.
You also need to configure fuse to allow users to access things as other users:
/etc/fuse.conf
user_allow_other
And finally, you need to actually configure cvmfs itself. Cvmfs uses 2 main configuration files:
default.local, which specifies modifications of the default settings for the local install
cern.ch.local, which specifies modifications of the default server to use for *.cern.ch repositories.

/etc/cvmfs/default.local needs to be configured for:


CVMFS_USER=cvmfs
CVMFS_NFILES=32768
#CVMFS_DEBUGLOG=/tmp/cvmfs.log
CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch,lhcb.cern.ch,cms.cern.ch,geant4.cern.ch,sft.cern.ch
CVMFS_CACHE_BASE=/tmp/cache/cvmfs2/
CVMFS_QUOTA_LIMIT=10000
CVMFS_HTTP_PROXY="nameoflocalsquid1|nameoflocalsquid2"


/etc/cvmfs/cern.ch.local, for UK sites, should probably be configured as:


CVMFS_SERVER_URL="http://cernvmfs.gridpp.rl.ac.uk/opt/@org@;http://cvmfs-stratum-one.cern.ch/opt/@org@"


(since RAL is closer to us than CERN).

A brief note: ';' in a list of options specifies failover, and '|' load-balancing. So "foo;bar" means "try foo, then bar", while "foo|bar;baz" means "try to load-balance queries between foo and bar, if that fails, try baz". This works for the squid proxy specifiers in default.local and also the server destinations in cern.ch.local .

Another note: the cache directory specified in default.local should be large enough to actually cache a useful amount of data on each worker node. 10GB per VO is reported to be comfortably enough for ATLAS and LHCb, and is therefore probably wildly exorbitant for any other VO that would be using it. I've tested, and you can happily set this directory to be readable only by the cvmfs user, which gives you a tiny bit more security.
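One way to do that by hand, using the cache path from our default.local above, is:

mkdir -p /tmp/cache/cvmfs2
chown -R cvmfs:cvmfs /tmp/cache/cvmfs2
chmod 700 /tmp/cache/cvmfs2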

If you change the configuration files for cvmfs, you need to get it to reload them, like autofs.

service cvmfs reload

seems to work fine (and our cfengine config now does this if it has to update those config files).

In our case, I created the two config files, stuck them in the skel directories for worker nodes in cfengine, and added them to the list of files that are expected to be on worker nodes in the config.



5) You can check that all this is working by trying a

service cvmfs probe

or by explicitly mounting a cvmfs path somewhere outside of automount's config. With the default config, atlas software is at /cvmfs/atlas.cern.ch and so on.
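So a quick sanity check on a worker node looks something like:

service cvmfs probe
ls /cvmfs/atlas.cern.ch

with the second command forcing autofs to mount the atlas repository and list its top level.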

Friday, July 01, 2011

A switch port too far

As part of the ongoing upgrades surrounding the recent issues that the CEs have had when communicating with svr016, we decided to upgrade the core backbone link to 20 Gigabits. Presently, we have one 10 Gigabit trunk link between 141 and 243d, which is occasionally saturating with traffic.

As previously posted, we disabled the 10 gigabit link into Stack01 and used the XFP GBIC recovered from it to facilitate this new link. Sam and I laid new fiber optic patch leads in both rooms to the patch panels and connected these to spare ports on the Core Dell 8024F and Stack02's 5530.

However, the link refused to come up. After several hours of investigation we acquired a fiber optic line tester, which proved that light was coming through the new link. We then tested the ports on both switches with a fiber optic loop.

While the port and GBIC in the 8024F looped correctly (you get a rather reassuring green link light on the transmit and receive ports), the port in Stack02 failed. We retested the XFP in its old unit, Stack01, and it came up correctly using the loop.

We are using 62.5 um patch leads which, under the standards, can't be driven as far as 50 um, so we thought this may have been the issue; however, we confirmed that this wasn't the case by re-testing all the components end to end with the fiber optic meter.

We cleaned out the interface slot on the Stack02 5530 with compressed air and isopropyl alcohol, but the port, while recognising the GBIC correctly, did not bring up the link. We fear that the on-board optical interface is damaged; however, we would need to put the site into downtime to confirm this, so we have come up with a Plan B.

As we have successfully built a LAG between 141 and 243d, which is in place and did not impact service at all during its commissioning, and have laid in the fiber interconnect, we have decided to investigate moving our second 5530 from Stack01 into Stack02 to give us the 20 Gigabit uplink that we require within the core of the network.

More on this after the move. 

As an aside, you never know how windy cold aisles are, until you lift a floor tile. Sam is on the floor in this image and not glued to the ceiling as his hair direction may imply.










And after studying its behaviour, objectively and critically, we believe we have a reliable method (With apologies to Neil Fallon)

Since the last post on the blog we have implemented a series of measures on the network which were planned to be deployed during the next Cluster refresh.

Primarily, we have migrated elements of our core servers, such as svr020, svr001 and svr008, to the new Dell switch infrastructure and have introduced a series of Link Aggregation Groups (LAGs) across the Dell estate to raise their backbone to a full 20 Gigabits per second intra-switch. This has led to the decommissioning of the core 10 Gigabit interconnect into our old Nortel gateway, Stack01, which has been replaced with another LAG between the Dells and Stack01. The reason behind this will become clear in the next post.

The main upshot of this part of the network upgrade is that we can now have greater control over the network services and monitoring running out of these servers, such as SNTP and Ganglia respectively. These can be fine-tuned to a greater degree in the Dell environment to minimise the broadcast and layer 2 multicast impact of these services.

However, that is not to say that the Nortels are on the way out quite yet. Our Torque and Maui server, svr016, still resides on older Nortel equipment in Stack02, which is currently connected to the new Dell infrastructure by a 10 Gig fibre. This link is occasionally saturating, so we have decided to upgrade it to 20 Gigabits by running a new multimode fibre between the two computer rooms, 141 and 243d. We also decided to implement layer 2 QoS for svr016 to ensure that it gets priority over all other cluster traffic within the stack and through the core network switches.

Therefore, we embarked on the re-configuration of the QoS parameters on Stack02. The complexity behind this lies not in the actual end configuration: effectively, the MAC address of svr016 is tracked across VLANs 1 and 2 respectively to ensure that a Gold quality of service is met for any device wishing to speak to, or be spoken to by, svr016. The real complexity is in implementing this so that you don't disable the entire cluster attached to the network stack.

Earlier implementations of the Nortel OS had a nasty tendency to drop all non-specified traffic within the network, and the QoS policy generation, while incredibly granular in its ability to tag and filter traffic, involves 6 different stages to ensure that traffic is correctly tagged and forwarded.

Add to that the fact that, if the MAC address entry does not have the correct mask, all traffic generated by svr016 will be dropped, effectively disabling the cluster for a period of time, and a general picture develops of the care required to implement this feature.

Sam and I rechecked the configurations twice before attempting to implement them. However, when we attempted to commit, we discovered that the Nortel GUI is a lot more thorough in its checks than we could ever have imagined. Due to a misconfiguration of the MAC address mask, the system refused to commit it to the switches. It even supplied an error message which identified that the mask was wrong.

Once the mask had been corrected, the configuration was loaded onto Stack02 and immediately started to work. The image below shows the packet matching since the 30th of June 2011.




Now for the real test. How would it cope under increased DPM traffic loads?



Surprisingly well, it turns out, as all traffic to and from svr016 now has a low drop status and a high precedence value across the network.

The images below show the system performance during one of these recent events.










As can be seen, there is no real increase in activity now, as the QoS mappings for svr016 mean that, while it is still part of the production and external VLANs, it always travels first class.

The next phase of QoS development is to start investigating the corralling of network broadcasts for services such as NFS, to see if we can reduce the background chatter on the network without impacting service.