Thursday, August 25, 2011

Busy Disks

After checking a test 10 Gig disk server deployment, we uncovered an interesting pattern in storage network activity and in how our 10 Gig switch copes with multiple connections at 10 Gigabit. The captures below were taken over a 5 minute window of operation and show just how bursty the traffic patterns from these devices can be.

The graphs show all interfaces on our Dell 8024F, with throughput measured in Mbps. The captures are ordered top to bottom, with the initial capture at the top.
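For anyone wanting to reproduce this sort of per-port measurement, the sketch below shows one way to do it: sample the switch's 64-bit interface octet counters over SNMP twice and convert the deltas to Mbps. The hostname, community string and interface range are placeholders rather than our real values, and it assumes the snmpget tool from net-snmp is installed.

# Minimal sketch: per-port inbound throughput in Mbps from two SNMP samples.
# SWITCH, COMMUNITY and IFINDEXES are illustrative placeholders.
import subprocess
import time

SWITCH = "switch.example.org"   # placeholder for the switch's address
COMMUNITY = "public"            # placeholder read-only community string
IFINDEXES = range(1, 25)        # placeholder interface index range
INTERVAL = 30                   # seconds between the two samples

def in_octets(ifindex):
    """Read IF-MIB::ifHCInOctets for one interface via snmpget."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", SWITCH,
         "IF-MIB::ifHCInOctets.%d" % ifindex])
    return int(out.strip())

first = {i: in_octets(i) for i in IFINDEXES}
time.sleep(INTERVAL)
second = {i: in_octets(i) for i in IFINDEXES}

for i in IFINDEXES:
    mbps = (second[i] - first[i]) * 8 / (INTERVAL * 1e6)
    print("port %2d: %8.1f Mbps in" % (i, mbps))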




While the disk servers have been hammering away, the intra-room round trip time between devices has averaged 0.40 msec, and the core Dell seems more than happy to handle these loads, with its CPU utilisation currently at approximately 20%.

We plan to enable QoS marking on disk server traffic shortly, to compare response times between QoS and non-QoS disk servers.
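As a first pass, the comparison we have in mind looks something like the sketch below: ping a QoS-marked and an unmarked disk server while the network is busy and compare the average round trip times. The hostnames are placeholders, and it simply wraps the standard Linux ping.

# Rough sketch: compare average RTT to a QoS and a non-QoS disk server.
# The hostnames are placeholders for real disk servers.
import re
import subprocess

def avg_rtt_ms(host, count=50):
    out = subprocess.check_output(
        ["ping", "-c", str(count), "-q", host], universal_newlines=True)
    # Summary line looks like: rtt min/avg/max/mdev = 0.312/0.401/0.957/0.081 ms
    match = re.search(r"= [\d.]+/([\d.]+)/", out)
    if match is None:
        raise RuntimeError("could not parse ping output for %s" % host)
    return float(match.group(1))

for server in ("disk-qos.example.org", "disk-noqos.example.org"):
    print("%s: %.3f ms average RTT" % (server, avg_rtt_ms(server)))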


News Flash from ScotGrid Labs

In my last post, we discussed investigating deployments of IPv6 on the test cluster, the first of which uses SLAAC to assign addresses to hosts. Interestingly enough, it worked first time out of the tin.
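For those unfamiliar with SLAAC, the host builds its own address by taking the /64 prefix advertised by the router and appending an EUI-64 interface identifier derived from its MAC address (flip the universal/local bit, insert ff:fe in the middle). The small sketch below illustrates the derivation; the MAC shown is the one implied by the address in the traceroute further down, so treat it as illustrative.

# Illustration of the EUI-64 SLAAC address derivation.
# The MAC below is inferred from the traceroute destination address.
import ipaddress

def slaac_address(prefix, mac):
    octets = [int(x, 16) for x in mac.split(":")]
    octets[0] ^= 0x02                             # flip the universal/local bit
    eui64 = octets[:3] + [0xff, 0xfe] + octets[3:]  # insert ff:fe in the middle
    iid = int.from_bytes(bytes(eui64), "big")
    net = ipaddress.IPv6Network(prefix)
    return ipaddress.IPv6Address(int(net.network_address) | iid)

print(slaac_address("2001:630:40:ef0::/64", "00:30:48:5a:04:b7"))
# -> 2001:630:40:ef0:230:48ff:fe5a:4b7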

An IPv6 Traceroute from the web is shown below:

traceroute to 2001:630:40:ef0:230:48ff:fe5a:4b7 (2001:630:40:ef0:230:48ff:fe5a:4b7), 30 hops max, 40 byte packets
 1  2001:1af8:4200:b000::1 (2001:1af8:4200:b000::1)  1.600 ms  1.813 ms  1.882 ms
 2  2001:1af8:4100::5 (2001:1af8:4100::5)  1.320 ms  1.392 ms  1.465 ms
 3  be11.crs.evo.leaseweb.net (2001:1af8::9)  2.587 ms  2.631 ms  2.619 ms
 4  linx-gw1.ja.net (2001:7f8:4::312:1)  8.475 ms  8.466 ms  8.453 ms
 5  ae1.lond-sbr4.ja.net (2001:630:0:10::151)  78.338 ms  78.388 ms  78.376 ms
 6  2001:630:0:10::109 (2001:630:0:10::109)  9.900 ms  9.479 ms  9.446 ms
 7  so-5-0-0.warr-sbr1.ja.net (2001:630:0:10::36)  13.320 ms  13.196 ms  13.317 ms
 8  2001:630:0:10::296 (2001:630:0:10::296)  18.705 ms  18.542 ms  18.793 ms
 9  clydenet.glas-sbr1.ja.net (2001:630:0:8044::206)  18.947 ms  18.931 ms  18.948 ms
10  2001:630:42:0:3e::9a (2001:630:42:0:3e::9a)  19.434 ms !X  18.214 ms !X  17.682 ms !X


The next phase of testing will be to enable a webserver to speak both IPv4 and IPv6 using this access mechanism, and then to move on to Grid services.
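As a rough idea of what that webserver test looks like, the sketch below is a minimal dual-stack HTTP server in Python: on Linux, a single AF_INET6 socket with IPV6_V6ONLY switched off will also accept IPv4 clients as v4-mapped addresses. The port is arbitrary and this is a sketch rather than the service we will actually deploy.

# Minimal dual-stack HTTP server sketch: one IPv6 socket, IPv4 via mapped addresses.
import socket
from http.server import HTTPServer, SimpleHTTPRequestHandler

class DualStackHTTPServer(HTTPServer):
    address_family = socket.AF_INET6

    def server_bind(self):
        # Allow IPv4 connections on the same socket (v4-mapped addresses).
        self.socket.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
        super().server_bind()

if __name__ == "__main__":
    # "::" listens on all IPv6 (and, with V6ONLY off, all IPv4) addresses.
    httpd = DualStackHTTPServer(("::", 8080), SimpleHTTPRequestHandler)
    print("Serving on [::]:8080 for both address families")
    httpd.serve_forever()

Hitting the same hostname with curl -4 and curl -6 should then exercise both paths.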


I will post up a more detailed explanation of the mechanisms used for this soon.

Tuesday, August 23, 2011

Two Stacks are better than one

Leading on from the last post, we have also re-introduced a test cluster. This infrastructure is housed within the same rack as our old worker nodes but is completely independent of the production cluster. A Dell 8024F is supported by five servers and a Dell 5000 series switch, which are connected via an independent 1 gigabit fibre link to the University's network.

The purpose of this cluster is to test IPv4/IPv6 dual-stack connectivity for Grid services, switch-based security mechanisms and SL6 NAT, without fear of impacting the real cluster.

The IPv6 connectivity model testing will be in multiple phases which include:

* SLAAC
* IPv6 to IPv4 tunneling
* IPv6 Routing


This framework is designed to comply with the HEPiX IPv6 project and to look at the possible connection models required by Tier-2s to utilise IPv6. Additionally, we will be testing a wide variety of Grid-enabled applications and associated systems, such as Nagios, to investigate potential issues within a dual-stack deployment.
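The kind of per-service check we expect to script (and eventually wrap in Nagios) looks something like the sketch below: resolve an endpoint over both address families and attempt a plain TCP connect to each result. The hostname and port are placeholders for real Grid service endpoints.

# Sketch of a dual-stack reachability check against a service endpoint.
# The hostname and port below are placeholders.
import socket

def check_endpoint(host, port):
    for family, name in ((socket.AF_INET, "IPv4"), (socket.AF_INET6, "IPv6")):
        try:
            infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
        except socket.gaierror:
            print("%s: no %s address" % (host, name))
            continue
        for *_, sockaddr in infos:
            with socket.socket(family, socket.SOCK_STREAM) as s:
                s.settimeout(5)
                try:
                    s.connect(sockaddr)
                    print("%s %s %s: connect OK" % (host, name, sockaddr[0]))
                except OSError as err:
                    print("%s %s %s: %s" % (host, name, sockaddr[0], err))

check_endpoint("gridservice.example.org", 8443)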

More on this soon.

Night of the Return of the Living Worker Nodes

As Glasgow is currently being used as one of the sets for World War Z, we thought it only apt that we too resurrect the dead and get them to do our bidding. No, we haven't embraced "mad" science.

During the power work we decided to alter the layout of 243d. Historically, the room had housed a mainframe, including operators' booths. One of these booths still existed within 243d, so we took down one of the walls and added a new cabinet.

While the work was being conducted to remove the wall, we covered the cluster and powered it off to minimise dust ingestion. If you ever wish to gift-wrap a cluster, we have plenty of experience in this field; however, our wrapping is limited to blue plastic at present.



After the wall had been removed, we cleared out the computer room and re-organised the storage cabinets, cabling and computing cabinets. In 243d there was a pile of six-year-old disused worker nodes, along with racked worker nodes whose PDU had been damaged during one of our many power cuts over the last 12 months. In addition to this, we found and rebuilt a Dell rack and also had a spare Nortel 5510 switch.




With the newly available space from the removal of the wall in 243d, we got a tile cut and deployed the rack. The rack connects back to the older Stack01 via a copper gigabit Ethernet connection. This deployment will give us up to approximately 100 job slots once the nodes are fully configured.




Friday, August 12, 2011

Running at capacity again


... after the shutdown. Slightly delayed due to coming back during a low point in ATLAS work, which is now past us.

Here's a graph of data moved from our storage element, and you can probably pick out the rather subtle peak when the last batch of analysis traffic started (taking us up to capacity):


Wednesday, August 10, 2011

Power startup, situation (hopefully) normal

The planned power work in the Kelvin Building was completed this morning and we have been transferred back to our proper power feed from the generators. The power startup went smoothly and the building has returned to normal.

The Scotgrid cluster was restarted after the power was seen to be stable and we came out of downtime at 2.20 pm. We will monitor our situation, but we hope that this power work will improve our stability over the coming months.

Wednesday, August 03, 2011

Controlled Shut Down. Please stand by.

As many regular readers of our blog may have noticed, we have had several power cuts over the last 8 months. While the ScotGrid Glasgow cluster has survived these interruptions relatively well, the School of Physics and Astronomy at the University of Glasgow has undertaken a piece of work to resolve this recurrent issue.

Therefore, from the 7th to the 10th of August we will be going into a controlled downtime so that the transformers which supply the mains feed to our site can be removed and upgraded.

We should be back in action on the morning of Wednesday the 10th.