tag:blogger.com,1999:blog-321894522024-03-13T22:10:48.562+00:00ScotGridGraeme Stewarthttp://www.blogger.com/profile/04113191724360870254noreply@blogger.comBlogger511125tag:blogger.com,1999:blog-32189452.post-78757309465284021642014-03-24T23:13:00.002+00:002014-03-24T23:13:20.192+00:00The Three Co-ordinators It has been a while since we posted on the blog. Generally, this means that things have been busy and interesting. Things have been busy and interesting.<br />
<br />
We are presently going through a redevelopment of the site, evaluating new techniques for service delivery such as using Docker containers, and updating multiple services throughout the sites.<br />
<br />
The development of the programme presented at CHEP on automation and different approaches to delivering HEP-related Grid services is underway. An evaluation of container-based solutions for service deployment will be presented at the next GridPP collaboration meeting later this month. Other evaluation work, on using Software Defined Networking, hasn't progressed as quickly as we would have liked but is still underway.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiucpkfa-3M7rKFblo6EuzJ09jBNGDAZZw4b7AOk7i3RInxTqy8YKYm0TZ59SiwqdOXC36denb7CRbEA7qVFwpawXsJ_35MWAk681v_qpEcwf1uiqMmq6y19bUf6CYCKKSIQnrGtg/s1600/IMG_6526.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiucpkfa-3M7rKFblo6EuzJ09jBNGDAZZw4b7AOk7i3RInxTqy8YKYm0TZ59SiwqdOXC36denb7CRbEA7qVFwpawXsJ_35MWAk681v_qpEcwf1uiqMmq6y19bUf6CYCKKSIQnrGtg/s1600/IMG_6526.jpg" height="240" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Graeme (left), Mark (centre) and Gareth.</td></tr>
</tbody></table>
<br />
In other news, Gareth Roy is taking over as the Scotgrid Technical Co-ordinator this month. Mark is off for adventures with the Urban Studies Big Data Group within Glasgow University. And, as in Doctor Who, Co-ordinators Past, Present and Future can all appear in the same place at the same time. <br />
<br />
Will the fabric of Scotgrid be the same again?<br />
<br />
Very much so.<br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-79489542121137059982013-10-14T13:52:00.003+01:002013-10-14T13:53:07.092+01:00Welcome to CHEP 2013Greetings from CHEP 2013 in a rather wet Amsterdam.<br />
<br />
The conference season is upon us and Sam, Andy, Wahid and I find ourselves in Amsterdam for <a href="http://www.chep2013.org/">CHEP 2013</a>. CHEP started here in 1983 and it is hard to believe that it has been 18 months since New York.<br />
<br />
As usual the agenda for the next 5 days is packed. Some of the highlights so far have included advanced facility monitoring, the future of C++ and Robert Lupton's excellent talk on software engineering for Science.<br />
<br />
As with all of my visits to Amsterdam, the rain is worth mentioning; so much so that it made the local news this morning. The venue, however, is the rather splendid Beurs van Berlage in central Amsterdam.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8om-lV3WO06O6EjVeR8UhDlOnsZVdi5NM1H-PeosjneS-4IevGSFXCVM73MN8CWKZ34MVtlyxfG96wSlm4hVZB0e7NjXfV0ZtMMcDrjzeVxugVru7UqL2EZ7qH-jwwCr87hxs-A/s1600/photo.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8om-lV3WO06O6EjVeR8UhDlOnsZVdi5NM1H-PeosjneS-4IevGSFXCVM73MN8CWKZ34MVtlyxfG96wSlm4hVZB0e7NjXfV0ZtMMcDrjzeVxugVru7UqL2EZ7qH-jwwCr87hxs-A/s320/photo.JPG" width="240" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">CHEP 2013</td></tr>
</tbody></table>
<br />
<br />
There will be further updates during the week as the conference progresses.<br />
<br />
<br />
<br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-48170839490981960592013-10-14T13:44:00.000+01:002013-10-14T13:44:04.931+01:00Busy YearWe haven't posted a great deal this year as there has been a huge amount going on within Scotgrid since January.<br />
<br />
The main news this year has been Stuart's departure from the Glasgow Scotgrid team to the University of St Andrews. Stuart's roles within EGI, ROD and Grid Ops, his work on the Glasgow site and his development work on MPI at Glasgow say a lot about his rather busy part-time role within Scotgrid. The word "factotum", from the Latin for "do everything", springs to mind when describing his input.<br />
We all wish Stuart the very best at St Andrews.<br />
<br />
The Glasgow site has suffered from issues with the coolant infrastructure since January. To mitigate this, the University is upgrading both the power and the air conditioning within the Kelvin Building. This work will include the installation of a generator and UPS system as well as new air conditioning units. This is a long-term project and will be completed by the summer of 2014.<br />
<br />
ECDF has performed incredibly well since January, and Durham, while suffering from air-con and power issues earlier in the year, is now relatively stable.<br />
<br />
We have brought in additional VOs with the MVLS group at Glasgow University and are presently in discussions with other non-HEP groups, such as biochemistry. The most technically challenging project is the proposed investigation into the Lairg magnetic anomaly by the EarthSci group at Glasgow. This project is difficult due to the lack of network connectivity in the area where the data is being generated; we will report on this soon.<br />
<br />
Our research, outside of running the sites, has covered GPU work at ECDF and more efficient data management and deployment strategies at Glasgow and ECDF. Additionally, Gareth at Glasgow has investigated how we utilise containerisation and build smarter cluster restart environments. David Crooks has done excellent work on aggregating the multiple monitoring platforms that have sprung up within the Grid, utilising the Graphite package.<br />
<br />
We have attended and presented at multiple conferences and public outreach events, including one during the Edinburgh Festival. <br />
<br />
So that brings us up to date in time for CHEP, which is on this week. We are still trying to work out how quickly the last 18 months went. <br />
<br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-48125755475608776772012-12-25T21:23:00.002+00:002012-12-25T21:23:31.695+00:00A Merry Christmas and a Happy New Year to all our followers, users and co-workers from Scotgrid Glasgow.<br />
Keeping to a Physics theme, as always.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCN51jtOiGgk72sykpMBGMi2gjOPNfhtjme9h-qZ9zre7DajwizUceMpqAdg2xT8hBh-R3JIwyk6BL9dB5lRed5blZhJ_PmcgfEmnc0ueVGWhZ1o07jKOEVC5DDq8C5QXxUBhyPg/s1600/563620_519283834759299_2115144667_n.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCN51jtOiGgk72sykpMBGMi2gjOPNfhtjme9h-qZ9zre7DajwizUceMpqAdg2xT8hBh-R3JIwyk6BL9dB5lRed5blZhJ_PmcgfEmnc0ueVGWhZ1o07jKOEVC5DDq8C5QXxUBhyPg/s320/563620_519283834759299_2115144667_n.jpg" width="228" /></a></div>
See you all in 2013.<br /><br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-43096103894681430602012-12-06T18:58:00.001+00:002012-12-06T18:58:53.464+00:002012: A Grid Odyssey We haven't published on the blog since September this year, which is a bit remiss of us.<br />
There are many reasons for this. Primarily, we were working through the final backlog of the DRI grant until October. The expansion of the Glasgow site to 4000 cores, tera-scale networking and changes to the disk farm have not been simple. Once the long-standing issues with the internal data network were resolved by the upgrade to the Extreme Networks equipment, additional issues around the placement of data by DPM became evident. This was not a trivial task to investigate. Stuart and Sam are in the process of developing a software patch to allow more sensible placement of data files within the cluster.<br />
<br />
In addition to this work, we are currently considering software and hardware changes to our data storage architecture in the new year. More on this in January.<br />
<br />
Again this year, Glasgow has been plagued by infrastructure problems which have caused several major disruptions to the site's operation. We are now in a position where a major upgrade programme is underway to deliver more robust power, fire suppression and air conditioning systems throughout the computer rooms.<br />
<br />
Despite these combined problems, the Glasgow site saw a return to 100% availability and reliability metrics for November in the WLCG accounting published earlier this week.<br />
Hopefully, this is how we will continue through the Christmas period and into 2013.<br />
<br />
As the end of the year approaches, with the winter solstice just over 15 days away and Christmas following shortly behind it, we would like to wish everyone a Merry Christmas and a Happy New Year from all of us at Scotgrid.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-798905930419948502012-09-03T16:00:00.001+01:002012-09-03T16:00:11.704+01:00All hands to the pumps, oh wait, that is Glycol Unfortunately, on Friday the Air Con fairy visited Glasgow and, due to a faulty pressure valve, decided to sprinkle some magic in one of our plant rooms by dumping liquid coolant onto the floor. We took emergency action and shut down the cluster, as the temperature in room 141 was climbing above 30 degrees centigrade. The faulty equipment and associated devices were replaced. Thankfully, there hasn't been any damage to the equipment, and we will be coming out of downtime and going back into production shortly.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-36643228811686743332012-09-03T15:54:00.004+01:002012-09-03T15:56:01.981+01:00The Higgs appears This year we haven't been great at keeping up with blog posts, but there are many reasons behind this. We have installed a new network and additional cores, taking us up to 4000 available job slots, and have upgraded the server infrastructure throughout GU Scotgrid's cluster. We have also fallen foul of infrastructure issues and have had problems with both the old and the replacement Air Con systems. Slowly, we are extracting ourselves from these issues, and recently Professor David Britton gave a lecture on the Grid's role in the announcement made at CERN in July of this year, during this year's <a href="http://www.turingfestival.com/cern/">Turing Festival</a>. Professor Higgs himself made a surprise appearance at the event.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjFwMLaT2rNZpikChhLl0_TmAzOLJFwAAHWYthTVCE9kKnlLqyS9LFweQ7cx8lvUIM0DC07KcdYhrZfHaEY2ewH0sBlpEePMgz1OGhJ9apCREgP57r0fwvmUpy37T8ROEEM1lYkA/s1600/IMG_1537.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjFwMLaT2rNZpikChhLl0_TmAzOLJFwAAHWYthTVCE9kKnlLqyS9LFweQ7cx8lvUIM0DC07KcdYhrZfHaEY2ewH0sBlpEePMgz1OGhJ9apCREgP57r0fwvmUpy37T8ROEEM1lYkA/s320/IMG_1537.jpg" width="240" /></a></div>
<div style="text-align: center;">
Professor David Britton (left) and Professor Peter Higgs</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Also presenting at the event were Professors Tejinder Singh Virdee and John Ellis of Imperial College and Dr Ben Segal from CERN. The event was one of the kickstart activities for the Turing Festival and enabled the public and academics to get a better overview of what has been involved in getting the experiments this far.</div>
<div style="text-align: left;">
<br /></div>
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-64682011574966262262012-08-03T23:01:00.000+01:002012-08-03T23:01:20.822+01:00Scotgrid Calling It has been a while since we last updated the blog, which is generally a sign of being busy. Unfortunately we have encountered several infrastructure issues recently which needed to be repaired. Predominantly these revolved around the air conditioning units on the roof of the Kelvin Building. That work was completed a few weeks ago but, as one thing was fixed, another issue presented itself in the form of a failing Air Handling Unit in room 141. The knock-on effect is that we can't take full advantage of the cluster servers located in that room, and the overall cluster is presently running at two-thirds capacity.<br />
<br />
While these events are less than optimal, they have allowed us to plan the next set of cluster upgrades, which will introduce another 256 job slots into the cluster. Thanks to the new resilient network fabric we have developed, the deployment of these services is no longer limited to the one room supporting 10 Gig interfaces.<br />
<br />
Other developments also include the re-introduction of an independent control network and a new WAN testing platform, perfSONAR. We will blog about these separately shortly.<br />
<br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-58591610116406836852012-05-24T16:38:00.001+01:002012-05-24T16:38:07.096+01:00CHEP Update As we are into the fourth day of CHEP, a quick overview of the activities of Scotgrid, GridPP and the conference as a whole is in order.<br />
<br />
We have presented our posters and have generated interest in storage, job failures, network security and some of the work we have conducted on IPv6. Several potential collaborations with other sites and developers have resulted from these presentations. Andy and Wahid gave several successful talks, and there was a high level of interest in the work being discussed.<br />
<br />
From a GridPP perspective, Chris Walker's poster on using Lustre for low-cost petascale storage also generated a large amount of interest. Talks given by other members of the collaboration were equally well received. <br />
<br />
The conference itself has covered the multiple developments within the field over the last 12 to 18 months, with presentations investigating a variety of topics including data federations, the future of CPUs and GPUs, ultra-high-speed networking and common software architectures for the experiments.<br />
<br />
The variety of techniques being deployed and approaches taken to Grid centric problems are always of interest.<br />
<br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-12450748477227987982012-05-19T02:35:00.003+01:002012-05-19T02:35:32.151+01:00Scotgrid in the Big Apple for CHEP<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjehyphenhyphenYablLpnOg4kV6fnqIStK3LUoTyKcB7ZYs9AyOIHU_lOlFVS2-yujwfRMVQfpJZ50-TjkJKkKWZdYhayJDV2MQgc6BQiqn1ec0o7eTG3QB_ET_clLaTBu6X4ykRkOlXC-OUsw/s1600/IMG_1058.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjehyphenhyphenYablLpnOg4kV6fnqIStK3LUoTyKcB7ZYs9AyOIHU_lOlFVS2-yujwfRMVQfpJZ50-TjkJKkKWZdYhayJDV2MQgc6BQiqn1ec0o7eTG3QB_ET_clLaTBu6X4ykRkOlXC-OUsw/s320/IMG_1058.jpg" width="240" /></a></div>
<br />
We are attending the WLCG workshop and CHEP in New York this week. We will be updating the blog regularly with details of the talks and papers we attend.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-68103845061266603902012-05-07T16:38:00.000+01:002012-05-07T16:38:06.088+01:00Stockholm LHCONE Meeting<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYYcmn7YKKkVHXSwHDrx1wOcGA8t9ni6lOCFCo5DnHTBkH3ne13ZZJAaF__L0yI2hPk7IowVHwRopKfU1bfk46tG-y9ZYGH8EyeE7wxBb2HEnzssFXSx5MRiuM7rgErv2ARfA5Iw/s1600/IMG_0550.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYYcmn7YKKkVHXSwHDrx1wOcGA8t9ni6lOCFCo5DnHTBkH3ne13ZZJAaF__L0yI2hPk7IowVHwRopKfU1bfk46tG-y9ZYGH8EyeE7wxBb2HEnzssFXSx5MRiuM7rgErv2ARfA5Iw/s320/IMG_0550.jpg" width="240" /></a></div>
<div style="text-align: center;">
Kristall in Sergels Torg, Stockholm</div>
<br />
<br />
We were in attendance at the <a href="http://lhcone.net/">LHCONE</a> meeting at <a href="http://www.kth.se/en">KTH</a> in Stockholm last week. The purpose of this collaboration is to investigate the efficient use of networks globally for LHC research. As usual it was an excellent meeting where the technical mechanisms for current and future network deployments were discussed and considered.<br />
<br />
The agenda can be found <a href="http://indico.cern.ch/conferenceDisplay.py?confId=179710">here</a>. Some of the highlights of the meeting included an excellent presentation by Erwin Laure on the Swedish and Scandinavian supercomputing and Grid computing infrastructure, Joe Mambretti's presentation on the <a href="http://www.gloriad.org/gloriaddrupal/">GLORIAD</a> global research network, Mike O'Connor's discussion of the technical configurations required to avoid asymmetric routing issues between LHCONE and the current production networks, and Domenico Vicinanza's presentation on <a href="http://www.geant.net/Services/NetworkPerformanceServices/Pages/perfSONARMDM.aspx">perfSONAR MDM</a>.<br />
<br />
In addition to these presentations, technical discussions were held on bandwidth reservation, ultra-high-speed networking and OpenFlow technologies. As these discussions develop through the network architecture groups, we will keep you up to date.<br />
<br />
Also, the weather in Stockholm was exceptional, and the KTH campus is worth a visit for its architecture alone. I would like to thank our hosts and all the other attendees for making this such an enjoyable and informative couple of days.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnwJBVAW9ipgyfhslD5jBem-0ogYal1rGS-j8CpqA6utd1S14AMTcPHZN6e-JHmbBoGNsWz6T4R45dDvao96BrJ1VFck_9m2K10qwl-UO0mABcYemFCzEhyrqEj-Z3F0yyRy6RmQ/s1600/IMG_0572.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnwJBVAW9ipgyfhslD5jBem-0ogYal1rGS-j8CpqA6utd1S14AMTcPHZN6e-JHmbBoGNsWz6T4R45dDvao96BrJ1VFck_9m2K10qwl-UO0mABcYemFCzEhyrqEj-Z3F0yyRy6RmQ/s320/IMG_0572.jpg" width="320" /></a></div>
<br />
<div style="text-align: center;">
KTH Campus Stockholm</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-84154822792218983702012-05-06T11:27:00.001+01:002012-05-07T16:38:25.654+01:00Preparing for IPv6 Generally, we don't repost news items on the blog, but this <a href="http://www.bbc.co.uk/news/technology-17938580">BBC</a> article gives a good indication of the changes underway globally in implementing IPv6.
Currently, the Glasgow Scotgrid test cluster is being revamped following our last spending cycle, and we are embarking on a full IPv6 test programme, specifically around running Grid services.
As this work progresses we will update the blog regularly.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-3175300130455673162012-05-03T15:50:00.000+01:002012-05-07T16:39:59.778+01:00GridPP At The Top Of Europe This news article appeared on the GridPP website and is worth reposting to our blog, as it gives an overview of the collaboration's efforts to date within the WLCG and with non-High Energy Physics (HEP) communities.<br />
<a href="http://www.gridpp.ac.uk/news/?p=2273" target="_blank">GridPP At The Top Of Europe</a><br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-18389456654694829672012-04-17T15:41:00.000+01:002012-05-07T16:40:37.823+01:00One XOS, Great Big Purple Packet Eater. Sure looks good to me. So we haven't been blogging a great deal since December, and for good reason. We found ourselves in the exciting position of being given additional funding to enhance our network capability, and we also had additional equipment to install into the cluster.<br />
<br />
First things first, however: as you may have read, we have had no end of issues with the older network equipment. We had a multi-vendor environment which, while adequate for 800 analysis jobs and 1200 production jobs, wasn't quite cutting the mustard, as we couldn't expand from there.<br />
<br />
The main reason was the 20 Gig link between the two computing rooms, which was suffering real capacity issues. Add in problems between the Dell and Nortel LAG and the associated backflow, sprinkled with a buffer memory issue on the 5510s, and you get the picture. In addition to this, we were running out of 10 Gig ports and therefore couldn't get much bigger without some investment.<br />
<br />
The grant award was therefore a welcome opportunity to fix this. After going to tender, we decided upon equipment from <a href="http://www.extremenetworks.com/">Extreme Networks</a>. The proposed solution allowed for a vast 160 Gigabit interconnect between the rooms, broken into two resilient link bundles in the core, and an 80 Gigabit edge layer. In addition to this connection, we also installed a 32-core OM4-grade fibre optic network for the cluster, which will carry us into the realm of 100 Gigabit connections when they become available and cheap enough to deploy sensibly.<br />
<br />
We now have 40 x 40 Gigabit ports, 208 x 10 Gigabit ports and 576 x 1 Gigabit ports available for the cluster.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzgI2BJieK-bEZEWalBoiTBm5Ppnx-pv6W1M8SmCbllxGMQuY6ul8ZACpG_HvXsy6ldOjxJLLEf9XHDmyk647nWwzL1zCuqSWA48dg5Lkm4bbVFlAUp811Cq9XF8IinB2nStqVLQ/s1600/IMG_0467.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzgI2BJieK-bEZEWalBoiTBm5Ppnx-pv6W1M8SmCbllxGMQuY6ul8ZACpG_HvXsy6ldOjxJLLEf9XHDmyk647nWwzL1zCuqSWA48dg5Lkm4bbVFlAUp811Cq9XF8IinB2nStqVLQ/s320/IMG_0467.jpg" width="320" /></a></div>
<div style="text-align: center;">
There is quick and clever and here it is</div>
<br />
The new deployment utilises <a href="http://www.extremenetworks.com/products/summit-x670.aspx">X670</a>s in the Core and <a href="http://www.extremenetworks.com/products/summit-x460.aspx">X460s</a> at the Edge.<br />
<br />
The magic of the new Extreme network is that it uses EAPS (so bye bye Spanning Tree, and good riddance) as well as MLAG, which allows us to load-share traffic across the two rooms, so having 10 Gigabit connections for disk servers in one room is no longer an issue.<br />
<br />
Then it got a bit better. With ExtremeXOS we can now write scripts to handle events within the network, which ties in with the longer-term plan for a Cluster Expert System (ARCTURUS) that we are currently designing for test deployment. More on this after August.<br />
<br />
Finally, it even comes with its own event monitoring software, Ridgeline, which gives a GUI interface to the whole deployment.<br />
<br />
We stripped out the old network, installed the new one and, after some initial problems with the configuration (which were fixed in a most awesome fashion by Extreme), got it up and running. What we can say is that the network isn't a problem any more, at all.<br />
<br />
This has allowed us to start concentrating on other issues within the cluster and to look at the finalised deployment of the IPv6 test cluster, which has benefited in terms of hardware from the new network install. Again, more on this soon.<br />
<br />
Right, so now to the rest of the upgrade: we have also extended our cold aisle enclosure to 12 racks, have a secondary 10 Gig link onto the campus being installed, and have a UPS. In addition to this, we refreshed our storage using Dell R510s and MD1200s, as well as buying 5 Interlagos boxes to augment the worker node deployment.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFlOzxQ1ZvHSVkxo2MlE4vy0NPj-6CVcyy1t9PcgsVR9G3fuVxwtZqO2qN8qoYwJoRrfDZ2Narhw0xR1nLS_bvyMuvx3IWf6crywRDxMrpDbGWUb0R1HL1mtKfZ6o01EYH6PxZrw/s1600/IMG_0466.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFlOzxQ1ZvHSVkxo2MlE4vy0NPj-6CVcyy1t9PcgsVR9G3fuVxwtZqO2qN8qoYwJoRrfDZ2Narhw0xR1nLS_bvyMuvx3IWf6crywRDxMrpDbGWUb0R1HL1mtKfZ6o01EYH6PxZrw/s320/IMG_0466.jpg" width="320" /></a></div>
<div style="text-align: center;">
The TARDIS just keeps growing</div>
<br />
We also invested in an experimental wi-fi user access system and will be trying it out in the test cluster to see if a wi-fi mesh environment can support a limited number of grid jobs. As you do.<br />
<br />
In addition to this, we improved connectivity for the research community in PPE at Glasgow and across the campus as a whole, with part of the award being used to deliver the resilient second link and associated switching fabrics.<br />
<br />
It hasn't been the most straightforward process, as the decommissioning and deployment work was complex and very time-consuming in our attempt to keep the cluster up and running for as long as possible and to minimise downtime.<br />
<br />
We didn't quite manage this as well as expected, due to the configuration issues on the new network, but we have now upgraded the entire network and removed multiple older servers from the cluster, allowing us to enhance the entire batch system for the next 24 to 48 months.<br />
<br />
As we continue to implement additional upgrades to the cluster we will keep you informed.<br />
For now it is back to the computer rooms.<br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-37760346669248893992012-02-27T15:31:00.003+00:002012-02-27T16:20:26.710+00:00LSC files and emailAddress redux This post involves a very complicated journey to get to a simple place.<br /><br />The fundamental problem is around the catchily titled OID 1.2.840.113549.1.9.1.<br /><br />No, wait, let me take a step back. On the Grid, we use certificates for authentication. An X509 certificate is, as with most certificates, a signed set of assertions, and a public key. As with the rest of the X500 standards, its native language is something called ASN.1 (Abstract Syntax Notation 1) (aka X208, and the later revision X680), held in files encoded by the DER (Distinguished Encoding Rules).<br /><br />The fundamental takeaway from that tech-dump is that X509 certificates are not in plain text, and there are multiple standards required in order to understand their contents.<br /><br />So when someone says their certificate Distinguished Name is '/O=SomeUni/OU=SomeDept/L=group/CN=JohnSmith' ... that's not quite accurate. What they really mean is that their certificate DN is some set of objects that can be unambiguously matched to that ASCII text.<br /><br />That happens because there are universally agreed mappings between the actually stored OIDs and their text representations (e.g. CN is OID 2.5.4.3).<br /><br />Unfortunately, the agreement breaks down a bit for the emailAddress field, with some software mapping it to Email, and others to emailAddress. 
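The practical effect of that disagreement can be made concrete with a short sketch (a hypothetical helper and a made-up DN, not part of gLite, ARC or any other Grid middleware): the two text renderings of the same underlying OID defeat naive string comparison unless one spelling is canonicalised to the other first.

```python
# Hypothetical helper: OID 1.2.840.113549.1.9.1 is rendered as "Email" by
# some software and as "emailAddress" by others, so naive string comparison
# of otherwise identical DNs fails unless one spelling is rewritten.
def normalise_dn(dn: str) -> str:
    """Rewrite the older '/Email=' rendering to '/emailAddress='."""
    return dn.replace("/Email=", "/emailAddress=")

# Two text renderings of the same underlying certificate DN (made up):
a = "/O=SomeUni/OU=SomeDept/L=group/CN=JohnSmith/Email=j.smith@someuni.ac.uk"
b = "/O=SomeUni/OU=SomeDept/L=group/CN=JohnSmith/emailAddress=j.smith@someuni.ac.uk"

assert a != b                              # naive comparison sees two different DNs
assert normalise_dn(a) == normalise_dn(b)  # canonicalised forms agree
```

The real middleware works on the decoded ASN.1 objects rather than strings, which is exactly why the ASCII renderings can drift apart in the first place.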
By the PKCS#9 standard, one could argue that it should be emailAddress - but that doesn't help us get software working.<br /><br />Fortunately, all of this is not a problem unless we want to store certificate DNs in ASCII, _and_ want to have email addresses in the DN.<br /><br />Yeah, you can see where this is going, can't you?<br /><br />In the UK, <a href="http://www.nationalgridservice.blogspot.com/2012/02/on-email-addresses-in-distinguished.html">Jens has been working</a> to allow us to not have them in DNs. However, in the short term, they are present.<br /><br />One particular case where ASCII representations of the DN are used is in LSC files, which are used to authenticate VOMS servers. If the VOMS server DN matches the DN in the LSC file, and the cert was signed by the CA DN in the LSC file, _and_ the certificate chain is signed by a trusted root, then it's valid. This process means that we don't need to distribute lots of VOMS server certs, just the root CAs, and a small note (that shouldn't change over renewals) of the server DN.<br /><br />I've been tidying up our ARC install here, and during the process managed to break things. Not unusual for me (one of the reasons I avoid tidying at all costs!), but this one was quirky. I'd put the vomsdir under CFEngine control, so that it was sync'd with all the other servers, and suddenly it stopped accepting the scotgrid VO.<br /><br />Root cause, as if you can't guess by now: the LSC file, and the emailAddress. It looks like the gLite stack expects it one way, and ARC the other. Of course, by the time you read this, that's probably been fixed somewhere, but not in the version we had installed.<br /><br />It turns out that there's one trick in LSC files that saves this case. 
Let me put the LSC file in here:<br /><br /><blockquote><pre>/C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=svr029.gla.scotgrid.ac.uk/Email=grid-certificate@physics.gla.ac.uk<br />/C=UK/O=eScienceCA/OU=Authority/CN=UK e-Science CA<br />------ NEXT CHAIN ------<br />/C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=svr029.gla.scotgrid.ac.uk/emailAddress=grid-certificate@physics.gla.ac.uk<br />/C=UK/O=eScienceCA/OU=Authority/CN=UK e-Science CA<br /></pre><br /></blockquote><br /><br />The 'NEXT CHAIN' line lets one put multiple entries in the file. However, it appears that ARC isn't reading multiple entries, only the first one. So, in this case, I put the ARC-friendly one first, so it matches fine - and the gLite stack tries again, finds the second, and thus succeeds.<br /><br />Important notes: I can't find anyone else with a field report of NEXT CHAIN working in the gLite stack. This is such a field report. It doesn't appear to work with ARC.Stuart Purdiehttp://www.blogger.com/profile/08473287949581285669noreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-29141776160671360792011-12-21T10:00:00.007+00:002011-12-21T22:11:09.372+00:00Batch system juggling We've been a bit quiet up here recently. This is normally a sign of either nothing interesting happening, or entirely too many interesting things happening. Opinions on that may divide, but I think it's closer to the latter...<br /><br />One of the recent bits of fun that occurred was with our batch server. This story actually starts a long time ago: about this time last year. At that point, we started to get intermittent memory errors from the Torque server - corrected by ECC - but that's generally a sign that the RAM's about to fail. Given that the batch server is a single point of failure for a site, that's not a good thing.<br /><br />So I spent some time preparing a spare box, and being ready to move the batch system over, in case it failed over the winter break. Which, after all that prep, it didn't, and the errors stopped. 
On the expectation that the current hardware was nearing end of life, we ordered a new box early this year, and have had it sitting in a machine room for a while.<br /><br />Unfortunately, we didn't get time to have it running a tested batch system before our power supply started to ... well, insert colourful metaphor here, describing the 8 months where we were affected by lack of power.<br /><br />Power returned to a stable supply in September, and so we set about catching up on things. One of the things we got around to was software versions. Whilst we didn't intend to update the Torque version, and managed to avoid it for a bit, the gLite developers eventually managed to sneak the update past us as part of an ordinary gLite update. Strictly, this didn't affect the batch server, just all the CE's, making them incompatible with the previous version of Torque.<br /><br />Whilst a clever manoeuvre, reminiscent of Odysseus' Pony, it did leave us with a conundrum of either reverting the gLite update, or running forward with it. Neither were options of good character, but running forward did have some actual documentation; hence it was full speed ahead.<br /><br />Which worked out well enough. The Torque 2.5.7 packages were set to use Munge, so getting that installed and tested as a first step helped it go smoothly. To preserve compatibility in file locations, we used /etc/sysconfig/pbs_mom to put the pbs working directories in the same place as previously - meaning we didn't have to reconfigure any other tools.<br /><br />What didn't go so smoothly was the memory leak in the server.<br /><br />Which gave it a runtime of around 36 hours between crashes.
Actually, not even crashes - we found that the pbs_server process hit either<br /><br /><blockquote><br />12/05/2011 10:19:12;0080;PBS_Server;Req;req_reject;Reject reply code=15012(PBS_Server System error: No child processes MSG=could not unmunge credentials), aux=0, type=AlternateUserAuthentication, from tomcat@svr021.gla.scotgrid.ac.uk<br /></blockquote>or<br /><blockquote> <br />10/29/2011 18:11:24;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Cannot allocate memory (12) in send_job, fork failed<br /></blockquote><br /><br />and then sat around moaning. Had it crashed hard, then the auto-restart would have caught it. Ho, hum, one for the Fast Fail philosophy there.<br /><br /><br />By this point, my proofreader is pointing out that I started off talking about hardware, and am now talking about software. The punchline is that the new server that we never got a chance to use has a lot more RAM than the old server. Therefore we wanted to move the server from the old hardware to the new, to give it a lot more RAM space. That won't fix the memory leak, but it will mitigate the problem a bit.<br /><br />Conventionally, this would involve draining the cluster, repositioning the CE's and then starting up everything again. Had we done that, this blog post would be over now.<br /><br />Instead, we did a rolling update. This let us move things over without having to do a full drain. The biggest problem with a full drain is that, while most of the jobs finish within a period much shorter than the limit, there are always some that take the full duration. This leaves us with an empty cluster, doing nothing, for 24 hours or so, waiting on a couple of jobs to finish.<br /><br />So, instead, by moving things in small batches, we can keep most of the nodes working, and thus get more work out of things.
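Since the server moaned rather than died, anything watching the process alone missed it; a watchdog would need to spot those failure signatures in the log instead. A sketch of the detection side (the log path and restart command in the comments are assumptions, not our actual setup):

```shell
#!/bin/sh
# Sketch: decide from a pbs_server log line whether the server has hit
# one of the slow-death states quoted above and would need a kick.
needs_restart() {
    case "$1" in
        *"could not unmunge credentials"*) return 0 ;;
        *"Cannot allocate memory"*"fork failed"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical use from cron, scanning the tail of today's log:
#   tail -n 100 /var/spool/pbs/server_logs/$(date +%Y%m%d) | \
#   while read -r line; do
#       needs_restart "$line" && service pbs_server restart && break
#   done
```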
Step zero is to disable cfengine, otherwise it tends to try and 'fix' things part way through.<br /><br />Step one is to drain a CE, which we did over a weekend, and a small number of nodes, which we put offline on the Sunday morning.<br /><br />Come Monday, I set up and tested basic operations with the new batch server, and then moved the freed-up nodes across to it. Once those were tested (which shook out a couple of issues about versioning of some libs), we pointed the CE at the new batch server, and then ran a test job through it. (It turns out that Atlas are fast enough to sneak some pilots through a 2 minute window for a test job. However, only a few, so they actually functioned as effective tests, without compromising the site if they failed.)<br /><br />After that, it's time to offline another CE, and then some more nodes, and start moving nodes over as they emptied. In the end I scripted this:<br /><br /><blockquote><br />#!/bin/sh<br /><br />NODE=$1<br />RUNNING=$(qstat -n -1 | grep "$NODE" | wc --lines)<br /><br />if [ "x${RUNNING}" != "x0" ]<br /> then<br /> echo $NODE: Still $RUNNING jobs going, skipping<br /> exit 2<br />fi<br /><br />CORES=$(qmgr -c "print node ${NODE}" | grep "np = " | cut -d= -f2)<br /><br />FROM=svr666<br />TO=svr999<br /><br />echo $NODE: Moving to ${TO} with ${CORES} cores<br /><br />ssh ${TO} "~/addNode.sh ${NODE} ${CORES}"<br /><br />ssh ${NODE} "service pbs_mom stop"<br />scp config.mom.svr666 ${NODE}:/var/spool/pbs/mom_priv/config<br />ssh ${NODE} "service pbs_mom start"<br /><br />ssh ${FROM} "~/deleteNode.sh ${NODE}"<br /></blockquote><br /><br />In theory one can run qmgr remotely, rather than ssh-ing to the batch servers and running a script. In practice, with the different versions of Torque, I couldn't get that to work.
Note the automation of the mom config switch as well; and that this script checks that the node is empty.<br /><br />This reduced the gradual move of nodes to a process of cron'ing the script, and offlining nodes occasionally.<br /><br />The net result was that we were operating at around 80% capacity for 48 hours, and it was all rather uneventful - in a good way. The final step was to update the cfengine config and re-enable it.<br /><br />One of the plus points of the above script is that it should be simple to adapt to two distinct batch systems, which means that if we end up moving away from Torque, we should be able to do that without downtime too.Stuart Purdiehttp://www.blogger.com/profile/08473287949581285669noreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-82329980473289528432011-09-23T14:19:00.003+01:002011-09-23T14:21:27.952+01:00Leaving LyonThe EGI Tech Forum is winding down, with only a few talks remaining. It's been a great meeting, with a wide range of talks on all areas of Grid Computing. Lots to think about and new ideas to try out!David Crookshttp://www.blogger.com/profile/07412551479798045933noreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-43530001193679155862011-09-21T13:01:00.003+01:002011-09-21T13:02:40.063+01:00Scotgrid goes SouthLast week we attended the bi-annual GridPP Collaboration meeting.<br />
The venue this time was CERN itself and the meeting was, as ever, incredibly useful.<br /><br />We were lucky enough to have presentations from the Experiments, the LHC, EGI and the WLCG community as well as presentations from across the UK collaboration.<br /><br />A full programme of the meeting is available here:<br />
<br />
http://www.gridpp.ac.uk/gridpp27/<br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2GhasLccBcdw1NOidR6_Lp4Cb2rA4VGcjTOH3WiK42DvXHvrMcdxo0mRs9YaaSDqnaorAn_0AmCBhg2bGzbCjK-dnHAz4LtmNW7Wa3JesJgN5mYrYcEbazQhpMR6hsLwO0_laqw/s1600/DSC01427.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="205" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2GhasLccBcdw1NOidR6_Lp4Cb2rA4VGcjTOH3WiK42DvXHvrMcdxo0mRs9YaaSDqnaorAn_0AmCBhg2bGzbCjK-dnHAz4LtmNW7Wa3JesJgN5mYrYcEbazQhpMR6hsLwO0_laqw/s320/DSC01427.JPG" width="320" /></a></div>
Above is a picture of our own Dr Crooks presenting on the Glasgow Security ModelUnknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-88521987261542636132011-09-19T07:34:00.004+01:002011-09-19T07:46:40.282+01:00EGI Tech Forum 2011<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQ3FO8CAsx9PdqjA2-UvJtBHbonctEWEC6iTSz4pDLCza7rb4MoFiZgH9Y_mTjhnEqgF3yM0CgpNnvcRVLvcUBbHi0u04yP4G3303Ux7wwv60qsK78xtRQJmJ6dZbSHhI29xOX/s1600/IMG_0040+%25281%2529.jpg" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 240px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQ3FO8CAsx9PdqjA2-UvJtBHbonctEWEC6iTSz4pDLCza7rb4MoFiZgH9Y_mTjhnEqgF3yM0CgpNnvcRVLvcUBbHi0u04yP4G3303Ux7wwv60qsK78xtRQJmJ6dZbSHhI29xOX/s320/IMG_0040+%25281%2529.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5653958363768671234" /></a><i>Bonjour Lyon!</i><div><br /></div><div>After last week's GridPP 27 meeting in CERN, this week we are in Lyon for the 2011 EGI Tech Forum, running from Monday until Friday. You can follow the Forum online using some of the links <a href="http://tf2011.egi.eu/media_room/index.html">here</a>.</div><div><br /></div><div>More later - time now to find some coffee before the first session...</div>David Crookshttp://www.blogger.com/profile/07412551479798045933noreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-67710857772785090632011-08-25T17:59:00.000+01:002011-08-25T18:00:35.817+01:00Busy DisksAfter checking a test 10 gig Disk Server deployment we uncovered an interesting pattern in storage network activity, and how our 10 Gig switch copes with multiple connections at 10 Gigabit. The captures below were taken over a 5 minute window of operation and show just how bursty the traffic patterns from these devices can be.<br />
<br />
The graphs show all interfaces on our Dell 8024F, measured in Mbps. The captures are ordered top to bottom, with the initial capture at the top.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivVdKQ-HhZsdcb984Bw72OCoV7eUPGVc5NHI490LKgW1A14y2aeP5FR7ZbZzMHROthKkWa2hi23NccKe4WG4nLnkaXZ57imVlYqtEeg9lwoRwxc-CyM2ji70B1NELb0__6gBBE_A/s1600/1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="168" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivVdKQ-HhZsdcb984Bw72OCoV7eUPGVc5NHI490LKgW1A14y2aeP5FR7ZbZzMHROthKkWa2hi23NccKe4WG4nLnkaXZ57imVlYqtEeg9lwoRwxc-CyM2ji70B1NELb0__6gBBE_A/s320/1.jpg" width="320" /></a></div>
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYlxHgodE7bwJuxL6nH2whs0hggPLIcRwgtuOxX1gExYt2NoM9RMrpPWM77Ab3NTrrwtS-VqAXOZNNv9bG8PsSsnGnN6OKx2zu98foFh9wnXnTUartHj3TEZkOxG7hZCyjAcQskg/s1600/2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="181" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYlxHgodE7bwJuxL6nH2whs0hggPLIcRwgtuOxX1gExYt2NoM9RMrpPWM77Ab3NTrrwtS-VqAXOZNNv9bG8PsSsnGnN6OKx2zu98foFh9wnXnTUartHj3TEZkOxG7hZCyjAcQskg/s320/2.jpg" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi15Ta3RUwQ65M7ZzyrXgQha9NrdIpMdLCKS4t7Pc_5dIGnIyWIJt51N540EwAJNKe_WWFr8zWS7E2O7RWcX3EX0q2PlD5K2iUB_RtFbA9fZXP-9UU_rqfnqRoRRA7kHAwIqHhWWA/s1600/3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="194" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi15Ta3RUwQ65M7ZzyrXgQha9NrdIpMdLCKS4t7Pc_5dIGnIyWIJt51N540EwAJNKe_WWFr8zWS7E2O7RWcX3EX0q2PlD5K2iUB_RtFbA9fZXP-9UU_rqfnqRoRRA7kHAwIqHhWWA/s320/3.jpg" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgs6fVpGUV9welKIoM6w02pX5bVmFBEL1nSvBcV81OPt13Un9isLoy07DH0VsJ20wHr4HprTyHGIYXEIH0XQVp6CB3qcV_fB50mPA1ir34OnjcCXKlCHjGSNvV5eAfscDlBE-wn3g/s1600/4.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="206" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgs6fVpGUV9welKIoM6w02pX5bVmFBEL1nSvBcV81OPt13Un9isLoy07DH0VsJ20wHr4HprTyHGIYXEIH0XQVp6CB3qcV_fB50mPA1ir34OnjcCXKlCHjGSNvV5eAfscDlBE-wn3g/s320/4.jpg" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9jkKDQU6K5wkvAf9imLfjZZJ_XM3GOFSYhVCMuMaAZuKVlXjEfn5ejpAtKPTrABxWmhsjv7HpC1pi0Xjoakhb608s5d_F9ZsPYvdopQIGXlNRd0X8xWHTMddEhMDI-ZVJGoUtFg/s1600/5.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="176" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9jkKDQU6K5wkvAf9imLfjZZJ_XM3GOFSYhVCMuMaAZuKVlXjEfn5ejpAtKPTrABxWmhsjv7HpC1pi0Xjoakhb608s5d_F9ZsPYvdopQIGXlNRd0X8xWHTMddEhMDI-ZVJGoUtFg/s320/5.jpg" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgybiOPF775T0hV9oeJ8AhR81C82mUjbVtUr0hyphenhyphenutesaccHaFBGZFJdyfTjPwOMad4G54REoT5qGjjelevVLctSMaxq1SVqmTDFH2rKrdJGd6bIKQvK7DwzHuMneMfEVnicr8qGgA/s1600/6.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="196" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgybiOPF775T0hV9oeJ8AhR81C82mUjbVtUr0hyphenhyphenutesaccHaFBGZFJdyfTjPwOMad4G54REoT5qGjjelevVLctSMaxq1SVqmTDFH2rKrdJGd6bIKQvK7DwzHuMneMfEVnicr8qGgA/s320/6.jpg" width="320" /></a></div>
While the disk servers have been hammering away, the intra-room round trip time has averaged 0.40 ms between devices, and the CPU on the core Dell seems more than happy to handle these loads: its utilisation is presently approximately 20%.<br />
<br />
We are planning to enable QoS metrics on disk server traffic shortly, to compare response times between QoS and non-QoS disk servers.<br />
<br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-50861105919517694782011-08-25T17:22:00.000+01:002011-08-25T17:23:09.140+01:00News Flash from ScotGrid LabsIn my last post, we were investigating deployments of IPv6 on the test cluster, the first of which used SLAAC to assign addresses to hosts. Interestingly enough, it worked first time out of the tin.<br />
<br />
An IPv6 Traceroute from the web is shown below:<br />
<br />
traceroute to 2001:630:40:ef0:230:48ff:fe5a:4b7 (2001:630:40:ef0:230:48ff:fe5a:4b7), 30 hops max, 40 byte packets<br />
1 2001:1af8:4200:b000::1 (2001:1af8:4200:b000::1) 1.600 ms 1.813 ms 1.882 ms<br />
2 2001:1af8:4100::5 (2001:1af8:4100::5) 1.320 ms 1.392 ms 1.465 ms<br />
3 be11.crs.evo.leaseweb.net (2001:1af8::9) 2.587 ms 2.631 ms 2.619 ms<br />
4 linx-gw1.ja.net (2001:7f8:4::312:1) 8.475 ms 8.466 ms 8.453 ms<br />
5 ae1.lond-sbr4.ja.net (2001:630:0:10::151) 78.338 ms 78.388 ms 78.376 ms<br />
6 2001:630:0:10::109 (2001:630:0:10::109) 9.900 ms 9.479 ms 9.446 ms<br />
7 so-5-0-0.warr-sbr1.ja.net (2001:630:0:10::36) 13.320 ms 13.196 ms 13.317 ms<br />
8 2001:630:0:10::296 (2001:630:0:10::296) 18.705 ms 18.542 ms 18.793 ms<br />
9 clydenet.glas-sbr1.ja.net (2001:630:0:8044::206) 18.947 ms 18.931 ms 18.948 ms<br />
10 2001:630:42:0:3e::9a (2001:630:42:0:3e::9a) 19.434 ms !X 18.214 ms !X 17.682 ms !X<br />
<br />
<br />
The next phase of testing will be to enable a webserver to speak both IPv4 and IPv6 using this access mechanism, and then to move on to Grid services.<br />
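A first-order check for that phase is simply whether a name resolves to addresses in both families. A minimal sketch (classifying by literal form, which is enough for output from tools like getent; the hostname in the comment is hypothetical):

```shell
#!/bin/sh
# Sketch: classify an address literal as IPv4 or IPv6, e.g. to check
# that a dual-stacked webserver publishes both A and AAAA records.
addr_family() {
    case "$1" in
        *:*) echo inet6 ;;  # any colon marks an IPv6 literal
        *)   echo inet ;;
    esac
}

# Hypothetical use:
#   getent ahosts www.example.ac.uk | awk '{print $1}' | sort -u | \
#   while read -r a; do echo "$a -> $(addr_family "$a")"; done
```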
<br />
<br />
I will post up a more detailed explanation of the mechanisms used for this soon.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-76712198726730792682011-08-23T16:02:00.000+01:002011-08-23T16:03:13.185+01:00Two Stacks are better than oneLeading on from the last post, we have also re-introduced a new test cluster. This infrastructure is housed within the same rack as our old worker nodes but is completely independent of the production cluster. The Dell 8024F is supported by 5 servers and a Dell 5000 series switch, connected via an independent 1 gigabit fibre link to the University's network.<br />
<br />
The purpose of this cluster is to test IPv4/IPv6 dual-stack connectivity for Grid services, switch-based security mechanisms, and SL6 NAT, without fear of impacting the real cluster.<br />
<br />
The IPv6 connectivity model testing will be in multiple phases which include:<br />
<br />
* <a href="http://en.wikipedia.org/wiki/IPv6#Stateless_address_autoconfiguration">SLAAC</a> <br />
* <a href="http://en.wikipedia.org/wiki/IPv6#Configured_and_automated_tunneling_.286in4.29">IPv6 to IPv4 tunneling</a><br />
* IPv6 Routing<br />
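For the SLAAC phase, the quickest sanity check on a node is whether an interface has picked up a global-scope address. A sketch that parses output from `ip -6 addr show` (global unicast currently lives under 2000::/3, hence the leading 2 or 3; the interface name and sample lines are illustrative):

```shell
#!/bin/sh
# Sketch: given output from `ip -6 addr show dev eth0 scope global`,
# report whether a global (2000::/3) address is present - i.e. whether
# SLAAC has done its job. Link-local fe80:: addresses don't count.
has_global_inet6() {
    printf '%s\n' "$1" | grep -q '^ *inet6 [23]'
}
```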
<br />
<br />
This framework is designed to comply with the <a href="https://w3.hepix.org/ipv6-bis/doku.php?id=ipv6:testbed">HEPIX IPv6 Project</a> and to look at the possible connection models required by Tier-2s to utilise IPv6. Additionally, we will be testing a wide variety of Grid-enabled applications and associated systems, such as Nagios, to investigate potential issues within a dual-stack deployment.<br />
<br />
More on this soon.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-1344702519208652292011-08-23T15:40:00.003+01:002011-08-23T16:04:23.032+01:00Night of the Return of the Living Worker NodesAs Glasgow is currently being used as one of the sets for World War Z, we thought it only apt that we too resurrect the dead and get them to do our bidding. No, we haven't embraced "mad" science. <br />
<br />
During the power work we decided to alter the layout of 243d. Historically, the room had housed a mainframe, including operators' booths. One of these booths still existed within 243d, so we took down one of the walls and added a new cabinet.<br />
<br />
While the wall was being removed, we covered the cluster and powered it off to minimise dust ingestion. If you wish to gift-wrap a cluster, we have plenty of experience in this field; however, our wrapping is presently limited to blue plastic.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRejfxkrhyOZ1QVmFiKmeW7jR60VqaKnyzVH6ZHhWtX-cehlKhdE1zraBq7IRb1VSLZ89gDXvnqNZP9rL_YCtBSK5bMRSZ5MmqYEw6qa5-4rCpW1EHpRuT8sxz0vDOPzmyUwbBCw/s1600/cluster.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="191" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRejfxkrhyOZ1QVmFiKmeW7jR60VqaKnyzVH6ZHhWtX-cehlKhdE1zraBq7IRb1VSLZ89gDXvnqNZP9rL_YCtBSK5bMRSZ5MmqYEw6qa5-4rCpW1EHpRuT8sxz0vDOPzmyUwbBCw/s320/cluster.jpg" width="320" /></a></div>
<br />
<br />
After the wall had been removed, we cleared out the computer room and re-organised the storage cabinets, cabling and computing cabinets. In 243d there was a pile of 6-year-old disused worker nodes, plus racked worker nodes whose PDU had been damaged during one of our many power cuts over the last 12 months. In addition, we found and rebuilt a Dell rack, and we had a spare Nortel 5510 switch.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhiYx1hyf9JR4xTPyKnotM5pY6DL9pj-ao3I8l3HQogGEzWZ7OMZ5wiTEfPRR1WrITwB0MrZjdbU47NAhlaG9d3YqrFeOPi8cUs59h9Boyq5g3NeWX4T_nTqsc7_n6USHSpgHAcQg/s1600/IMG_0331.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhiYx1hyf9JR4xTPyKnotM5pY6DL9pj-ao3I8l3HQogGEzWZ7OMZ5wiTEfPRR1WrITwB0MrZjdbU47NAhlaG9d3YqrFeOPi8cUs59h9Boyq5g3NeWX4T_nTqsc7_n6USHSpgHAcQg/s1600/IMG_0331.jpg" /></a></div>
<br />
<br />
<br />
With the space newly available from the removal of the wall in 243d, we got a tile cut and deployed the rack. The rack connects back to the older Stack01 via a copper gigabit Ethernet connection. This deployment will give us up to approximately 100 job slots once the nodes are fully configured.<br />
<br />
<br />
<br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-47130621773140350542011-08-12T16:20:00.004+01:002011-08-12T16:22:48.717+01:00Running at capacity again
<br />... after the shutdown. Slightly delayed, due to coming back during a low point in Atlas work, which is now past us.
<br />
<br />Here's a graph of data moved from our storage element, and you can probably pick out the rather subtle peak when the last batch of analysis traffic started (taking us up to capacity):
<br />
<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7bCbTu9Znce3fcBMdD85wbwR4Nu7MdEqfK8sia-MLdM3RXDST94u4vnrCiNpynotyd1rcU4wetiMAq26FSxM3Z-aHaZaRNGxozKYU28KxsV7WIZ5cMGKCHf6QcY3gqEizOQ-Bzg/s1600/SmallAmountOfNetworkTraffic.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 111px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7bCbTu9Znce3fcBMdD85wbwR4Nu7MdEqfK8sia-MLdM3RXDST94u4vnrCiNpynotyd1rcU4wetiMAq26FSxM3Z-aHaZaRNGxozKYU28KxsV7WIZ5cMGKCHf6QcY3gqEizOQ-Bzg/s320/SmallAmountOfNetworkTraffic.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5639990111259509346" /></a>
<br />Stuart Purdiehttp://www.blogger.com/profile/08473287949581285669noreply@blogger.com0tag:blogger.com,1999:blog-32189452.post-12771758044099655892011-08-10T16:45:00.002+01:002011-08-10T16:50:24.510+01:00Power startup, situation (hopefully) normalThe planned power work in the Kelvin Building was completed this morning and we have been transferred back to our proper power feed from the generators. The power startup went smoothly and the building has returned to normal.
<br />
<br />The Scotgrid cluster was restarted after the power was seen to be stable and we came out of downtime at 2.20 pm. We will monitor our situation, but we hope that this power work will improve our stability over the coming months.
<br />David Crookshttp://www.blogger.com/profile/07412551479798045933noreply@blogger.com0