As part of the ongoing developments to the ScotGrid cluster at Glasgow, we have decommissioned our final LCG-CE, which resided on svr021. The removal of this CE allows us to concentrate support and development on two CE platforms: CREAM and ARC. We are planning a series of tests on the three CREAM CEs we have deployed at Glasgow, in an attempt to better understand their maximum loading for running jobs and how to tune them to get the maximum efficiency from this service.
Additionally, we will be watching our availability metrics over the next month, as the LCG-CE was one of the cornerstones of Steve Lloyd's tests of our overall availability. This will now be monitored primarily through our SRM availability.
The reasons for decommissioning the LCG-CE are that we would have had to remove it at some point in the near future anyway, that none of the big VOs have issues submitting to CREAM CEs, and that it simplifies our internal support requirements.
The servers running CREAM are svr008, svr014 and svr026.
Thank you LCG-CE and goodnight.
Monday, February 21, 2011
Covering up problems with CREAM
For some days now, ScotGrid Glasgow has been operating with only CREAM CEs, having turned our final lcg-CE off around the 14th. I'll let Mark cover the details of this in his later post, but I wanted to briefly mention one of the minor configuration details that caused some problems for us initially.
The gridmapdir (usually /etc/grid-security/gridmapdir) is a somewhat integral part of the pool account mapping system in LCG/gLite services. It contains one (empty) file for each pool account, plus hard-links to them from each DN(+VOMS Role) mapped to them. Basically, it's a cheap way to ensure that you don't get multiple DNs mapped to the same account (as you can always count the number of hard-links to an inode).
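As a toy illustration of that trick (a sketch only - this is not how the middleware manages the directory, and the DN name below is made up):
# two pool accounts, each represented by an empty file
mkdir -p /tmp/toy-gridmapdir && cd /tmp/toy-gridmapdir
touch atlas001 atlas002
# "mapping" a DN is just hard-linking its URL-encoded form to a free account file
ln atlas001 '%2fdc%3dch%2fdc%3dcern%2fcn%3dsome%20user'
# link count 2 means the account is leased, link count 1 means it is still free
stat -c '%h %n' atlas001 atlas002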
We share our gridmapdir, over NFS, to all of our CEs, to ensure that any incoming job from a given user is consistently mapped. Unfortunately, this led to our minor configuration gaffe (which I just fixed).
The lcg-CE, you see, is configured to set the ownership and permissions on the gridmapdir to 0755 root:root. This is fine for it, since lcg-CEs do strange things like running their services with root permissions, and it prevents anything else from messing up the mappings.
CREAM CEs (using glexec) need to have their gridmapdir as 0775 root:glexec, a change which we hadn't made when we installed them (and which probably YAIM couldn't have done for us). This meant that, for the whole time the CREAM CEs had been installed, they'd never been able to create a new mapping in the gridmapdir, as they try to do that as members of the glexec group.
We never really noticed this problem while we had lcg-CEs which were busy, as the lcg-CE would almost always have also received jobs from the user previously and already performed the mapping.
Now that we don't have an lcg-CE, however, it started to cause some odd problems when we enabled new VOs, as the configuration seemed perfectly fine for the VO itself, but jobs would bounce off the CREAM CEs with "Failed to get the local userid with glexec" errors.
Obviously, this was trivially solved once we worked out what the issue was (by setting the gridmapdir's group-ownership and permissions to glexec g+w), but identifying it was a little tricky, as the default logging level for LCMAPS doesn't give many clues as to what problem it's having.
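For the record, the fix amounted to something like this (standard path as above; worth checking against what your glexec/LCMAPS setup actually expects):
# give the glexec group write access to the shared gridmapdir
chgrp glexec /etc/grid-security/gridmapdir
chmod 0775 /etc/grid-security/gridmapdir
# should now show drwxrwxr-x root glexec
ls -ld /etc/grid-security/gridmapdir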
Turning the debug level up to 3 (in /opt/glite/etc/glexec.conf) was sufficient to get it to log errors with gridmapdir_newlease(), however, and then, after some poking (and manual creation of DN links to see what happened), the problem became clear.
So, this is a cautionary tale about moving from a mixed CE environment to a monoculture (ignoring Stuart's ARC installation) - sometimes a misconfiguration in one service can be hidden by the correct functioning of the service you're just about to remove.
Wednesday, January 19, 2011
My God; it's full of data-transfers!
The Great ATLAS Spacetoken Migration of 2011 kicked off yesterday evening, and with 47TB of data sitting in MCDISK at Glasgow, Brian and I decided to take the opportunity to see how fast we could push it across to DATADISK.
So, since ATLAS Data Management in this case happens over FTS (even though the vast majority of the transfers are internal to the site), we turned up the number of slots for STAR-GLASGOW a bit, from 20 (our default) to 50 (which was fun) and then up to 80 (although we peaked at around 65 used).
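The channel tweak is done with the gLite FTS admin CLI; something of this form (flag spelling from memory - check glite-transfer-channel-set --help on your FTS node before trusting it):
# inspect the channel, then raise the number of concurrent transfers
glite-transfer-channel-list STAR-GLASGOW
glite-transfer-channel-set -f 80 STAR-GLASGOW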
With effectively no limit from FTS, our data rates were... impressive. Although it's an unfair comparison (everyone else was limited by FTS, and we were mostly moving things over the internal network), we managed to hit a peak transfer rate of 1.5GB/s internally (yes, that's 12Gbit/s), and to sustain around 8Gbit/s. That equated to around two thirds of the total UK data movement over STAR channels, or roughly two thirds of ATLAS's total traffic in this migration. At that rate, none of our disk servers were stressed, and the network switches were intensely relaxed.
Some exciting graphs follow:
Tuesday, December 14, 2010
Clotted CREAM
Last time I blogged, I mentioned a problem with our CREAM CE and too many jobs in the BLAH registry.
Contrary to my initial theory, the alldone_interval problem turned out not to be the culprit; instead it was down to the BLAH registry.
CREAM splits the whole deal with being a Compute Element into two main parts: the interaction with the wider world, which is handled with some Java code using Tomcat; and the direct interaction with the batch system, called BLAH, and written in C and shell script.
The Java code, which I'll refer to as CREAM, as distinct from the BLAH parts, keeps its state in a MySQL database. BLAH, on the other hand, uses a hand-rolled indexed file, with C functions for accessing and writing data.
The BLAH registry is updated by the command blah_job_registry_add after the qsub is complete, to record the mapping between the CREAM job ID and the batch system job ID. This is the step where we ran into problems. The version of CREAM we were running was set to purge jobs after about two months - and in two months we were putting just over half a million jobs through it.
With that many jobs in the registry, it was taking a noticeable time to add any job. Further, the locking effectively serialises access to the registry (i.e. table locking, in RDBMS parlance). Couple that with the ATLAS pilot factory's favourite habit of dumping jobs in batches of 10 to 20 at a time, and you can see how some jobs ended up taking longer than the timeout to register.
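To put rough numbers on that: half a million jobs in two months is about 8,000 a day, or roughly 350 an hour on average. Once each serialised registration takes a few seconds with a registry that size, a burst of 10 to 20 pilots arriving together can easily push the last jobs in the batch past the submission timeout. (Back-of-the-envelope only, but it fits what we saw.)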
Just before we'd encountered this, a new version of CREAM was released (glite-CREAM-3.2.8) that cut the default time before purging to about one month and put the indices in an mmapped file; both should mitigate this problem. We limped along with some workarounds for a bit [0], before doing that update earlier this week. The update from 3.2.7 to 3.2.8 went very quickly, by the way; it took us about 5 minutes, although we did have to manually tidy up /etc/sudoers.
As it stands now, with about a quarter of a million jobs in the registry, it takes a couple of seconds to register a job, with occasional pauses when there are many jobs pending. Thus far it's prevented a recurrence of large numbers of blocked jobs, but I'll be keeping an eye on it.
[0] The other CEs were having hardware issues, and we didn't want to have all the CEs down at once...
Monday, November 29, 2010
ScotGrid Weekend Downtime
Due to an issue with one of the environmental control units relating to our water cooling system we had to take part of the cluster down over the weekend. The issue has now been identified and rectified. Normal service was resumed this afternoon for the entire cluster.
Friday, November 26, 2010
glite-APEL installation
This is my (belated) first post on the Scotgrid blog since I joined Scotgrid Glasgow in August as a System/Data Manager, so hello to everyone.
One thing that we have had planned for a while was to install a glite-APEL publishing server, which I put in place earlier this week. The install process was straightforward following these guides: glite-APEL GOC wiki and Moving APEL to SL5. I found a couple of issues which might be interesting for anyone else installing the service, which I've written up in a wiki page on the ScotGrid wiki: glite-APEL installation notes. One thing in particular to be aware of (which is also mentioned in the links above) is to make sure that keytool is linked to the correct version before running YAIM - see the wiki link for more details on what we found.
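A quick sanity check of the keytool link before running YAIM is something along these lines (a sketch; the alternatives step only applies if keytool is managed that way on your box):
# see which JVM's keytool is actually on the path
which keytool
readlink -f $(which keytool)
# show (and, with --config, change) where the symlink points
alternatives --display keytool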
Although we'll keep an eye on the new server over the next few days to make sure that it is behaving correctly, everything seems to have gone smoothly.
Second CREAM CE for Glasgow: Second Steps
The install of the second CREAM CE has now been completed, after a series of small setbacks surrounding the validity of the software image held on our mirror in Glasgow, which has now been updated.
The commands for the install are available on the ScotGrid Wiki:
http://www.scotgrid.ac.uk/wiki/index.php/Glasgow_GLite_Cream_CE_installation
After verifying that the CE was publishing correctly for the cluster and could accept jobs, it was successfully tested with ATLAS pilot jobs. At that point its status in the GOCDB was changed from an LCG-CE to a CREAM-CE and it has now entered production. We will monitor this new CE over the next couple of weeks to make sure that it is functioning optimally.
Thursday, November 25, 2010
Woops, there go the pool accounts...
We got a ticket on Tuesday because an ATLAS user couldn't get their files back from one disk server which had run out of its 600 (!) ATLAS mapped pool accounts.
I did a bit of a hacky clean up, but actually this is very safe because, unlike a CE, there are no files involved to be mis-inherited by a subsequent user. The only issue would occur at the very moment that a user tried to transfer files.
The clean up removed the oldest mappings, and even the busiest server was down to ~150 mappings and ~450 free slots, so adequate breathing room was gained.
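For the curious, the hacky clean-up was along these lines (illustrative only - the cut-off of 100 is made up, and as noted above this is only safe on a disk server where no files are owned via the mappings):
cd /etc/grid-security/gridmapdir
# DN leases are the URL-encoded '%...' entries; pool-account files are the plain usernames
# drop the 100 least-recently-touched leases, freeing their pool accounts
ls -tr | grep '^%' | head -n 100 | xargs -r rm -f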
Sam is going to think about this in the context of the storage group and write a more general tidier-upper for all VOs.
Stale CREAM and Maui partitioning
Nothing terribly exciting, but we've done a bit of an update on our Maui configuration, and a CREAM problem has been cleared a-whey.
Previously, we've had our different eras of compute nodes annotated with the SPEED attribute, so that the scheduler understands that some are faster. For the vast majority of grid jobs this is an utterly irrelevant distinction - so long as the job gets the time it expects (and Maui scales requests that go to the slower nodes so they get longer).
However, jobs that use more than one process (i.e. MPI jobs) are a different case - if they get scheduled across different classes of nodes, you get sub-optimal resource usage. So it's useful to keep some distinction between them. Previously we'd been using reservations to restrict where jobs can go - but there's an (ill-defined) upper bound on the maximum number of overlapping reservations on a single job slot at once; too many breaks things.
So we had a look at partitions in Maui, which are really the proper way to handle this. The downside is that you're limited to 3 partitions - there's a compiled-in limit of 4, one of which is the special [ALL] partition, and one is DEFAULT. Fortunately, we have 3 eras of kit - and as long as we're happy calling one 'DEFAULT', it all works. And Maui understands never to schedule a job to more than one partition at a time.
So we ended up with a lot of lines like:
NODECFG[node060] SPEED=0.94 PARTITION=cvold
But in order to make them used by all jobs, we had to adjust the default partitions available to include them all:
SYSCFG PLIST=DEFAULT:cvold
which gives all users equal access to all the partitions.
Not terribly exciting Maui tweaks, but sometimes that's the way of it.
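Once that was in, Maui's own diagnostics are handy for checking that the partition assignments took (stock Maui client commands; the node name is just an example, and the exact field names in checknode output vary between versions):
# list the partitions Maui knows about and their usage
diagnose -t
# confirm an individual node picked up its partition and speed
checknode node060 | grep -i -e partition -e speed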
The CREAM problem manifested itself as a set of jobs that have gone sour. There's a known problem in the current versions of CREAM where if the job is queueing on PBS for over an hour, CREAM (rather, the BLAH parser) thinks it's dead, and kills the job, with reason=999.
What's not made explicit is that you need to have no spaces in that! i.e. you must have
alldone_interval=7200
because if you put alldone_interval = 7200, then CREAM doesn't understand it. So we fixed that, and it was all hunky dory for a while. Then we started getting lots of blocked jobs in Torque again; all from CREAM.
Cue more digging.
Eventually found this in the CREAM logs (after a slight reformatting):
JOB CREAM942835851 STATUS CHANGED: PENDING => ABORTED
[failureReason=BLAH error: no jobId in submission script's output
(stdout:) (stderr:/opt/glite/etc/blah.config: line 81: alldone_interval: command not found-
execute_cmd: 200 seconds timeout expired, killing child process. )
So, two things here. Firstly, alldone_interval with spaces had crept back in, along with the correct version - in our case via cfengine directives, probably down to the Double CREAM plan. More interesting was that having the invalid part of the BLAH config present slows down BLAH (BUpdaterPBS was pegging a CPU at 100%), sufficiently that it hits another timeout of 200 seconds to respond at all. And then CREAM kills the job but doesn't actually tell Torque; the sandbox is blown away, so the job can't finish (nowhere to put its output) or, if not yet started, can't start.
Removing the second (wrong) version of the alldone_interval fixed that - CPU use in the parser dropped to trivial levels, and all appears to be happy again. This one's not really CREAM's fault, but it's always good to have an idea of what misconfigured services end up doing, otherwise it's hard to fix those 'not enough coffee' incidents. Hence, this one's for Google...
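Given that the bad line crept back in once via cfengine, a cheap guard is to check for it explicitly - plain shell, nothing CREAM-specific:
# any line with spaces around '=' here is trouble; this should print only alldone_interval=7200
grep -n 'alldone_interval' /opt/glite/etc/blah.config
# and the tell-tale symptom: BUpdaterPBS pegging a CPU
top -b -n 1 | grep -i bupdater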
UPDATE: Oops! Spoke too soon. That's the problem defined, but clearly not the solution - as it's happened again. Gonna leave this here as a reminder to self to give it a bit longer before considering something fixed...
Friday, November 19, 2010
Second CREAM CE for Glasgow: First Steps
We are currently in the process of installing a second CREAM CE at Glasgow. This will replace one of the LCG-CEs at Glasgow. As this is my first major service install since joining ScotGrid and the GridPP project at the end of August, I thought I would share the process for this type of service change with the wider community.
" For multiple CE's with shared queues. Edit the gip file on the CE you wish to drain. This blocks WMS submission:
vim /opt/lcg/libexec/lcg-info-dynamic-pbs change: push @output, "GlueCEStateStatus: $Status\n" to: push @output, "GlueCEStateStatus: Draining\n" "
" on the batch machine: vim /etc/hosts.equiv comment out the machine you wish to stop accepting jobs and restart maui:
svr016:~# cat /etc/hosts.equiv
svr021.gla.scotgrid.ac.uk #svr026.gla.scotgrid.ac.uk "
However, I had not updated the GOCDB to indicate scheduled downtime for this service change; after a GGUS ticket this was quickly rectified. We are now waiting for the jobs to drain from the LCG-CE before continuing with the install early next week.
Tuesday, November 02, 2010
Normal Services Resume
On Friday the 29th of October, the ScotGrid Glasgow site was impacted by two power outages, at 15:25 and 15:40. These power cuts weren't localised to the ScotGrid Glasgow site but also affected other parts of the west end of Glasgow. These outages resulted in the site being placed in unscheduled downtime, as we wanted to ensure that the power feed into the site was stable before returning the site to full production.
On Monday the 1st of November we re-checked all essential core services, boosted our UPS capability and then re-checked all services were functioning correctly prior to the site re-entering full production.
By 17:15 on Monday night we were expecting ATLAS jobs and the site is now back to a normal functioning basis.
Interestingly enough, our new 10 Gig core reacted as planned and rebooted into full operational mode minutes after each outage, and was completely stable over the weekend; the new cluster equipment was also functioning correctly after both outages. In addition, the older cluster equipment was not badly affected by these power losses either.
The site is now getting back to a normal functioning status.
Tuesday, October 19, 2010
CHEP 2010
If things have appeared to be quiet these days, it's mostly because they're anything but! A few staff changes and new hardware have been taking up our attention, along with conference prep.
Which is where I am right now; CHEP 2010 in Taiwan. And since we arrived it's been raining constantly; makes me feel right at home!
In addition to presenting our work with ARC, it's also interesting to see what's going on elsewhere. From the same session that I was speaking in, there was a talk about virtual machine optimisation, which I think will be worth a look when we get back home. It appears that some small tweaks can reduce the overhead, in particular the idle-time CPU consumption. Although we don't do major computation inside the VMs, by using them for services they spend a good portion of their time idle, so tuning that might be a cunning plan for us.
Tuesday, September 14, 2010
EGI Technical Forum 2010
A few of us are in Amsterdam this week attending the EGI Technical Forum. The rather interesting programme really got underway after lunch today so, since there are three of us over here, we spread ourselves out around the many parallel meetings.
Mike attended the "Virtual Research Communities" session (apparently these replace what we currently call "Virtual Organisations") and discovered a wealth of acronyms that he'd never heard of before: DARIAH, NEXPReS, EnviroGRIDS, e-NMR etc. In all there were seven potential VRCs represented, each of which had a ten-minute slot in which to provide a summary of their research field and outline their requirements.
It turns out these seemingly disparate communities have broadly similar needs (authentication, authorization, data management etc) and don't necessarily have (or want to become) computing science experts. Who would've thought it.
You know, what we need is some sort of Integrated Sustainable Pan-European Infrastructure for Researchers in Europe. Oh.
Thursday, August 19, 2010
Why, yes ... we were using that...
So .... remind me never to do a 'nothing much happening' post again. It looks like tempting fate results in Interesting Times.
Our cooling setup in one of the rooms is a bit quirky; it's based on a chilled water system (long story, but it was originally built for cooling a laser before we ended up with it). There have been a few blips with the water supply, so duly an engineer was dispatched to have a poke at it.
The 'poke' in this case involved switching it off until he could delve into the innards of the machine, resulting in the rather exciting peak in temperatures (measured using the on-board thermal sensors in the worker nodes).
We were supposed to get a warning from the building systems when the chiller went offline, and again when the water supply temperature rose too high. (The air temperature lags behind the water temp, so it's a good early warning.) As neither of those happened, our first warning was the air temperature in the room, followed by the nodes' internal sensor alarms.
First course of action was to offline the nodes, and then find the cause of the problem. Once found, there was a short ... Explanation ... of why that was a Bad Time to switch off the chiller. We'll schedule some downtime to get it done later; at some point when we're not loaded with production jobs.
Still, little incidents like this are a good test for the procedures. Everything went pretty smoothly, from offlining nodes to stop them picking up new jobs, through to the defence in depth of multiple layers of monitoring systems.
Thankfully, we didn't need to do anything drastic (like hard powering off a rack); so we now know how long we have from a total failure of cooling until the effects kick in. Time to sit down and do some sums, to make sure we could handle a cooling failure at full load that occurs at 3am...
Never mind "sums", I took the physicist's approach a couple of years ago and got some real data:

Triangles (offset slightly along x-axis for clarity) are the temperatures of worker nodes as reckoned by IPMI; stars are input air temperatures to the three downflow units in room 141 and the squares are flow/return water temperatures. I simulated a total loss of cooling by switching the chilled water pump off; all worker nodes were operating at their maximum nominal load. It took ~20 minutes for the worker node temperatures to reach 40 degrees, at which point I bottled it and restored cooling. So, for good reason, we now run a script that monitors node temperatures, and has the ability to power them off once a temperature threshold is breached. Oh, and that has been tested in anger.
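The script itself is essentially a loop over ipmitool; a minimal sketch of the idea (the hostnames, credentials, sensor output parsing and the 40-degree threshold are all assumptions here - the real script does rather more checking before pulling the plug):
#!/bin/bash
# toy version: power off any worker node whose hottest sensor exceeds the threshold
THRESHOLD=40
for node in node001 node002; do
    # highest temperature reading reported by the node's BMC
    temp=$(ipmitool -I lanplus -H "${node}-ipmi" -U admin -f /root/.ipmipass sdr type Temperature \
           | grep -o '[0-9]\+ degrees' | awk '{print $1}' | sort -n | tail -1)
    if [ -n "$temp" ] && [ "$temp" -gt "$THRESHOLD" ]; then
        echo "$(date): $node at ${temp}C, powering off" >> /var/log/temp-guard.log
        ipmitool -I lanplus -H "${node}-ipmi" -U admin -f /root/.ipmipass chassis power off
    fi
done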
Business as unusual
There's been a lot of little things happening up here; individually none of them quite big enough to blog about.
And after a while, it's worth doing a catch up post about them. This is that post.
David started a couple of weeks ago, and Mark is starting on Monday, just in time for the GridPP meeting. It seems to be a tradition that every time we get new hardware, the staff rotate; Dug and I started around the time of the last hardware upgrade.
The hardware this time is mostly a petabyte of storage to be added, so David's been working on ways of testing the disks before we sign off on them.
GridPP is next week: the usual round of site reports and future planning. With data from the LHC now a routine matter, it's time to start thinking about future needs. I'll be talking about non-(particle)-physicists on the Grid, as a nod towards the longer-term EGI picture.
We noticed some load balancing issues on our SL5 disk pool nodes; Sam's been poking at that, and it looks like there's a mix of issues, from filesystem type (ext4 is better than xfs here), and clustering of files onto nodes.
And that's most of the interesting stuff from up here. Hopefully we'll have more to post about over the next few months
Thursday, July 15, 2010
LHCb transfers redux
Since the last time we mentioned LHCb, we thought we had the problem licked.
Sadly, we were mistaken.
Like a Matryoshka doll, inside the first problem we found lurked another. This one was more widespread, however.
Although we'd fixed the problem of failing jobs, during the course of each job there were a noticeable number of transfer failures. That is, the job first attempted to send the data back to CERN, then if that failed, tried a number of other places until it eventually worked. Notably transfers to PIC always seemed to work fine.
During some other work involving ARC, I ended up tuning the TCP stack parameters on a service node, and noticed that we were using the default parameters on our worker nodes. This led down a rabbit hole, until eventually finding a solution.
The first idea was to tune the worker nodes for transfers to CERN, to see if making the transfers faster meant more of them completed in time (and thus fewer failures). Some tinkering suggested that the values that YAIM puts on a DPM pool node were decent choices, so we slapped them into cfengine, and away we went.
Problem cured.
Surprise.
Working out what was happening took a bit longer, and was down to Rob Fay at Liverpool.
Part of the tuning that YAIM does is to turn off SACK and DSACK. The other parts, about adjusting initial buffer sizes, turned out not to be relevant here. So why was SACK causing problems, and why was YAIM switching it off for the DPM pool nodes?
Well, there's a bug in the Linux conntrack module that thinks SACK packets are invalid, and thus won't forward them. If it's the recipient of the packets it's all fine, but the forwarding code was only fixed in 2.6.26 (2 years ago!); before that it would reject the SACK packets, which caused the connection to eventually revert to conventional ACKs. SL5.3 uses a 2.6.18 kernel.
As to why YAIM turns it off for DPM pool nodes; apparently because that's what YAIM did for CASTOR pool nodes at the time the YAIM module was written. (It doesn't today). This also explains why the transfers to PIC always worked - SACK needs both sides to agree to use it, and PIC uses a DPM (hence no SACK).
So, upshot of all of that is that transferring from worker nodes to a storage element (that's not DPM) going through a NAT will be hit by this bug, crippling performance.
Solutions to this are, in rough order of preference:
1. Always transfer to local storage and stage on from there.
2. Don't use NATs.
3. If you have to transfer to remote storage, and have to use a NAT, turn off SACK and DSACK.
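For option 3, the knobs involved are standard sysctls; this is the sort of thing that ended up on our worker nodes (a sketch of /etc/sysctl.conf additions, applied with sysctl -p - the SACK/DSACK lines are the point, since the buffer tuning mentioned above turned out not to matter here):
net.ipv4.tcp_sack = 0
net.ipv4.tcp_dsack = 0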
Friday, July 09, 2010
WLCG Workshop
ScotGrid-Glasgow was well represented this week at the WLCG Collaboration Workshop, with the Tier-2 coordinator, site admin and data manager in attendance.
Mike gave a talk outlining steps taken at Glasgow and other Tier-2 sites within the UK to provide effective end user support, both in the WLCG context and also for the smaller VOs.
Graeme, wearing his ATLAS hat, presented the WLCG Service from the Experiments' Viewpoint.
Sam took the opportunity to meet with data-management developers and experts, discussing future plans and pledging Glasgow resources in the form of development-class servers.
The event was covered with photos, video and a blog over at GridCast. Speaking of which... we don't usually blow our own trumpet (too loudly) at Glasgow, but when it comes from this man, it's worth shouting about!
Tuesday, June 22, 2010
A baffling spot of localised cooling
How do you keep your cool in this sort of weather? Well, there are various options, but I'll bet one you've not tried is wrapping up in lots of insulating foam.
And yet, that's been just the ticket for some worker nodes up here; despite it being one of the warmer days (23° C outside). Have a look at the temperature graph, and see if you can spot when something changed:
(The peak at midnight was due to a sneak attack Hammercloud; it was just before 12 when I put in the insulation.)
I'd discovered that there's some empty head space at the top of the racks. In those racks where there's a network switch at the top, this wasn't doing much, but where there were worker nodes, the top node was a lot hotter than the node two down from it. That's a lot sharper a change than I'd expected - it was noticeable by touching the metal cover on the front of the nodes. The theory was that hot air out of the back of the nodes was being sucked forward over the top of the highest node (through the headspace), and then recirculated round, getting hotter, until it reached a steady state about 5 K hotter than the others.
So, it was time to do something about that. The first couple of attempts at stopping up the gap didn't have much effect, until I dug out a few bits of packing foam (that the nodes were shipped in). Being, of course, the correct width, and just a bit taller than 1U, they fit snugly into the headspace.
And that foam baffle reduced the temperature, to the point that the nodes at the top of the racks are now at their lowest temperature since records began! (i.e. since they were installed.) Counterintuitive, but that's the way air/heat flow goes sometimes.
Although these worker nodes are due for replacement, we're going to be reusing the racks themselves, so little things like this are good to know. It may be that this won't be a problem with the new worker nodes - or it might be the case that it'd be worse. Either way, forewarned is forearmed (and cooler).
Friday, June 04, 2010
Phew, what a scorcher!
Yes, summer has arrived, even in Glasgow, and with it the people of this great city (myself included) are transformed from a whiter-shade-of-pale, into something that can only be described as lobster-esque. We Celts do not tan well.
Alas, it is not all fun and games in the sunshine, because the arrival of fine weather heralds the inevitable air-conditioning problems.
Despite regular love and attention (serviced 3 times a year and recently hosed through with nitrogen) one of our roof-mounted compressors is particularly troublesome. This is most likely a combination of age (~12 years is the best guess) and 24 x 7 load; it serves the warmest corner of our original machine room.
For this reason, I have a site meeting on Monday to discuss options, one of which will hopefully involve the replacement of said compressor before we take delivery of new hardware later this year.
Also under consideration is a home-brew cold-aisle containment system. This will almost certainly be less sophisticated (and cheaper) than our excellent Knuerr racks in the basement, but should result in more intelligent use of the available chilled air.
Until a solution arises, we shall continue to nurse the existing system through the summer months, and take comfort from the fact that there are clearly worse air-conditioning failures a site can suffer...

Tuesday, May 25, 2010
So long and thanks for all the fish
I would just like to say thanks to everyone who I have worked with at ScotGrid, GridPP and EGEE. I couldn't have picked a better time to be working on grid, LCG and WLCG. I have learned a lot, accomplished most of the things I set out to do and hopefully contributed to the project in some small way. I will always be on the other end of an email should you wish to get in touch. So long and thanks for all the fish.