Monday, November 29, 2010

ScotGrid Weekend Downtime

Due to an issue with one of the environmental control units relating to our water cooling system, we had to take part of the cluster down over the weekend. The issue has now been identified and rectified, and normal service was resumed this afternoon for the entire cluster.

Friday, November 26, 2010

glite-APEL installation

This is my (belated) first post on the Scotgrid blog since I joined Scotgrid Glasgow in August as a System/Data Manager, so hello to everyone.

One thing that we have had planned for a while was to install a glite-APEL publishing server, which I put in place earlier this week. The install process was straightforward following these guides: glite-APEL GOC wiki and Moving APEL to SL5. I found a couple of issues which might be interesting for anyone else installing the service, which I've written up in a wiki page on the Scotgrid wiki: glite-APEL installation notes. One thing in particular to be aware of (which is also mentioned in the other links above) is to make sure that keytool is linked to the correct Java version before running YAIM - see the wiki link for more details on what we found.
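
For anyone doing the same, a quick check along these lines before running YAIM can save a reconfigure (a rough sketch only: the JDK path is an assumption, and the wiki page above has the exact steps we used):

    readlink -f "$(which keytool)"    # should resolve to the Java version you intend YAIM to use
    # if it resolves to the wrong place, repoint the link before running YAIM, e.g.
    # ln -sf /usr/java/latest/bin/keytool /usr/bin/keytool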

Although we'll keep an eye on the new server over the next few days to make sure that it is behaving correctly, everything seems to have gone smoothly.

Second Cream CE for Glasgow - Second Steps

The install of the second Cream CE has now been completed after a series of small setbacks surrounding the validity of the software image held on our mirror in Glasgow, which has now been updated.

The commands for the install are available on the ScotGrid Wiki:
http://www.scotgrid.ac.uk/wiki/index.php/Glasgow_GLite_Cream_CE_installation

After testing that the CE was publishing correctly for the cluster and could accept jobs, it was successfully tested with ATLAS pilot jobs. At that point its status in the GOCDB was changed from an LCG-CE to a Cream-CE and it has now entered production. We will monitor this new CE over the next couple of weeks to make sure that it is functioning optimally.
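
For the record, the sort of direct-submission smoke test involved looks roughly like this: a minimal JDL saved as hello.jdl,

    [
      Executable    = "/bin/hostname";
      StdOutput     = "out.txt";
      StdError      = "err.txt";
      OutputSandbox = { "out.txt", "err.txt" };
    ]

then, from a UI with a valid VOMS proxy (the CE hostname and queue below are placeholders rather than our real service names):

    glite-ce-job-submit -a -r <cream-ce-host>:8443/cream-pbs-<queue> hello.jdl
    glite-ce-job-status '<returned jobid>'    # poll until it reaches DONE-OK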

Thursday, November 25, 2010

Woops, there go the pool accounts...

We got a ticket on Tuesday because an ATLAS user couldn't get their files back from one disk server which had run out of its 600 (!) ATLAS mapped pool accounts.

I did a bit of a hacky clean up, but actually this is very safe because, unlike a CE, there are no files involved to be mis-inherited by a subsequent user. The only issue would occur at the very moment that a user tried to transfer files.

The clean up removed the oldest mappings; afterwards even the busiest server was down to ~150 mappings with ~450 free slots, so adequate breathing room was gained.
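
For the curious, the hacky clean up amounted to something along these lines (a rough sketch, not the exact commands: it assumes the standard gridmapdir layout, in which each leased DN is a hard link to its pool account file, so unlinking the DN entry frees the account):

    cd /etc/grid-security/gridmapdir
    # URL-encoded DN entries start with '%'; list them oldest first and drop the
    # stalest ones (the count was picked by hand after checking what was still active)
    ls -tr | grep '^%' | head -n 100 | xargs -r rm -f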

Sam is going to think about this in the context of the storage group and write a more general tidier-upper for all VOs.

Stale CREAM and Maui partitioning

Nothing terribly exciting, but we've done a bit of an update to our Maui configuration, and a CREAM problem has been cleared a-whey.

Previously, we've had our different eras of compute nodes annotated with Maui's SPEED keyword, so that the scheduler understands that some nodes are faster than others. For the vast majority of grid jobs this is an utterly irrelevant distinction, so long as the job gets the time it expects (and Maui scales requests that go to the slower nodes so they get proportionally longer).

However, jobs that use more than one process (i.e. MPI jobs) are a different case - if a job gets scheduled across different classes of nodes, you get sub-optimal resource usage. So it's useful to keep some distinction between the node classes. Previously we'd been using reservations to restrict where jobs can go, but there's an (ill-defined) upper bound on the number of overlapping reservations a single job slot can carry at once; too many breaks things.

So we had a look at partitions in Maui, which are really the proper way to handle this. The downside is that you're limited to 3 usable partitions - there's a compiled-in limit of 4, one of which is the special [ALL] partition, and one is DEFAULT. Fortunately, we have 3 eras of kit, and as long as we're happy calling one of them 'DEFAULT', it all works. And Maui understands never to schedule a job across more than one partition at a time.

So we ended up with a lot of lines like:
    NODECFG[node060] SPEED=0.94 PARTITION=cvold
But in order to make these partitions usable by all jobs, we had to adjust the default partition list to include them all:
    SYSCFG PLIST=DEFAULT:cvold
which gives all users equal access to all the partitions.
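
Putting it together, the overall pattern in maui.cfg looks something like this (the second partition name, the extra node names and their SPEED values are illustrative, not our real config):

    # oldest era: scaled down and fenced into its own partition
    NODECFG[node060] SPEED=0.94 PARTITION=cvold
    # middle era: stays in DEFAULT, so no PARTITION keyword
    NODECFG[node150] SPEED=1.0
    # newest era: its own partition
    NODECFG[node250] SPEED=1.2 PARTITION=cvnew
    # let every job be considered for all three partitions
    SYSCFG PLIST=DEFAULT:cvold:cvnew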

Not terribly exciting Maui tweaks, but sometimes that's the way of it.


The CREAM problem manifested itself as a set of jobs that had gone sour. There's a known problem in the current versions of CREAM where, if a job sits queueing in PBS for over an hour, CREAM (or rather, the BLAH parser) decides it's dead and kills it with reason=999. The workaround is to raise the alldone_interval setting in /opt/glite/etc/blah.config; we set it to 7200 seconds.

What's not made explicit is that the line must contain no spaces! i.e. you must have

    alldone_interval=7200

because if you put alldone_interval = 7200 (with spaces), CREAM doesn't understand it. So we fixed that, and all was hunky-dory for a while. Then we started getting lots of blocked jobs in Torque again, all from CREAM.

Cue more digging.

Eventually found this in the CREAM logs (after a slight reformatting):

JOB CREAM942835851 STATUS CHANGED: PENDING => ABORTED
[failureReason=BLAH error: no jobId in submission script's output
(stdout:) (stderr:/opt/glite/etc/blah.config: line 81: alldone_interval: command not found-
execute_cmd: 200 seconds timeout expired, killing child process. )

So, two things here. Firstly, the alldone_interval line with spaces had crept back in alongside the correct version - in our case via cfengine directives, probably down to the Double CREAM plan. More interestingly, having the invalid line present in the BLAH config slows BLAH down (BUpdaterPBS was pegging a CPU at 100%), sufficiently that it hits another timeout - 200 seconds to respond at all. CREAM then kills the job but doesn't actually tell Torque; the sandbox is blown away, so the job can't finish (nowhere to put its output) or, if it hasn't started, can't start.
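
The 'command not found' part is the giveaway: blah.config gets sourced as a shell script, so an assignment written with spaces is parsed as an attempt to run a command called alldone_interval. A purely illustrative two-liner reproduces it:

    echo 'alldone_interval = 7200' > /tmp/blah-demo.config
    bash -c '. /tmp/blah-demo.config'
    # -> "alldone_interval: command not found", much like the message in the log above;
    #    'alldone_interval=7200' (no spaces) sources cleanly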

Removing the second (wrong) version of alldone_interval fixed that - CPU use in the parser dropped to trivial levels, and all appears to be happy again. This one's not really CREAM's fault, but it's always good to have an idea of what misconfigured services end up doing, otherwise it's hard to fix those 'not enough coffee' incidents. Hence, this one's for Google...

UPDATE: Oops! Spoke too soon. That's the problem defined, but clearly not the solution - as it's happened again. Gonna leave this here as a reminder to self to give it a bit longer before considering something fixed...

Friday, November 19, 2010

Second Cream CE for Glasgow - First Steps

We are currently in the process of installing a second Cream CE at Glasgow, which will replace one of the LCG CEs. As this is my first major service install since joining ScotGrid and the GridPP project at the end of August, I thought I would share the process for this type of service change with the wider community.


The first step was to drain the current LCG-CE to prepare it for the new install; the commands are shown below.

" For multiple CE's with shared queues. Edit the gip file on the CE you wish to drain. This blocks WMS submission: 

vim /opt/lcg/libexec/lcg-info-dynamic-pbs

change: push @output, "GlueCEStateStatus: $Status\n"
to: push @output, "GlueCEStateStatus: Draining\n" "


" on the batch machine: vim /etc/hosts.equiv comment out the machine you wish to stop accepting jobs and restart maui: 

svr016:~# cat /etc/hosts.equiv
svr021.gla.scotgrid.ac.uk
#svr026.gla.scotgrid.ac.uk "
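
Once both changes are in, it's worth confirming that the information system agrees before treating the CE as draining. Something along these lines should show GlueCEStateStatus: Draining (the hostname is the CE from the hosts.equiv example above; the port and base DN are the usual resource-BDII defaults):

    ldapsearch -x -h svr026.gla.scotgrid.ac.uk -p 2170 -b o=grid \
        '(objectClass=GlueCE)' GlueCEUniqueID GlueCEStateStatus GlueCEStateRunningJobs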
 
However, I had not updated the GOCDB to indicate scheduled downtime for this service change; after a GGUS ticket this was quickly rectified. We are now waiting for the jobs to drain from the LCG-CE before continuing with the install early next week.

Tuesday, November 02, 2010

Normal Services Resume

On Friday the 29th of October the ScotGrid Glasgow site was hit by two power outages, at 15:25 and 15:40. These power cuts weren't localised to the ScotGrid Glasgow site but also affected other parts of the west end of Glasgow. The outages resulted in the site being placed in unscheduled downtime, as we wanted to ensure that the power feed into the site was stable before returning it to full production.

On Monday the 1st of November we re-checked all essential core services, boosted our UPS capability and then confirmed that all services were functioning correctly prior to the site re-entering full production. By 17:15 on Monday night we were expecting ATLAS jobs once again.

Interestingly enough, our new 10 Gig core reacted as planned: it rebooted into full operational mode minutes after each outage and was completely stable over the weekend. The new cluster equipment was also functioning correctly after both outages, and the older cluster equipment was not badly affected by the power losses either.

The site is now back to normal operation.