Tuesday, November 27, 2007

Dem Biomed Blues...


Finally I got fed up with the biomed user whose jobs always stall on the cluster. Banned them and sent in a ticket.

I'm not prepared to tolerate crap code when we have hundreds of queued jobs.

Monday, November 26, 2007

Health and Efficiency (ATLAS style)

Now that the conversion to the new ATLAS MC production system (panda/pallette/pangea?) is underway, I thought it would be interesting to compare the site's view of efficiency in the new system to the old. I had to fix up our local accounting database, which was truncating some of the longer username fields we have now (e.g., prdatlasNNN). After doing that, I could easily distinguish between panda pilots and other production activities.
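
For anyone wanting to do the same split, the fix and the breakdown boil down to something like the sketch below. The database name, schema, column names and the username pattern used to pick out the pilots are all placeholders here, not our real accounting setup:

# Hypothetical sketch only: widen the truncated username column, then
# split production jobs by account name and compute CPU/wall efficiency.
mysql accounting <<'EOF'
ALTER TABLE jobs MODIFY username VARCHAR(32);

SELECT IF(username LIKE 'prdatlas%', 'panda pilots', 'lexor/cronus')
                                        AS prod_system,
       COUNT(*)                         AS Jobs,
       SUM(cpu_secs)  / 3600            AS CPU_Hours,
       SUM(wall_secs) / 3600            AS Wall_Hours,
       SUM(cpu_secs)  / SUM(wall_secs)  AS Eff
FROM   jobs
WHERE  end_time >= '2007-09-01'
GROUP  BY prod_system;
EOF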

Since we upgraded to SL4 in September (which was just about the time that Rod started toying with panda), the scores are:


Lexor/Cronus
+-------+------------+------------+------------+
| Jobs  | CPU_Hours  | Wall_Hours | Eff        |
+-------+------------+------------+------------+
| 20047 | 282434.6   | 533904.0   | 0.52899    |
+-------+------------+------------+------------+

Panda
+-------+------------+------------+------------+
| Jobs  | CPU_Hours  | Wall_Hours | Eff        |
+-------+------------+------------+------------+
| 17746 | 57312.5925 | 59600.919  | 0.96160584 |
+-------+------------+------------+------------+


This is quite a different view of "efficiency" to the VO's view, because here the actual success or failure of the job is masked - we're only looking at wall time efficiency in the batch system. However, the improvement here is spectacular, so sites should, I think, be very happy with this change.

Note that the panda figures include all the pilots, even the ones which had no jobs to pick up (production stalled a few times because of dCache problems at RAL and other teething troubles). If one masks these jobs out then the efficiency is even better: 98.1%.

ECDF - nearly there...


Thanks to the efforts of Greig and Sam, ECDF now has storage set up. Not a lot of storage (just 40MB), but it proves the headnode is working and the information system is correctly configured.

This means we are now fully passing SAM tests. Hooray!

Of course, passing SAM tests is only the first step, and there are three outstanding issues which I have discovered using an ATLAS production certificate:
  1. I was mapped to a dteam account when I submitted my job (not quite as bad as you think - I am obviously in dteam and this was the default grid-mapfile mapping after LCMAPS had failed).
  2. There's no 32-bit Python - this has been passed to Ewan to deal with (along with the list of other 32-bit compat RPMs).
  3. There's no outbound http access. This hobbles a lot of things for both ATLAS and LHCb.
It feels like we're in the home straight at last though!

Thursday, November 15, 2007

Maui: MAXPROC vs MAXJOBS

One thing which was always desirable in the batch system was to guarantee a number of job slots for groups, irrespective of their fairshare usage. We actually want to encourage opportunistic usage of resources, and not punish people by then refusing to run any of their jobs for a week.

However, attempts to set a soft MAXPROC limit always seemed to come to grief. Maui would block jobs beyond the soft limit, even though, as far as I could see, it had been told not to. In frustration I had to set every group's soft limit to the cluster size.

Today, I had a chat with the great maui guru, Steve T, who was also somewhat puzzled by maui's behaviour. He pointed out that he'd only ever set MAXJOBS, not MAXPROC. Well, I thought I would give that a whirl and it works!

So, finally we can have a system which protects some slots for VOs and groups, but allows for full opportunistic use of the cluster for everyone.
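
For anyone fighting the same fight, the relevant maui.cfg lines now look something like the sketch below. The group names, fairshare targets and limits are illustrative rather than our production values, and I've written the limit as MAXJOB (the spelling in the maui docs). As I understand the semantics, the first figure is the soft limit applied under normal scheduling and the second is the hard cap, which only comes into play when slots would otherwise sit idle:

# maui.cfg sketch - per-group soft,hard job limits (numbers illustrative)
GROUPCFG[atlas]  FSTARGET=40  MAXJOB=100,250
GROUPCFG[lhcb]   FSTARGET=20  MAXJOB=50,250
GROUPCFG[biomed] FSTARGET=1   MAXJOB=10,250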

Thanks Steve!

Wednesday, November 14, 2007

Maui Madness


Maui has been driving me mad for about 3 weeks now. When we upgraded the cluster I had forgotten that moving to pooled prd and sgm accounts would mean that these groups were independent of the normal VO fairshare. As our engineers started to become more active I was unable to get any ATLAS jobs to start at all - particularly atlasprd jobs. As I tried to add fairshare for the new groups, maui started to lose the plot, dropping groups entirely from its fairshare records - you can see the effect very clearly in MonAMI's maui plots: groups just evaporate!

Fed up with this, tonight, I stopped maui, removed all its databases and restarted it. This, of course, means it's lost its current fairshare calculations, but at least it now has fairshares for the new groups.

I have also re-jigged the fairshare algorithm to have far less of a decay on it - users who ran 7-day jobs were at a huge advantage, because by the time their job had finished, its first day of running was weighted by 0.3, so it almost didn't count!
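
The knobs in question, in case anyone else wants to tune theirs, are the fairshare window parameters in maui.cfg; something along these lines (the numbers are illustrative, not a recommendation):

# maui.cfg sketch: fairshare is computed over FSDEPTH windows of
# FSINTERVAL each, with older windows weighted by successive powers of
# FSDECAY - so an aggressive decay means a day-old window barely counts.
FSPOLICY    DEDICATEDPS
FSDEPTH     7
FSINTERVAL  24:00:00
FSDECAY     0.8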

Grids, ScotGrid and GU

I gave a talk to the Distributed IT Group at the University today, entitled Grids, ScotGrid and GU: Computing from here to eternity? It was a general introduction to EGEE grids and contained some specific information on how to get started with the Glasgow cluster.

You can get the talk here.

First Jobs Run at ECDF


Finally, after several months of anguish, SAM jobs are running at ECDF! Note that for the moment they fail replica management tests (there was little point in putting effort into the DPM while the CE was so broken), but at last we're getting output from SAM jobs coming back correctly.

The root cause of this was the networking arrangements, which prevented the worker nodes from making arbitrary outbound connections. Last week we managed to arrange with the systems team to open all ports >1024, outbound, from the workers. Then it was a matter of battering down each of the router blocks one by one (painfully, these seemed to take about two days each to disappear).

Testing will now continue, but we're very hopeful that things will now come together quickly.

Monday, November 05, 2007

nagios monitors are Go!

It's been long overdue on the TODO list, but we finally got nagios NRPE installed and configured on the worker nodes. We're now checking for locally logged-in users (it should only ever be sysadmin staff), high load, process counts, zombies and, most importantly, free disk space.
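
The checks themselves are just the stock nagios plugins wired up in nrpe.cfg on each node; something like the following, though the plugin paths and thresholds here are illustrative and will vary with your install:

# nrpe.cfg sketch: command definitions for the worker node checks
command[check_users]=/usr/lib/nagios/plugins/check_users -w 1 -c 2
command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,12,10 -c 30,25,20
command[check_total_procs]=/usr/lib/nagios/plugins/check_procs -w 250 -c 400
command[check_zombie_procs]=/usr/lib/nagios/plugins/check_procs -w 5 -c 10 -s Z
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -p /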

A few pointers that may help others:
  1. cfengine splays for 30 minutes, which means that if you enable a check before the plugins have been pushed out to the node, it fills your mailbox with alerts.
  2. If you normally use

define service{
        hostgroup_name  workernodes
        ...
}

then you'll find your testing runs on ALL the worker nodes. Use host_name node001 (or equivalent) while testing new services - see the sketch after this list.
  3. cfengine saves you pushing the same config out manually - and it also has the nice side effect of restarting nrpe (a necessary step) automatically when it realises nrpe.cfg has changed.
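
For completeness, a new check scoped to a single node while testing looks roughly like this on the nagios server side. The generic-service template, node001 and the check_nrpe command are stand-ins for whatever your own setup defines:

# sketch only: test a new NRPE check on one node before rolling it out
define service{
        use                     generic-service   ; assumed local template
        host_name               node001           ; single test node
        service_description     Disk free
        check_command           check_nrpe!check_disk
}
# once it behaves, swap host_name for: hostgroup_name workernodes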

Friday, November 02, 2007

M5: The data cometh...

Data from the ATLAS M5 cosmics run started to flow into the UK yesterday. Looks like Glasgow has managed to get all of the subscribed datasets:

M5.0029118.Default.L1TT-b11100100.ESD.v13003010
COMPLETE: BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MWT2_IU,MWT2_UC,NAPOLI,RALDISK
M5.0029118.Default.L1TT-b00001000.ESD.v13003010
COMPLETE: BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MILANO,MWT2_IU,MWT2_UC,RALDISK
M5.0029118.Default.L1TT-b11101000.ESD.v13003010
COMPLETE: BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MWT2_IU,MWT2_UC,NAPOLI,RALDISK
M5.0029120.Default.L1TT-b00000010.ESD.v13003010
COMPLETE: BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MWT2_IU,MWT2_UC,NAPOLI,RALDISK
M5.0029118.Default.L1TT-b00000010.ESD.v13003010
COMPLETE: BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MILANO,MWT2_IU,MWT2_UC,RALDISK
M5.0029118.Default.L1TT-b11101110.ESD.v13003010
COMPLETE: BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MWT2_IU,MWT2_UC,NAPOLI,RALDISK
M5.0029118.Default.L1TT-b00000011.ESD.v13003010
COMPLETE: AGLT2,BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MILANO,MWT2_IU,MWT2_UC,RALDISK
M5.0029118.Default.L1TT-b11101011.ESD.v13003010
COMPLETE: BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MWT2_IU,MWT2_UC,NAPOLI,RALDISK
M5.0029120.Default.L1TT-b00000001.ESD.v13003010
COMPLETE: BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MWT2_IU,MWT2_UC,NAPOLI,RALDISK
M5.0029118.Default.L1TT-b00000110.ESD.v13003010
COMPLETE: BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MILANO,MWT2_IU,MWT2_UC,RALDISK

Tuesday, October 30, 2007

Biomed Abuse

Alerted by Dave Colling, we found a biomed user who was running about 100 jobs on the cluster, all trying to factorise a 768-bit number (and win $50,000 in the process).

Clearly this is abuse of our resources and has nothing to do with biomed. They have consumed more than 80,000 normalised CPU hours since September. I'm sure the operational cost of this runs to several thousand pounds (should we bill them?).

It was all the more irritating as we had a stack of ATLAS and local users' jobs to run, but the biomed jobs were set to download subjobs from the user's job queue (they were effectively limited pilots) and so they ran right up to the wallclock limit of 36 hours.

I banned the user, deleted their jobs and sent a very angry GGUS ticket.

As a slight aside, one notes the efficiency of pilot job systems at hoovering up spare job slots and consuming resources on the cluster well in excess of the nominal 1% fair share we give to biomed.

NGS VO Supported

I should now have fixed the NGS VO on ScotGrid. This was supposed to happen during the SL4 transition, but got lost in the rush to get the site up and running correctly.

LCMAPS now knows to map certificates from the ngs.ac.uk VO to .ngs pool accounts, and the fallback grid-mapfile is being correctly made.
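
For reference, the mapping boils down to a couple of entries in the gLite voms-grid-mapfile/groupmapfile pair, plus an edg-mkgridmap line for the fallback grid-mapfile. Roughly as below, though treat the FQAN patterns and especially the VOMS server URL as assumptions rather than a copy of our config:

# /etc/grid-security/voms-grid-mapfile: ngs.ac.uk FQANs -> .ngs pool accounts
"/ngs.ac.uk/*" .ngs
# /etc/grid-security/groupmapfile: primary group for the VO
"/ngs.ac.uk/*" ngs
# edg-mkgridmap.conf entry for the fallback grid-mapfile (VOMS host is a guess)
group vomss://voms.ngs.ac.uk:8443/voms/ngs.ac.uk .ngs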

What remains to be sorted out is whether we can support NGS without giving shell access. I am loath to turn on pool-mapped gsissh, as we'd have to hack the grid-mapfile on the UI (to exclude other VOs) and somehow manage to sync the pool account mappings between the UI and the CE.

At the moment I'm sure it will be broken, because we don't even have shared home areas, but even if we introduce these I'm not certain it will be a practical proposition to offer only gsiftp.

Some Queue Work

I introduced a new long queue last night, primarily to support long validation jobs run by some of our Durham phenomenology friends. It has a 7-day CPU/wall limit. As I was messing about with the transition to queues which support more than one VO, I added support for ATLAS on the queue, thinking it might be of use to our local ATLAS users.
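
For the record, setting up a queue like this under torque is only a handful of qmgr commands; a sketch along these lines, where the queue name, limits and ACL groups are illustrative rather than our exact config:

# create a 7-day queue restricted to particular unix groups
qmgr -c "create queue long queue_type=execution"
qmgr -c "set queue long resources_max.cput = 168:00:00"
qmgr -c "set queue long resources_max.walltime = 168:00:00"
qmgr -c "set queue long acl_group_enable = true"
qmgr -c "set queue long acl_groups = pheno"
qmgr -c "set queue long acl_groups += atlas"
qmgr -c "set queue long enabled = true"
qmgr -c "set queue long started = true"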

Unfortunately, as it was advertised in the information system and was a perfectly valid and matchable queue, we soon got production and user ATLAS jobs coming in on this long queue. So, tonight, I have stopped advertising it for ATLAS, to push things back to the normal grid queue with its 36-hour limit.

I also discovered that maui does its "group" fairsharing on the submitting user's primary group. This is pretty obvious, of course, but as we've traditionally had one queue per VO, with everyone in that VO in a primary group of the same name, somehow I had it muddled in my mind that it was a queue balancing act instead. It turns out this is in fact good, because we can set different fair shares for users sharing the gridpp queue, as long as we keep them in per-project primary groups.

Thursday, October 25, 2007

... and we're back


OK - Estates and Buildings found enough spare change to last us through another couple of hours. One worker node casualty (so far), but otherwise things seem to be OK. Will monitor for sanity, then remove us from unscheduled downtime.

Glasgow Down - Power Outage


We lost power in the Kelvin Building this morning just before 9am. News is:

Power outage
25/10/07 08:49 BST
We have been informed by Estates and Buildings that there is a power
outage in the Kelvin Building among others. The Kelvin buildings
router provides network services to other buildings and so there will
be no network service in the following buildings:
Kelvin, Davidson, West-medical, Pontecorvo and Anderson colleges,
Joseph-Black, Bower, Estates, Garage, Zoology and shares the medical
school with Boyd-Orr.
Scottish power are working on the problem. Further information as it
becomes available.

We've sent an EGEE broadcast and put in an unscheduled downtime.

Really annoying - we were running along very nicely when this happened (although we'd had to shut down 20 WNs because of the loss of an a/c unit on Tuesday).

Glasgow site missing, presumed AWOL

This morning uki-scotgrid-glasgow and other systems in the same building are offline. We're not sure of the cause yet and will investigate further once someone's on site.

UPDATE - 09:08 - Confirmed as a power cut affecting the building (see http://www.gla.ac.uk/services/it/helpdesk/#d.en.9346). No ETA yet for restoring services. EGEE broadcast sent.

Wednesday, October 24, 2007

News from ATLAS Computing



There is some significant news for sites from ATLAS software week. The decision was taken yesterday to move all ATLAS MC production to a pilot job system, with the pilots based on the Panda system developed by US ATLAS. The new system will get a new name; pallete and pallas are the front runners. (I like pallas myself.)

In addition, DDM will standardise on EGEE components, such as the LFC.

As this is a pilot job system, the anticipated model is that ATLAS production will keep a steady stream of pilots running at ATLAS T2 sites, which pull real job payloads from the central queue.

This is very like the LHCb MC production model, so as the transition is made to this system, sites should start to see much better usage of their resources by ATLAS - just as LHCb are able to scavenge resources from all over the grid.

Canada have recently shifted to Panda production, and greatly increased their ATLAS workrate as a result. There have been some trials of the system in the UK and France, which were also very encouraging.

Of course, there is a very large difference with LHCb, because ATLAS don't just do simulation at T2s, but also digitisation and reconstruction. These steps require input files, and the panda-based system deals with this by ensuring that the relevant input dataset is staged on your local SE by the ATLAS distributed data management system (DDM); further, the output dataset will be created on your local SE, and DDM will ship it up to the Tier-1 after the jobs have run.

This means that sites will really need a working storage system for ATLAS work from now on (in the previous EGEE production model, any of the SEs in your cloud could be used, which masked a lot of site problems but caused us huge data management headaches).

In the end using pilots had two compelling advantages. Firstly, the sanity of the environment can be checked before a job is actually run, which means that panda gets 90% job efficiency (the other EGEE executors struggled to reach 70%). Secondly, and this is the clincher, it means that we can prioritise tasks within ATLAS, which is impossible to do otherwise.

At the moment the push will be to get ATLAS production moved to this new system - probably on a cloud-by-cloud basis. This should not cause sites headaches, as production is a centralised activity (most sites still have a single atlasprd account anyway). However, the pilots can also run user analysis jobs - and this will require glexec functionality. Alessandra and I stressed to Kors that this must be supported in glexec's non-suid mode.

In the UK ATLAS community we now need to get our DQ2 VOBox working properly - dataset subscriptions in the UK are just much too slow right now.

Postscript: At lunchtime I met Joel, who wanted me to namecheck LHCb's DIRAC system, as panda is based on DIRAC - well, I didn't know that, but I suppose I'll learn a lot more about the internals of these things in the next few months.

Monday, October 22, 2007

EGEE: It's broken but it works...

I had meant to blog this a week ago: I did my first shift for ATLAS EGEE production. It was really a case of being thrown in at the deep end, as the twiki instructions were, well, spartan, to say the least. (And definitely misleading in places, actually.)

I was on Data Management, which meant concentrating on problems of data stage-in and stage-out, as well as trying to pick up sites which had broken tool sets.

It felt like a bit of a wild ride - there are almost always problems of some kind, and part of the art is clearly sorting the completely urgent, must-be-dealt-with-now problems from the simply urgent, down to the deal-with-this-in-a-quiet-moment ones.

I found problems at T1s (stage-in failures, overloaded dCaches, flaky LFCs) and at T2s (SEs down, quite a few broken lcg-utils installs, some sites just generically not working but giving very strange errors). I raised a large number of GGUS tickets, but sometimes it's very difficult to know what the underlying problem is, and it's very time-consuming batting a ticket back and forth with the site.

It's a very different experience from being on the site side. Instead of a "deep" view of a few sites you have a "shallow" view of almost all of them. If you want to read my round-up of issues through the week, it's on indico (it's the DDM shifter report).

Disk Servers and Power Outages

Your starter for 10TB: how long does it take to do an fsck on 10TB of disk? Answer: about 2 hours.

Which in theory was fine - in at 10am, out of downtime by 12. However, things didn't go quite according to plan.

The first problem was that disk038-41, which had been set up most recently, had the weird disk label problem, where the labels had been created with a "1" appended to them. Of course, this wouldn't matter, except that we'd told cfengine to control fstab based on the older servers (whose labels have no such suffix), so the new systems could not find their partitions and sat awaiting root intervention. That put those servers back by about an hour.

Secondly, some of the servers were under the impression that they had not checked their disks for ~37 years (mke2fs at the start of the epoch?), so they had to be coaxed into doing so by hand, which was another minor hold-up.

Third, I had decided to convert all the file systems to ext3, to protect them in the case of power outages. It turns out that making a journal for a 2TB filesystem actually takes about 5 minutes - so 25 minutes for your set of 5.
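
The conversion, and the check-interval settings mentioned in the notes below, are just tune2fs; per data partition it's something like this (the device name is an example):

tune2fs -j /dev/sdb1              # add an ext3 journal to the existing ext2 fs
tune2fs -c 10 -i 180d /dev/sdb1   # check at most every 10 mounts or 180 days
# ...then make sure fstab says ext3 and names the device explicitly,
# rather than relying on disk labels.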

And, last but not least, the machines needed new kernels, so had to go through a final reboot cycle before they were ready.

The upshot was that we were 15 minutes late with the batch system (disk037 was already running ext3, fortunately), but an hour and 15 minutes late with the SRM. I almost did an EGEE broadcast, but in the end, who's listening Sunday lunchtime? It would just have been more mailbox noise for Monday, and irrelevant by that time anyway.

As the SRMs fill up, of course, disk checks will take longer, so next time I would probably allow 4 hours for fscking, if it's likely.

A few other details:

* The fstab for the disk servers now names partitions explicitly, so no dependence on disk labels.
* The BDII going down hit Durham and Edinburgh with RM failures. Ouch! We should have seen that one coming.
* All the large (now ext3) partitions have been set to check every 10 mounts or 180 days. The older servers actually had ~360 days of uptime, so if this is an annual event then doing an fsck once a year should be OK.

Sunday, October 21, 2007

Glasgow power outage

Ho hum, it's that time of year for HV switchgear checking. A rolling programme of work across campus meant that the building housing the Glasgow cluster was due for an outage at ungodly-o'clock in the morning. We arranged the outage in advance and booked scheduled downtime. All OK. Then, after G had taken some well deserved hols, I discovered how dreadful the cic-portal is for sending an EGEE broadcast. I want to tell users of the site that it's going down. Any chance of this in English? RC management? What is an 'RC'? It doesn't explain. Then who should I notify? Again, no simple descriptions... Grr. Rant over.

OK - the system went down cleanly enough (pdsh -a poweroff or similar) - but bringing it back up? Hmm. First off, the LV switchboard needed resetting manually, so the UPS now has a flat battery. Then one of the PDUs decided to restore all its sockets to 'On' without waiting to be told (so all the servers leapt into life before the disks were ready). Then the disk servers decided they needed to fsck (it'd been a year since the last one) - slooooow. Oh, and the disk labels on the system disks were screwed up (/1 and /tmp1 rather than / and /tmp, for example) - another manual workaround needed.

Finally we were ready to bring the worker nodes back - just on the 12:00 deadline. I left Graeme still hard at it, but there are a few things we'll need to pull out in a post-mortem. I'm sure Graeme will blog some more.

Wednesday, October 10, 2007

svr016 (CE) sick

Our CE appears to have fallen over this morning at 7am. It's not responding to SSH (well, with a load like that I'm not surprised) - it'll need poking as soon as someone's on site.

UPDATE - 10:00. Unresponsive to even the 'special' keyboard we keep for such events. Needed the big red button pressing. Seems to have come back OK. Monitoring for fallout.