Tuesday, November 27, 2007

Dem Biomed Blues...


I finally got fed up with the biomed user whose jobs always stall on the cluster, so I banned them and sent in a ticket.

I'm not prepared to be tolerant of crap code when we have hundreds of queued jobs.
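
For the curious, "banning" on a gLite CE of this era can be as simple as dropping the user's DN into the LCAS ban file - a sketch only (stock gLite path, DN obviously invented), not necessarily exactly what I did:

# /opt/glite/etc/lcas/ban_users.db
"/C=XX/O=SomeGrid/OU=biomed/CN=Offending User"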

Monday, November 26, 2007

Health and Efficiency (ATLAS style)

Now that the conversion to the new ATLAS MC production system (panda/pallette/pangea?) is underway, I thought it would be interesting to compare the site's view of efficiency under the new system with that under the old one. First I had to fix up our local accounting database, which was truncating some of the longer username fields we have now (e.g., prdatlasNNN). After doing that, I could easily distinguish between panda pilots and other production activities.
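
In case it's useful to anyone else, the fix plus the split boil down to SQL along these lines. This is a sketch only: the table and column names (jobs, username, cpu_seconds, wall_seconds, end_time) are invented, the real accounting schema will differ, and it assumes pilots arrive under prdatlasNNN pool accounts:

-- widen the column that was truncating prdatlasNNN-style names
ALTER TABLE jobs MODIFY username VARCHAR(64);

-- then split panda pilots out from the rest of production
SELECT
  CASE WHEN username LIKE 'prdatlas%' THEN 'Panda'
       ELSE 'Lexor/Cronus' END          AS system,
  COUNT(*)                              AS Jobs,
  SUM(cpu_seconds) / 3600               AS CPU_Hours,
  SUM(wall_seconds) / 3600              AS Wall_Hours,
  SUM(cpu_seconds) / SUM(wall_seconds)  AS Eff
FROM jobs
WHERE end_time >= '2007-09-01'
GROUP BY system;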

Since we upgraded to SL4 in September (which was just about the time that Rod started toying with panda), the scores are:


Lexor/Cronus
+-------+-----------+------------+---------+
| Jobs  | CPU_Hours | Wall_Hours | Eff     |
+-------+-----------+------------+---------+
| 20047 | 282434.6  | 533904.0   | 0.52899 |
+-------+-----------+------------+---------+

Panda
+-------+------------+------------+------------+
| Jobs  | CPU_Hours  | Wall_Hours | Eff        |
+-------+------------+------------+------------+
| 17746 | 57312.5925 | 59600.919  | 0.96160584 |
+-------+------------+------------+------------+


This is quite a different view of "efficiency" from the VO's view, because here the actual success or failure of each job is masked - we're only looking at wall-time efficiency (CPU hours divided by wall hours) in the batch system. However, the improvement is spectacular, so sites should, I think, be very happy with this change.

Note that the panda figures include all the pilots, even the ones which had no jobs to pick up (production stalled a few times because of dCache problems at RAL and other teething troubles). If one masks these jobs out then the efficiency is even better: 98.1%.

ECDF - nearly there...


Thanks to the efforts of Greig and Sam, ECDF now has storage set up. Not a lot of storage (just 40MB), but it proves the headnode is working and the information system is correctly configured.

This means we are now fully passing SAM tests. Hooray!

Of course, passing SAM tests is only the first step, and there are three outstanding issues which I discovered using an ATLAS production certificate:
  1. I was mapped to a dteam account when I submitted my job (not quite as bad as it sounds - I am obviously in dteam, and this was the default grid-mapfile mapping after LCMAPS had failed; see the sketch after this list).
  2. There's no 32-bit python - this has been passed to Ewan for dealing with (along with the list of other compat 32 RPMs).
  3. There's no outbound http access. This hobbles a lot of things for both ATLAS and LHCb.
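
On point 1: with LCMAPS healthy, the VOMS attributes on the proxy should drive the account mapping, via entries of roughly this shape (pool account names illustrative):

"/atlas/Role=production/Capability=NULL" .prdatlas
"/atlas/Role=production" .prdatlas

The flat grid-mapfile fallback matches only the certificate DN, so when LCMAPS falls over, a DN-level entry like

"/C=UK/O=eScience/OU=Somewhere/CN=some admin" .dteam

(DN invented) wins by default - hence the dteam mapping.
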
It feels like we're in the home straight at last though!

Thursday, November 15, 2007

Maui: MAXPROC vs MAXJOBS

One thing which was always desirable in the batch system was to guarantee a number of job slots for groups, irrespective of their fairshare usage. We actually want to encourage opportunistic usage of resources, and not punish people by then refusing to run any of their jobs for a week.

However, attempts to set a soft MAXPROC limit always seemed to come to grief. Maui would block jobs beyond the soft limit, even though, as far as I could see, it had been told not to. In frustration, I had to set every group's soft limit to the cluster size.

Today, I had a chat with the great maui guru, Steve T, who was also somewhat puzzled by maui's behaviour. He pointed out that he'd only ever set MAXJOBS, not MAXPROC. Well, I thought I would give that a whirl and it works!

So, finally we can have a system which protects some slots for VOs and groups, but allows for full opportunistic use of the cluster for everyone.
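
In maui.cfg terms the setup is something like this sketch (group names and numbers invented for illustration): each group can always run up to its soft limit, and jobs beyond that still start, up to the hard limit, whenever slots would otherwise sit idle.

GROUPCFG[atlas]  MAXJOB=400,800
GROUPCFG[lhcb]   MAXJOB=100,800
GROUPCFG[biomed] MAXJOB=20,800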

Thanks Steve!

Wednesday, November 14, 2007

Maui Madness


Maui has been driving me mad for about three weeks now. When we upgraded the cluster I had forgotten that moving to pooled prd and sgm accounts would mean that these groups were independent of the normal VO fairshare. As our engineers started to become more active, I was unable to get any ATLAS jobs to start at all - particularly atlasprd jobs. As I tried to add fairshare for the new groups, maui started to lose the plot, dropping groups entirely from its fairshare accounting - you can see the effect very clearly in MonAMI's maui plots: groups just evaporate!

Fed up with this, tonight I stopped maui, removed all its databases and restarted it. This, of course, means it has lost its current fairshare calculations, but at least it now has fairshares for the new groups.
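
The reset itself amounts to something like this (a sketch, not a recipe - the paths assume a default /usr/local/maui install, so check where your build keeps its checkpoint and stats files before blowing anything away):

service maui stop
rm -f /usr/local/maui/maui.ck       # checkpoint file
rm -rf /usr/local/maui/stats/*      # historical fairshare stats
service maui start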

I have also re-jigged the fairshare algorithm to have far less decay - users who ran 7-day jobs were at a huge advantage because, by the time their job had finished, its first day of running was weighted by only 0.3, so it almost didn't count!
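
To put numbers on that: with a per-day decay factor of around 0.8, usage from five or six days ago counts for only about 0.8^6 ≈ 0.26 of its face value, which is exactly the effect above. The relevant maui.cfg knobs look like this (values illustrative, not our production settings - pushing FSDECAY towards 1.0 is what flattens the decay out):

FSPOLICY   DEDICATEDPS
FSDEPTH    7
FSINTERVAL 24:00:00
FSDECAY    0.99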

Grids, ScotGrid and GU

I gave a talk to the Distributed IT Group at the University today, entitled Grids, ScotGrid and GU: Computing from here to eternity? It was a general introduction to EGEE grids and contained some specific information on how to get started with the Glasgow cluster.

You can get the talk here.

First Jobs Run at ECDF


Finally, after several months of anguish, SAM jobs are running at ECDF! Note that for the moment they fail replica management tests (there was little point in putting effort into the DPM while the CE was so broken), but at last we're getting output from SAM jobs coming back correctly.

The root cause of this has been the networking arrangements, which were preventing the worker nodes from making arbitrary outbound connections. Last week we managed to arrange with the systems team to open all ports >1024, outbound, from the workers. Then it was a matter of battering down each of the router blocks one by one (painfully, these seemed to take about two days each to disappear).
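
For checking this sort of thing from a worker node, a quick netcat probe does the job (host and port purely illustrative):

nc -z -v -w 5 srm.example.ac.uk 2811

which tells you straight away whether an outbound connection gets through.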

Testing continues, but we're very hopeful that things will now come together quickly.

Monday, November 05, 2007

nagios monitors are Go!

It's been long overdue on the TODO list, but we finally got nagios nrpe installed and configured on the worker nodes. We're now checking for locally logged-in users (only sysadmin staff should be logged in locally), high load, processes, zombies and, most importantly, free disk space.
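
The checks boil down to nrpe.cfg command definitions along these lines - a sketch with thresholds straight from the stock nagios-plugins samples, not our production values:

command[check_users]=/usr/lib/nagios/plugins/check_users -w 1 -c 2
command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_zombie_procs]=/usr/lib/nagios/plugins/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/lib/nagios/plugins/check_procs -w 150 -c 200
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /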

A few pointers that may help others:

1) cfengine splays for 30 minutes. This means that if you enable a check before the plugins have been pushed out to the node, it fills your mailbox.

2) If you normally use

define service{
        use                     generic-service
        hostgroup_name          workernodes
        service_description     check_disk
        check_command           check_nrpe!check_disk
        }

(fleshed out here with the template and command names from the stock nagios sample configs - your check_nrpe definition may differ) then you'll find your testing runs on ALL worker nodes. Use host_name node001 (or the equivalent) rather than hostgroup_name when testing new services.

3) cfengine saves you pushing the same config out manually - and it also has the nice side effect of restarting nrpe (a necessary step) automatically when it realises nrpe.cfg has changed; see the sketch below.
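
For point 3, the relevant cfengine stanza is something like this sketch (variable names and paths illustrative, not our actual config) - the define= class only fires when the copy actually changed the file:

copy:
        $(masterfiles)/nrpe.cfg
                dest=/etc/nagios/nrpe.cfg
                define=nrpe_updated

shellcommands:
        nrpe_updated::
                "/sbin/service nrpe restart"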

Friday, November 02, 2007

M5: The data cometh...

Data from the ATLAS M5 cosmics run started to flow into the UK yesterday. Looks like Glasgow has managed to get all of the subscribed datasets:

M5.0029118.Default.L1TT-b11100100.ESD.v13003010
COMPLETE: BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MWT2_IU,MWT2_UC,NAPOLI,RALDISK
M5.0029118.Default.L1TT-b00001000.ESD.v13003010
COMPLETE: BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MILANO,MWT2_IU,MWT2_UC,RALDISK
M5.0029118.Default.L1TT-b11101000.ESD.v13003010
COMPLETE: BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MWT2_IU,MWT2_UC,NAPOLI,RALDISK
M5.0029120.Default.L1TT-b00000010.ESD.v13003010
COMPLETE: BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MWT2_IU,MWT2_UC,NAPOLI,RALDISK
M5.0029118.Default.L1TT-b00000010.ESD.v13003010
COMPLETE: BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MILANO,MWT2_IU,MWT2_UC,RALDISK
M5.0029118.Default.L1TT-b11101110.ESD.v13003010
COMPLETE: BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MWT2_IU,MWT2_UC,NAPOLI,RALDISK
M5.0029118.Default.L1TT-b00000011.ESD.v13003010
COMPLETE: AGLT2,BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MILANO,MWT2_IU,MWT2_UC,RALDISK
M5.0029118.Default.L1TT-b11101011.ESD.v13003010
COMPLETE: BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MWT2_IU,MWT2_UC,NAPOLI,RALDISK
M5.0029120.Default.L1TT-b00000001.ESD.v13003010
COMPLETE: BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MWT2_IU,MWT2_UC,NAPOLI,RALDISK
M5.0029118.Default.L1TT-b00000110.ESD.v13003010
COMPLETE: BNLDISK,BU_DDM,CNAFDISK,GLASGOW,MILANO,MWT2_IU,MWT2_UC,RALDISK