Tuesday, June 26, 2007

RB Twiddles


Steve Lloyd's started to monitor our RB. So far things look good! I also discovered that SAM had picked up the RB and was at least checking its host certificate. I have now added the RB to the GOC and requested monitoring - another pair of eyes to pick up problems.

As we are getting rather more officially known, I've reduced the number of advertised VOs on the Glasgow RB to ops, dteam (for testing), atlas, pheno and gridpp. There's a slightly different site-info.def used for the RB to do this (site-info-rb.def) and I changed cfengine to run against this file if the node needs rebuilt.

Even though the RB has only been up for 3 months, the MySQL database is already 270MB. Although this is well below the 4GB level at which Olivier has reported problems, we will have to keep monitoring it to make sure we don't run into stability trouble.
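As a first pass at that monitoring, something as dumb as the sketch below would do - it just totals up the on-disk size of the MySQL data directory. Treat the /var/lib/mysql path as an assumption (it's the usual default datadir, not something checked on our box):

import os

# walk the default MySQL datadir and total up the file sizes (path assumed)
total = 0
for dirpath, dirnames, filenames in os.walk('/var/lib/mysql'):
    for f in filenames:
        total += os.path.getsize(os.path.join(dirpath, f))

print 'MySQL data directory: %.1f MB' % (total / (1024.0 * 1024.0))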

At the moment, though, the machine is very lightly loaded.

Monday, June 25, 2007

Tails and Spikes


Tony and I have been trying to draft a policy on killing off jobs which just fail to start properly, so I pulled some stats out of our local accounting MySQL database and plotted a histogram of job efficiencies. This is a very interesting plot - a clear "decay" down from high efficiency into a long tail, then a significant spike of very low efficiency jobs (< 0.02).
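For the record, the histogram came from nothing more sophisticated than the sort of thing sketched below. The database, table and column names (accounting, jobs, cput, walltime, both in seconds) are hypothetical stand-ins - the real schema isn't reproduced here:

import MySQLdb

# connection details and schema are illustrative only
conn = MySQLdb.connect(host='localhost', user='reader', passwd='secret', db='accounting')
cur = conn.cursor()
cur.execute("SELECT cput, walltime FROM jobs WHERE walltime > 0")

# bin efficiencies (cpu time / wall time) into 50 bins of width 0.02
bins = [0] * 50
for cput, walltime in cur.fetchall():
    eff = min(float(cput) / float(walltime), 0.9999)
    bins[int(eff / 0.02)] += 1

# crude text histogram, one '#' per 20 jobs
for i, count in enumerate(bins):
    print '%.2f-%.2f %6d %s' % (i * 0.02, (i + 1) * 0.02, count, '#' * (count / 20))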

Jeremy said that Dario was quite sanguine about killing these sorts of jobs off - things which fail to consume CPU after 6 hours are probably never going to get anywhere.

However, it turns out this is a bit of a can of worms. The RB will resubmit the job (up to 3 times) and the same thing might happen again on a different site. On the other hand, jobs running out of wall clock look the same to the user - and the RB will also resubmit them! If we do kill off jobs, should we email the user? Is this scalable in terms of our time? How much information do we provide to the end user? Will they even care?

It will be an interesting discussion.

Glasgow RB now in BDII

After humming and hawing about our "trial" RB, I decided to bite the bullet and publish it into the site, and thence top level, BDII. This makes things easier for the RTM, among other things. So far the RB has been really trouble free to run and has been lightly loaded - fingers crossed.

I think our "policy" here should be to support VOs in which we have a strong interest, e.g., ATLAS, gridpp, pheno, ScotGrid VOs. This will mean splitting our site-info.def file, though, because there will be a different list of supported VOs on the RB to the CE/SE.

Of course, the LCG-RB's days should be numbered - it seems that the gLite WMS now outperforms it. If this is true then there really is little point in trying to learn about it; instead we should put our efforts into the new system.

BDII Status


After moving back to our own BDII things have been more or less ok - we had a wobble on Sunday with 2 timeouts. Overall the load on svr019 is definitely creeping up again though - is there some systematic effect involved here?

I met Steve Traylen in R1 at CERN at lunchtime and he said that the indexing patch for the BDII has already been applied at CERN, so it seems that this is something which it's fairly safe to grab. I'll put it on the TODO list for next week.

I seem to be spending a disproportionate amount of my time worrying about BDII failures!

Thursday, June 21, 2007

Back to our own BDII

The RAL BDII timed out three times in 24 hours, so I've moved back to the ScotGrid one.

Oh, for a working information system...

lftp woes

Fed up with lftp mirroring being broken.

Discovered a nice, simple sanity check, though, for making sure we don't do a Greig:

grep ^lcd /etc/mirror.conf | awk '{print "ls -ld " $2}' | /bin/sh

Tuesday, June 19, 2007

Ganga Quickstart Guide Complete

I have finished the last few sections of the Glasgow Ganga Quickstart Guide.

In particular I have covered bulk job submission using ganga.

Swetha and I used the methods described here to submit a batch of 12 of her jobs onto the cluster this afternoon. She's very pleased to have a scalable method to submit large numbers of jobs onto the cluster - previously she was limited to ~5 jobs on the computing service cluster, which she also had to run interactively. This is much better.
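The guide itself is the authority on the exact recipe, but the flavour of bulk submission in Ganga is roughly the sketch below: an Executable job split into subjobs with ArgSplitter and sent to the LCG backend. Job, Executable, LCG and ArgSplitter are standard Ganga GPI names; the script name and run arguments are purely illustrative:

# run from inside the ganga shell (or via "ganga bulk.py")
j = Job()
j.application = Executable(exe='/bin/sh')
j.inputsandbox = ['myjob.sh']        # hypothetical user script
j.backend = LCG()
# ArgSplitter makes one subjob per argument list - here 12 subjobs,
# each running myjob.sh with a different (made-up) run tag
j.splitter = ArgSplitter(args=[['myjob.sh', 'run%02d' % i] for i in range(12)])
j.submit()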

Brain transplant leftovers

Bah! yum update failed to get the headers (404) when tested on a client node, and the pxe/dhcp/tftp part of the build process still needs work. We've also got 2 nodes that require a re-image to bring them under the full control of the build system. Looks like today is going to be spent making sure the YPF is working properly.

Stuff that needs to be done:
* clusterdb to dnsmasq - it should spit out the MAC addresses to a file ready to use.
* Check all the apache aliases for build.
* Check the autokick stuff.

Then reimage machines. Hopefully that'll reduce the number of red bits in nagios down to a less boss-scaring amount. Once that's done we can work on removing the false alarms (where we monitor non-existent services that we know aren't configured yet).

Monday, June 18, 2007

Switched Back to RAL BDII

We failed another couple of CE-RM tests over the weekend. It's clear this started when the load on the machine crept up from ~0.5 to ~1.0 about 2 weeks ago. I didn't change anything on the box, so I am mystified as to what's caused this change. Perhaps it's a greater load being put on the BDII by expanded use of the RB?

I have switched back to using the RAL BDII for the moment and we haven't failed since then. I may set up an additional top level BDII on svr017, which is the unloaded scotgrid admin node, and see if that has a lower overall load.

I will also upgrade the BDII to the new release, which uses indexes to speed up queries, and see if that helps.
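A quick and dirty way to see whether the indexed release actually helps is just to time a representative query before and after the upgrade - something like the sketch below, which shells out to ldapsearch. The hostname and port are assumptions (svr019 as mentioned above, on the standard BDII port 2170):

import os, time

# representative GLUE query against the top level BDII; host/port assumed
CMD = ("ldapsearch -x -LLL -H ldap://svr019.gla.scotgrid.ac.uk:2170 "
       "-b o=grid '(objectClass=GlueCE)' GlueCEUniqueID > /dev/null")

for i in range(5):
    start = time.time()
    status = os.system(CMD)
    print 'query %d: status %d, %.2fs' % (i, status, time.time() - start)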

Queue Tweaks and Maui

As we've been really full recently, I have reduced the maxwallclock available to grid VOs from 148 to 100 hours. The maxcpu time stays the same at 96 hours. I'm growing very frustrated with jobs which just stall at the start - we had 9 atlas jobs which consumed 1s of cpu time in 9 hours, hanging on an lcg-cp.

I also increased the maxcpu and wallclock on the gridpp queue to 168 hours, to make sure that Swetha's bio jobs run through ok - 96 hours was probably too close to the wire. We can cut our local users a bit more slack as their jobs, when they do run, tend to be almost 100% efficient.

I had to reduce the maxprocs on the glee queue to 400 - we can't really afford to get the whole cluster filled with EE jobs as their maxcpu/wallclock is so high at 28 days, and this will completely mess up fairsharing.

We've been suffering this weekend from two very large job surges from pheno and glee. As these groups have a large, but underused, fairshare, they get to start an awful lot of jobs in a short time - and as they run for a very long time, the cluster ends up with very few job slots coming free and can't run anything for anyone else. 100 hours spread over 560 slots means a slot should free up roughly every 11 minutes on average, but the job inflow, when local users are involved, is far from uniform.

I would like to move the *sgm jobs from atlas and lhcb into the dteam/ops reserved job slot. I will have to ask Sam how to do this.

Friday, June 15, 2007

A Less Reliable Week

After weeks of perfection, I'm now picky about even a small number of failures.

We had 4 BDII timeouts this week, which is worrying. My inclination is to give R-GMA a hard stare - its close waits have been misbehaving again and we see load/network spikes which seem to be R-GMA related. However, the BDII was perfect even when R-GMA was occupying more than 1000 close waits, so the connection is far from clear. When I looked with top, it actually seemed that most of the CPU was being consumed by slapadd and slapd. Well, one to keep an eye on.

We also had a couple of JS failures with the error message "Got a job held event, reason: Globus error 79: connecting to the job manager failed.". The GOC wiki suggests this is probably a networking problem; however, that seems unlikely. Is this the gatekeeper deciding that the presented certificate does not match that of the job submitted? There's also a suggestion that the GLOBUS_TCP_PORT_RANGE might be wrong, but we've never changed this from the default 20000-25000 range, so that also seems unlikely.

Again, this probably requires some detailed examination of the gatekeeper logs to see if the connection got through.

Start Them Young...

Spent most of the last 2 days showing S5/6 pupils round the computer cluster and talking about the LHC. It was quite fun, but rather difficult to give a talk in a noisy, air conditioned room - and their time was so short there was no chance of taking them to a quieter room to speak to them. I scooped up various posters from around the department (proton model, aerial view of LHC, ATLAS, ScotGrid poster) and had the RTM running on the projector.

Unfortunately RTM was a bit broken, but they could still see Europe pulsing with computer centres and be impressed.

Impressive storage facts (measured in 8GB iPod units):
  1. The LHC would fill an 8GB iPod in < 10s
  2. If you stored a year of LHC data on 8GB iPods the pile would be 12.5km high.
  3. If you tried to listen to all of that iPod data as music, it would take you almost 30 000 years.
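For the curious, those numbers come from simple back-of-the-envelope arithmetic along the lines below; the input figures (roughly 15 PB of LHC data per year, ~6.5mm per iPod, ~1MB per minute of music) are my assumptions for the sketch rather than anything official:

# back-of-the-envelope check of the iPod facts (input figures are assumptions)
DATA_PER_YEAR_GB = 15e6          # ~15 PB of LHC data per year
IPOD_GB = 8.0
IPOD_THICKNESS_MM = 6.5          # roughly an iPod nano
MUSIC_MB_PER_MIN = 1.0           # typical MP3 bitrate

ipods = DATA_PER_YEAR_GB / IPOD_GB
print 'iPods per year: %.0f' % ipods
print 'pile height: %.1f km' % (ipods * IPOD_THICKNESS_MM / 1e6)

minutes = DATA_PER_YEAR_GB * 1024.0 / MUSIC_MB_PER_MIN
print 'listening time: %.0f years' % (minutes / (60 * 24 * 365))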
School kids often affect extreme levels of disengagement, but most of them seemed to enjoy it.

Come and study physics then!

Wednesday, June 13, 2007

Bad Worker Node, Bad, Bad...

node040 had the wrong gateway, via svr031 instead of nat005. When I routed it through nat005, external networking started to work. When svr031 was reinstalled we turned off packet routing on it - but node040 was down at the time we switched the rest of the cluster over to nat005 (when svr031 lost its brain). It never got the change, and the default gateway is set at install time - not controlled by cfengine.

In theory this should never happen again - all the install time network files are now correct and nodes should be re-installed when they are brought back into service. However, a nagios monitor which tests external networking should be implemented.
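That nagios check could be as simple as the sketch below - a hypothetical NRPE-style plugin (not an existing check) that pings an external host from the WN and screams if there's no route out. The target host is illustrative:

import os, sys

TARGET = 'www.gla.ac.uk'        # hypothetical external host to test against

# one ping, five second deadline; a non-zero status means no external route
status = os.system('ping -c 1 -w 5 %s > /dev/null 2>&1' % TARGET)

if status == 0:
    print 'OK: external network reachable'
    sys.exit(0)
else:
    print 'CRITICAL: cannot reach %s - check the default gateway' % TARGET
    sys.exit(2)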

The reason the problem kicks in when the cluster gets full is that node040 can't run jobs (it can't get the input sandbox from the RB), so the RB job wrapper gives up after ~5 minutes. This frees up the job slot and another job gets sucked in to its doom. When the cluster is less full the effect is much more sporadic, as a resubmitted job is likely to go to a different (functioning) node. When I looked at the logs I saw that node040 killed a remarkable number of jobs in its 8 day reign of terror: 1835 out of 11885 were sent to their doom (15.4%).

For interest, I wrote a little python log parser which prints, per node, the number of jobs and the average cpu and wall time (in minutes). Even just running it over Thursday's pbs log shows up node040 as a bad place to be:

svr016:/var/spool/pbs/server_priv/accounting# nodestat.py 20070607 | sort -k 2 -n
node043: 4 1069.9 1165.9
node045: 4 1080.3 1178.0
node048: 4 1185.8 1195.9
node013: 5 1171.3 1345.5
node032: 5 959.9 1030.7
[...]
node088: 12 438.0 445.3
node019: 13 433.7 437.4
node023: 13 667.8 719.0
node103: 13 397.1 430.7
node139: 62 70.4 80.3
node040: 646 0.5 4.7
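The parser itself isn't reproduced here, but the guts of something like nodestat.py are sketched below. It assumes standard torque accounting records ("date time;E;jobid;key=value ..."), pulling resources_used.cput, resources_used.walltime and exec_host out of the job end ('E') records - a reconstruction, not the actual script:

import sys

def hms_to_minutes(hms):
    # convert "HH:MM:SS" into minutes
    h, m, s = [int(x) for x in hms.split(':')]
    return h * 60 + m + s / 60.0

stats = {}   # node -> [number of jobs, total cput, total walltime]

for line in open(sys.argv[1]):
    fields = line.strip().split(';', 3)
    if len(fields) < 4 or fields[1] != 'E':
        continue
    attrs = dict(kv.split('=', 1) for kv in fields[3].split() if '=' in kv)
    try:
        node = attrs['exec_host'].split('/')[0]
        cput = hms_to_minutes(attrs['resources_used.cput'])
        wall = hms_to_minutes(attrs['resources_used.walltime'])
    except KeyError:
        continue
    n = stats.setdefault(node, [0, 0.0, 0.0])
    n[0] += 1
    n[1] += cput
    n[2] += wall

for node, (njobs, cput, wall) in stats.items():
    print '%s: %d %.1f %.1f' % (node, njobs, cput / njobs, wall / njobs)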

Note that node139 is the node most of the ops tests run on, because they have a reserved job slot - even though the reservation isn't strictly tied to a node, the tests rarely run anywhere else. This is really a pain, because otherwise I'm sure we'd have picked up the problem earlier - ops jobs never get resubmitted, so one landing on node040 would have failed visibly. Perhaps we should remove that reservation. I'm fairly confident that with 500+ job slots we'll get something coming free in under an hour (6 days / 500 ~ 20min).

Thursday, June 07, 2007

Glasgow CE Flaky



It looks like we're having CE problems at Glasgow. We failed a SAM test at 1am, with the error "Globus error 79: connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ...". This is very reminiscent of the errors seen by our engineers who use GRAM submission.

We also seemed to fail to run atlassgm tests for long enough that we've been blacklisted by ATLAS in the FCR - though here the tests are just missing, so I don't know what went wrong.

We passed a test at 1030, so hopefully we'll be back in soon.

It's urgent that I get to the bottom of this.

I have checked the gatekeeper logs, and the jobs are being mapped properly to atlassgm at regular intervals. I have checked the WNs and there's nothing evil here - ssh working, disks not full, nfs mounts ok. The exit status of all the jobs from the batch system is 0. The failing jobs were not consistently given to one WN, which might otherwise have explained the issue. We even passed tests on node040 yesterday, then failed in the early hours of this morning.

Help!

Tuesday, June 05, 2007

An Unhappy Night With Steve's Tests

I'm very pleased that Glasgow is the top site for Steve's ATLAS tests, but last night we seemed to have a miserable time, failing ~10/30 tests. These were all ABORTS from the RB, which was the IC RB in each case. And all of the successful jobs also came from the IC RB - so it wasn't that we had completely fallen out. And Glasgow's the only site affected, so I think it must be a site issue. However, there just isn't enough information in the logfiles to be able to tell why the jobs are aborting.

The first thing I checked was autofs (I added a new map yesterday), but this was ok. /home and /tmp are also fine. I'll have to dig into torque and see what I can find.

It really is annoyingly hard to pin these things down in the baroque dance which is EDG job submission...

Monday, June 04, 2007

SAM tests changed VOMS Role (without warning!)

We, and a large fraction of the rest of the grid, started to fail replica management tests late on Friday night. At first I thought it must be a catalog problem at CERN, so I raised a ticket. However, it turned out that what had actually happened was that the VOMS role used to submit the SAM tests had changed. This caused DPM to map the SAM tester's DN into a different group - which then did not have permission to write into the default generated directory for lcg-cr.

This change was made completely unannounced, and I suspect without any real thought as to the implications for sites using DPM 1.6.3 and earlier.

Maarten Litmaath helpfully posted a fix-up script on LCG-ROLLOUT, which uses ACLs to grant suitable privileges to lcgadmin and production roles for each supported VO, which I applied to Glasgow and Durham at about midnight last night (I was surely violating cardinal rules of sysadmining, but I couldn't see how it would cause harm - and this time I got away with it). This did fix the problem.

I'm really annoyed about this, though. Changes like this should never, ever be made on a Friday! (It seemed the change actually came through at ~10am, but didn't break until midnight, when the next YYYY-MM-DD directory needed to be created.) In addition several people have commented that the fix is to upgrade to DPM 1.6.4 - despite the fact that this is broken in gLite 3.0r25 in two significant ways!

Grrrr. I just hope they don't ask us to explain these SFT failures - they shall have a piece of my mind... (I sound just like my Mum, when she was annoyed - see what the grid's doing to me!).

Sunday, June 03, 2007

Transfer Tests

Continuing with transfer tests between RAL-T2 and Glasgow over the weekend to investigate the effect of the Streams setting within glite-transfer-channel-set. Turns out the effect is not very clear - negligible at best. As Greig has already noted that -T 1 is best for dCache, I propose we leave it at that.

However, totalling up the traffic for the weekend, I've moved 6.3 TB since Friday afternoon (41h), which gives us a sustained average bandwidth of 343Mb/s.

But... that's only 6.3 TB out of a requested 7.5 TB. Only 7/15 transfers completed successfully with all 500 files being transferred. The others copped out with:

(mostly) FTS Reason: Failed on SRM get: Failed To Get SURL. Error in srm__get: service timeout.
(some) FTS Reason: Failed on SRM get: SRM getRequestStatus timed out on get
(twice) FTS Reason: Failed on SRM put: Failed SRM put on httpg://svr018.gla.scotgrid.ac.uk:8443/srm/managerv1 ; id=... call. Error is File exists


So, considering I was transferring the same 50 seed files, that's quite a lot of crapness on behalf of the dCache source. Not sure what caused the two false "file exists" failures. I severely doubt that a more than 10% failure rate is acceptable to the experiments.

I'll present the full findings (once I've plotted them) at the GDB Meeting on Tuesday

Worth noting that Paul's MonAMI data was very useful in keeping an eye on the dteam-specific data pool usage, together with the health of the DPM service at Glasgow.