Friday, August 17, 2007

TPM Round Up

It was TPM week again. I probably didn't spend more than about 8 hours over the week dealing with things, so it remains a background task against which other things can be done. Between Pete and myself we managed quite a reasonable response time. Spam tickets continue to be greatly annoying - they amounted to ~30-40% of all tickets (I also cleaned up the ones that had been submitted to VO support units when I spotted them).

Pheno Attacks!

Well, to round off a dreadful week in terms of users slapping the whole system about, the CE load spiked and we hit a System CPU storm again this afternoon. Fortunately this time I caught it within an hour and killed off the offending processes, and the system managed to recover ok.

Very oddly it was caused by a pheno user's gatekeeper processes stalling - gobbling CPU, but failing to submit any jobs onto the queue. I had to add their DN to the banned list and kill off these processes pending further investigation.

Action plan:
  • Nagios alarm on cpu_system > 10% (sketch below).
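
For the record, the check itself is simple enough - here's a minimal sketch of a Nagios-style plugin (the thresholds and the 5 second sampling window are just assumptions, not a deployed configuration):

    #!/usr/bin/env python
    # Minimal sketch of a Nagios-style check on system CPU time.
    # The thresholds and the 5 second sampling window are assumptions.
    import sys, time

    def cpu_times():
        # First line of /proc/stat: "cpu user nice system idle iowait irq softirq ..."
        fields = open("/proc/stat").readline().split()[1:]
        values = [int(v) for v in fields]
        return values[2], sum(values)      # (system jiffies, total jiffies)

    WARN, CRIT = 5.0, 10.0                 # percent of CPU time spent in the kernel

    sys1, tot1 = cpu_times()
    time.sleep(5)
    sys2, tot2 = cpu_times()
    system_pct = 100.0 * (sys2 - sys1) / max(tot2 - tot1, 1)

    if system_pct >= CRIT:
        print("CRITICAL: system CPU %.1f%%" % system_pct)
        sys.exit(2)
    elif system_pct >= WARN:
        print("WARNING: system CPU %.1f%%" % system_pct)
        sys.exit(1)
    print("OK: system CPU %.1f%%" % system_pct)
    sys.exit(0)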

ScotGrid and AliEn

After a long hiatus, Dan and I sat down to look at getting the AliEn system submitting properly into ScotGrid. Battling through OO Perl (yuk!), we found the various points where the system seemed to lack configuration information - basically a value for the site BDII to query for free slots, a CE queue name and the name of the VO (gridpp). Once we'd hacked those values into AliEn's LCG.pm, things started to look a lot rosier.
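
For reference, the free-slots query that AliEn needs to make against the site BDII amounts to something like this (a python-ldap sketch; the host, port and base DN are placeholders rather than our real values):

    # Sketch of the free-slots query AliEn needs to make against the site BDII.
    # The host, port and base DN below are placeholders, not our real values.
    import ldap

    BDII_URI = "ldap://site-bdii.example.ac.uk:2170"
    BASE_DN  = "mds-vo-name=local,o=grid"        # site BDII base DN (assumption)
    FILTER   = "(objectClass=GlueCE)"

    conn = ldap.initialize(BDII_URI)
    conn.simple_bind_s()                         # anonymous bind

    for dn, attrs in conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE, FILTER,
                                   ["GlueCEUniqueID", "GlueCEStateFreeCPUs"]):
        ce   = attrs.get("GlueCEUniqueID", ["?"])[0]
        free = attrs.get("GlueCEStateFreeCPUs", ["?"])[0]
        print("%s : %s free CPUs" % (ce, free))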

However, I think that AliEn is going to want a dedicated queue, because of the way it parses the BDII's output, which inclines us towards setting up a VO for panda sooner rather than later.

Glasgow and Phenogrid

We're trialling shell access for some Durham phenogrid users here at Glasgow, giving them the ability to rsync the latest builds of alpgen onto the cluster then fire off grid jobs against them.

Although in some ways this is a step backwards, the current EGEE/LCG software deployment mechanisms are too cumbersome for the kind of work that most of the Durham users want to do. In giving them access to a much larger tranche of ScotGrid resource we do help to promote grid work over the local queues at Durham (a smaller, older cluster). If it works, there's the motivation to find a more generic solution to the software problem across the whole grid.

Durham Network Outage

Durham suffered a network outage last night, losing all connectivity to the outside world. The grid systems themselves recovered well this morning, but we suffered about 7 hours of downtime.

SL4 Upgrade News

There seem to be enough outstanding issues with the SL4 upgrade that we have decided to hold off for now in ScotGrid. We will review the situation again after CHEP (w/o 10 Sept), and if there are no show stoppers at this time, then the week of the 17th will be the upgrade week for Glasgow and Durham.

At Edinburgh the ECDF resource is SL4 anyway, so getting this up and running will be the perfect way to iron out ScotGrid SL4 issues. It also makes more ScotGrid resource available, rather than risking currently functioning resources at Glasgow and Durham.

Tuesday, August 14, 2007

Bad, bad, biomed....

A very flaky day - we had a biomed user throw jobs into the system which were trashing worker nodes by filling up /tmp. This caused lots of nodes to get into a weird state where they seemed to run out of memory (ssh and ntp nagios alarms firing). Jobs couldn't start properly on these nodes, so they became black holes: one of our local users lost 47 of 50 jobs, another lost 124 of 150.

We then started to fail SAM tests and drop out of the ATLAS BDII; Steve's tests couldn't resource match us, and so on. Bad day.

It took quite a few hours to sort out the mess, and a further few hours to stabilise the site.

Our GGUS ticket names and shames.

It's a very different ball game on the grid - we have 6000+ users who can submit jobs, and it's not hard to kill a worker node. Torque generally handles this badly and all hell breaks loose.

Action plan:
  1. Nagios alarms on WN disk space (see the sketch below)
  2. Group quotas on job scratch area
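
For the first of these, something along these lines run on each worker node via NRPE would do (a minimal sketch; the paths and the 10% threshold are assumptions):

    # Minimal sketch of a worker-node disk space check for Nagios/NRPE.
    # The paths and the 10% threshold are assumptions, not production values.
    import os, sys

    CHECKS = {"/tmp": 10.0, "/local": 10.0}      # partition -> minimum % free

    problems = []
    for path, min_free in CHECKS.items():
        st = os.statvfs(path)
        pct_free = 100.0 * st.f_bavail / st.f_blocks
        if pct_free < min_free:
            problems.append("%s only %.1f%% free" % (path, pct_free))

    if problems:
        print("CRITICAL: " + ", ".join(problems))
        sys.exit(2)
    print("OK: disk space fine")
    sys.exit(0)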

DPM Gridftp Resource Consumption

Durham was suffering from excessive resource consumption from "hung" dpm.gsiftp connections from ATLAS transfers. Because of the way that gridftp v1 servers work, huge buffers were being held in memory, leading to resource exhaustion on the machine and a subsequent crash.

Phil and I discussed this, and I noticed that the active network connections were to the RAL FTS server, not to the source SRM, so it looked like it was the control channel which was hung open, not the data channel.

Greig had a look on Glasgow's servers and discovered the same problem, but we were relatively unaffected due to the whopping 8GB of RAM we have in each disk server (and by having 9 disk servers, presumably). Cambridge also reported problems.

The issue is being looked at by the DPM developers, but for the moment Phil's had to write a cron script to kill off the hung ftps to keep gallow's head above water.
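
Phil's script is his own, but the general shape is roughly this (a sketch only: the process name pattern and the age cutoff are assumptions, and a real version would want to be more careful about what counts as hung):

    # Sketch of a cron job that kills long-lived ("hung") gridftp processes.
    # The process name pattern and age cutoff are assumptions; Phil's real
    # script may be cleverer about deciding what is genuinely hung.
    import os, signal

    PATTERN = "ftpd"        # substring matching the DPM gridftp server process
    MAX_AGE = 6 * 3600      # seconds; anything older than this is assumed hung

    def etime_to_seconds(etime):
        # ps etime format is [[DD-]HH:]MM:SS
        days = 0
        if "-" in etime:
            d, etime = etime.split("-")
            days = int(d)
        parts = [int(p) for p in etime.split(":")]
        while len(parts) < 3:
            parts.insert(0, 0)
        hours, mins, secs = parts
        return ((days * 24 + hours) * 60 + mins) * 60 + secs

    for line in os.popen("ps -eo pid,etime,comm").readlines()[1:]:
        pid, etime, comm = line.split(None, 2)
        if PATTERN in comm and etime_to_seconds(etime) > MAX_AGE:
            print("killing %s (%s, running %s)" % (pid, comm.strip(), etime))
            os.kill(int(pid), signal.SIGTERM)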

Maximum Queueable Jobs

From our two phenogrid DOS attacks, it seems that the maximum number of queued jobs the system can cope with is about 2500. Beyond this the system slides into crisis: too many active gatekeeper processes run it out of CPU, a context-switch storm starts, and it seems the system can rarely recover spontaneously.

So, I have set the max_queuable parameter to 1000 on every queue, which seems a reasonable number for any single VO or queue.

It seems a limitation of Torque that it cannot also apply a global cap on queued jobs (at 2500, for instance); the parameter can only be set per queue.
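
For reference, applying the per-queue cap is just one qmgr call per queue, e.g. wrapped in a small script like this (the queue names are illustrative, not our full list):

    # Apply the per-queue cap on queued jobs via Torque's qmgr.
    # The queue names here are illustrative, not our full list.
    import os

    QUEUES = ["atlas", "lhcb", "pheno", "biomed"]
    MAX_QUEUABLE = 1000

    for q in QUEUES:
        cmd = 'qmgr -c "set queue %s max_queuable = %d"' % (q, MAX_QUEUABLE)
        print(cmd)
        os.system(cmd)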

Monday, August 13, 2007

MonAMI keeps on working...

Gosh, I'm tickled pink at how well MonAMI coped with Glasgow's job storm last Thursday.

To put this in context, we had a CE that was running at a constant 100% CPU usage: 50% in kernel (context-switching) and 50% in user-land (running Perl scripts). The ssh daemon wasn't working properly any more: the ssh client would (almost always) time out because the ssh server was taking too long to fork. The machine's 1-min load average peaked at ~300!

All in all, this was an unhappy computer.

Despite all this, MonAMI just kept on going. As matters got progressively worse, it took longer and longer to gather data, particularly from Maui. From the normal value of less than a second, the Maui acquisition time peaked at around 15 minutes. Torque fared better, peaking at around 30s (still far longer than normal).

Despite this, MonAMI didn't flood Torque or Maui. It only ever issued one request at a time and enforced a 1-minute gap between successive requests. MonAMI also altered its output to Ganglia to compensate for taking 15 times longer than normal. This prevented Ganglia from (mistakenly) purging the metrics.
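
That serialise-and-back-off behaviour is worth copying in any home-grown monitoring. Roughly, the pattern is this (just the idea, not MonAMI's actual code; the names and numbers are invented):

    # The pattern MonAMI's behaviour suggests (not its actual code): only one
    # request in flight at a time, a minimum gap between requests, and a metric
    # lifetime scaled up when acquisition runs slow so Ganglia doesn't purge it.
    import time

    MIN_GAP      = 60.0     # seconds between the start of successive requests
    DEFAULT_DMAX = 120      # default Ganglia metric lifetime, in seconds

    def poll_forever(acquire, publish):
        """acquire() gathers the data (may be slow); publish(data, dmax) ships it."""
        while True:
            started = time.time()
            data = acquire()                          # serialised: one request at a time
            took = time.time() - started
            dmax = max(DEFAULT_DMAX, int(4 * took))   # stretch lifetime if we're slow
            publish(data, dmax)
            time.sleep(max(MIN_GAP - took, 0.0))      # enforce the minimum gap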

So, although everything was running very, very slowly and it was difficult to log into the machine, the monitoring kept working and we have a record of what was happening to the jobs.

Incidentally, the failing ssh is why most (all?) of the jobs were going into wait-state: the worker node mom daemons couldn't stage-in, via scp, the files the jobs needed. This meant the job failed to be accepted by the WN, causing torque (or maui?) to reschedule it for some time in the future, putting it into wait-state.

Welcome Back! (sic): Picking Up The Pieces

Fool that I am, I opened my laptop after getting back from Gairloch on Saturday night. As I now have Paul's MonAMI torque plots on my Google homepage, I could see that the number of running jobs was down to almost zero. This was unexpected. A quick review of the SAM pages and monitoring plots showed the pheno job storm on Thursday had killed the CE off, big time. Being thoroughly disinclined to engage in extensive debugging late on Saturday night, and knowing the machine needed a new kernel anyway, I rebooted the CE. Whatever the residual problem was, this cleared it. Within minutes LHCb jobs were coming in and starting properly - indeed within about 6 hours they managed to fill the entire cluster again.

The remaining problem was then timeouts on the CE-RM test. This was puzzling, but not critical for the site, so I left things as they were at this point, pending further investigation. Then I recalled that this can happen if a period of intense DPM stress (Greig and Billy have been preparing the CHEP papers) causes threads in DPM to lock up or block, leaving few threads available to service requests. I restarted the DPM daemon and bingo! all was well again. The next time this happens we should look in MySQL for pending requests (which get cleared when DPM restarts); however, at 10pm on a Sunday night getting things working quickly was all I cared about.
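
For next time: the pending requests live in DPM's MySQL database, so a quick look before restarting would be something like this (the table and column names are assumptions rather than checked against the schema):

    # Sketch: count pending DPM requests by status before restarting the daemon.
    # The table and column names are assumptions -- check against the real
    # dpm_db schema before trusting this.
    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="dpm_reader",
                           passwd="XXXXXX", db="dpm_db")
    cur = conn.cursor()
    cur.execute("SELECT status, COUNT(*) FROM dpm_req GROUP BY status")
    for status, count in cur.fetchall():
        print("status %s : %d requests" % (status, count))
    conn.close()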

My car broke down yesterday too, so I'm off to Halfords to buy it some new brake pads; but at least that happened after I came back from holiday. If only the site had the good grace to do the same. I think I get to be grumpy about this for at least 2 days.

Thursday, August 09, 2007

Pheno goes bang, take two!

A problem started at about 02:45 this morning. The large number of pheno jobs that had accumulated in queued state started to fail when run. Once failed, a job would go into waiting state, triggering maui to decide which job to run next.

With the current usage and fairshares, Maui's decision is to run the (apparently) broken pheno jobs. This keeps the server load high and starves the cluster of long-running jobs (there have been 1-min average load spikes of over 600!).

Look familiar? Here's an entry with very similar symptoms.

I'm in the process of trying to get to the bottom of what's actually happening, but I've started deleting the jobs as they clearly cannot run and are having a detrimental effect on the cluster.
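
Deleting them is straightforward with qselect and qdel - roughly this (the pool account name is a placeholder):

    # Delete all of one user's queued and waiting jobs via qselect/qdel.
    # The pool account name is a placeholder.
    import os

    USER = "pheno001"       # placeholder pool account

    job_ids = os.popen("qselect -u %s -s QW" % USER).read().split()
    for job_id in job_ids:
        print("qdel %s" % job_id)
        os.system("qdel %s" % job_id)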

Wednesday, August 01, 2007

supernemo.vo.eu-egee.org ok

Gianfranco had tried to run some supernemo test jobs through Glasgow last week and had not succeeded. In addition, I was concerned that some VO members (him included) were not in the grid-mapfile.

Of course, it turns out that Gianfranco and others are also members of other VOs, and as the grid-mapfile can only contain one entry per user, they appeared under their other VOs. In addition, Gianfranco had been using a vanilla proxy to submit the job; the gatekeeper therefore mapped him to ATLAS and the submission to the supernemo queue failed.

I emailed him the correct VOMS client files for supernemo, so that he could generate a VOMS proxy. After he did that his test jobs ran fine.

So, our first DNS VO, supernemo.vo.eu-egee.org works.

Monday, July 30, 2007

CIC Portal Reports

As I was on holiday on Friday, I tried to fill in the CIC portal report quite late on and it was locked.

Frankly the interface on the report section of the CIC is really rather rubbish (no aggregation, unreliable locking, difficult to review) and the 1 day time window has always been restrictive.

As it's clear now that the CIC "availability", where sites get to mark up failures as relevant/non-relevant/unknown, is a thing of the past (gridview will be used, warts and all...), the whole thing looks rather broken as a way for us to tell Jeremy and Phillipa which issues to report in the Ops meeting.

However, I checked the gridview page. Looks like a quiet week. Our one CE test failure was the infamous Globus 79...

Report: Quiet week. Ran lots of jobs ;-)

Monday, July 23, 2007

Job submission with DIRAC

In order to get some "real user" experience of performing physics analysis on the Grid, I have been doing a lot of reading and playing with the LHCb computing software. First of all, there's a lot of it, so it takes a while to understand what each component does, how the components link together, how they are configured and built, and how the applications can be run locally or on the Grid to do some real physics.

I was particularly interested in getting some basic jobs running on the Grid, so I quickly started playing with Ganga, the user interface for job configuration and submission. At first I was quite impressed. It was very simple to use Ganga to submit small jobs to the local system, CERN batch or the Grid (via the LHCb DIRAC workload management system). However, a few problems quickly appeared:

1. Jobs were continually failing on the Grid due to poorly configured software installations on the sites. Missing libraries were the main source of problems. It also seems that the latest version of Gauss, v30r3 (LHCb MC generation), is a bit broken due to a mis-configured path. These things weren't problems with Ganga as such, but using it meant there was potentially another layer to debug.

2. I found bulk job submission very difficult in Ganga. Writing the python code to loop over the jobs is easy, but the client just couldn't handle the hundreds of jobs going through it: it became very slow and eventually just hung. Even starting up the client is slow; maybe running on a non-lxplus machine would be better. There were also inconsistencies between the job monitoring in Ganga and that reported by DIRAC.

As an alternative, I decided to bypass Ganga and use the DIRAC API directly. This proved to be quite successful, being much faster for bulk submission. I put together some notes on this, which can be found here:

http://twiki.cern.ch/twiki/bin/view/Main/LHCbEdinburghGroupDIRACAPI
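
The gist is that bulk submission becomes a short loop over the API - something like the sketch below (the module paths and method names follow later DIRAC releases and may not match the 2007 client exactly; the executable is a placeholder):

    # Sketch of bulk submission through the DIRAC API. Module paths and method
    # names here follow later DIRAC releases and are assumptions; the real jobs
    # also set up the LHCb application environment rather than a bare script.
    from DIRAC.Interfaces.API.Dirac import Dirac
    from DIRAC.Interfaces.API.Job import Job

    dirac = Dirac()
    submitted = []

    for i in range(100):
        job = Job()
        job.setName("bulk_test_%03d" % i)
        job.setExecutable("run_analysis.sh", arguments=str(i))   # placeholder script
        result = dirac.submit(job)
        if result["OK"]:
            submitted.append(result["Value"])
        else:
            print("job %d failed to submit: %s" % (i, result["Message"]))

    print("submitted %d jobs" % len(submitted))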

Using DIRAC didn't help with the site mis-configurations (although it is easy to get the job output and check the log files for problems), but I found it a more efficient way of working. I'll try again with Ganga once I understand better the problems that keep on appearing on the Grid.

From my brief foray into running jobs on the Grid, it appears that Ganga/DIRAC do insulate users from malfunctioning middleware; however, there are still real problems when it comes to poorly installed software on the sites. From a deployment point of view, maybe this should be taken as encouragement, as the problems are at the application level rather than with the middleware. We would need a more systematic study (much like Steve's ATLAS jobs) to confirm this.

What is needed is better testing of the sites through VO-specific SAM tests. This information then has to be fed back into DIRAC (or whatever) so that mis-configured sites can be ignored until their problems are resolved. Users will then find running jobs on the Grid a much easier and more pleasant experience.

Return of Globus Error 79...

Glasgow suffered a bit from the infamous Globus Error 79 last week - the one we think might be the unexplained gatekeeper identity error. In fact it was a bit of a flaky week altogether - Steve Lloyd's tests seemed to be suffering from some RB issues and most sites dropped into the 70% efficiency range.

Overall, though, even the "bad" weeks are not so bad: as the gridview availability plot shows (remember this is a "warts and all" plot - no excuses or chances to mark things as non-relevant), we still seem to be 95% plus.

However, the gatekeeper error is a real pest and I still don't have a good way of even trying to get a handle on it. I checked in the gatekeeper logs, against some known error events of this type (David found these on 11 June), but alas there's no clear signature.

Holiday Time: Durham Survives

Phil's been away for the last three weeks, with responsibility for Durham falling between myself (to respond to tickets and advise on grid problems) and Lydia (on the ground to press the buttons). This has worked pretty well - we have managed to deal with helmsley locking up and needing to be rebooted (week 3 in the graph), and a period of scheduled downtime (week 4, which revealed a problem in downtime synchronisation between SAM and the GOC).

A sterner test will come at Glasgow in a fortnight when Andrew and I are both on holiday and are more or less uncontactable. Time for icons and prayers?

ECDF Update

The procurement of our grid front-end nodes for ECDF has been held up for about 2 weeks. Sam had tried to set up one of the old Edinburgh worker nodes as a trial CE to iron out any glitches in the gatekeeper scripts, job manager and accounting chain, but he has hit an ACL in some router between ECDF and KB which prevents him from even being able to qsub.

The systems team are going to give him a requisitioned worker node to test with instead, which should happen this week - although as they are going live today and it's holiday time this has also been somewhat held up.

Friday, July 20, 2007

SAM Gets Downtime Wrong?

Durham were in downtime from Wednesday -> Thursday, but SAM thinks they were in downtime from Thursday -> Friday. Doubtless this happened because I made the initial mistake (out by 1 day!), then edited the downtime to bring it forward. But SAM did not pick up the change.

I've raised a GGUS ticket - after all I'm always editing downtimes!

Glasgow to Edinburgh Lightpath Approved

After messing about for ages with application forms and revised procedures, our application, when finally submitted to JANET, was approved within 24 hours! They are now conducting a technical feasibility study, but no news on how long that might take.

Update: JANET people say this study will take "the shorter end of 'a few weeks'", which is good news.