Friday, August 17, 2007

TPM Round Up

It was TPM week again. I probably didn't spend more than about 8 hours over the week dealing with things, so it remains a background task against which other work can be done. Between Pete and myself we managed quite a reasonable response time. Spam tickets continue to be greatly annoying - they amounted to ~30-40% of all tickets (I also cleaned them up when they'd been submitted to VO support units, when I saw them).

Pheno Attacks!

Well, to round off a dreadful week in terms of users slapping the whole system about, the CE load spiked and we hit a System CPU storm again this afternoon. Fortunately this time I caught it within an hour and killed off the offending processes, and the system managed to recover ok.

Very oddly it was caused by a pheno user's gatekeeper processes stalling - gobbling CPU, but failing to submit any jobs onto the queue. I had to add their DN to the banned list and kill off these processes pending further investigation.

Action plan:
  • Nagios alarm on cpu_system > 10% (see the sketch below).
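
A minimal sketch of such a check, assuming a simple Nagios plugin run via NRPE on the CE; the 5-second sampling window and the warning/critical thresholds are guesses to be tuned:

  #!/bin/bash
  # check_cpu_system: alarm when the kernel (system) share of CPU time is too high.
  # Thresholds and the 5s sampling window are assumptions - tune to taste.
  WARN=${1:-10}
  CRIT=${2:-25}

  read_stat() { awk '/^cpu /{print $2+$3+$4+$5+$6+$7+$8, $4}' /proc/stat; }

  read total1 sys1 < <(read_stat)
  sleep 5
  read total2 sys2 < <(read_stat)

  sys_pct=$(( 100 * (sys2 - sys1) / (total2 - total1) ))

  if [ "$sys_pct" -ge "$CRIT" ]; then
      echo "CPU SYSTEM CRITICAL - ${sys_pct}% system time"; exit 2
  elif [ "$sys_pct" -ge "$WARN" ]; then
      echo "CPU SYSTEM WARNING - ${sys_pct}% system time"; exit 1
  else
      echo "CPU SYSTEM OK - ${sys_pct}% system time"; exit 0
  fi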

ScotGrid and AliEn

After a long hiatus, Dan and I sat down to look at getting the AliEn system submitting properly into ScotGrid. Battling through OO Perl (yuk!) we found the various points where the system lacked configuration information - essentially the site BDII to query for free slots, a CE queue name, and the name of the VO (gridpp). Once we'd hacked those values into AliEn's LCG.pm, things started to look a lot rosier.
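
For reference, the sort of query AliEn ends up making can be reproduced by hand with ldapsearch; the host, LDAP base and CE/queue name below are placeholders rather than the real ScotGrid values:

  # query a site BDII (placeholder host/base) for the free slots on one CE queue
  ldapsearch -x -H ldap://site-bdii.example.ac.uk:2170 \
      -b "mds-vo-name=ExampleSite,o=grid" \
      '(GlueCEUniqueID=ce.example.ac.uk:2119/jobmanager-lcgpbs-gridpp)' \
      GlueCEStateFreeCPUs GlueCEStateWaitingJobs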

However, I think that AliEn is going to want a dedicated queue, because of the way it parses the BDII's output, which inclines us towards setting up a VO for panda sooner rather than later.

Glasgow and Phenogrid

We're trialling shell access for some Durham phenogrid users here at Glasgow, giving them the ability to rsync the latest builds of alpgen onto the cluster then fire off grid jobs against them.
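
In practice that means something like the following (the hostname and paths here are made up for illustration, not the real login node or software area):

  # push the latest alpgen build to the shared software area on the cluster...
  rsync -avz alpgen-v2.12/ duruser@ui.example.ac.uk:/cluster/share/pheno/alpgen-v2.12/
  # ...then submit grid jobs whose wrapper scripts point at that shared area.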

Although in some ways this is a step backwards, the current EGEE/LCG software deployment mechanisms are too cumbersome for the kind of work that most of the Durham users want to do. By giving them access to a much larger tranche of ScotGrid resource we help to promote grid work over the local queues at Durham (which is a smaller, older cluster). If it works, then there's the motivation to find a more generic solution to the software problem over the whole grid.

Durham Network Outage

Durham suffered a network outage last night, losing all connectivity to the outside world. The grid systems themselves recovered well this morning, but we suffered about 7 hours of downtime.

SL4 Upgrade News

There seem to be enough outstanding issues with the SL4 upgrade that we have decided to hold off for now in ScotGrid. We will review the situation again after CHEP (w/o 10 Sept), and if there are no showstoppers at that point, then the week of the 17th will be the upgrade week for Glasgow and Durham.

At Edinburgh the ECDF resource is SL4 anyway, so getting this up and running will be the perfect way to iron out ScotGrid SL4 issues. It also makes more ScotGrid resource available, rather than risking currently functioning resources at Glasgow and Durham.

Tuesday, August 14, 2007

Bad, bad, biomed....

A very flaky day - we had a biomed user throw jobs into the system which were trashing worker nodes by filling up /tmp. This caused lots of nodes to get into a weird state where they seemed to run out of memory (ssh and ntp nagios alarms firing). Jobs couldn't start properly on these nodes, so they became black holes: one of our local users lost 47 of 50 jobs, another lost 124 of 150.

We then started to fail SAM tests, drop out of the ATLAS BDII, Steve's tests then couldn't resource match us, and so on. Bad day.

It took quite a few hours to sort out the mess, and a further few hours to stabilise the site.

Our GGUS ticket names and shames.

It's a very different ball game on the grid - we have 6000+ users who can submit jobs, and it's not hard to kill a worker node. Torque generally handles this badly and all hell breaks loose.

Action plan:
  1. Nagios alarms on WN disk space (see the sketch below)
  2. Group quotas on the job scratch area (also sketched below)
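
For (1), the stock check_disk plugin run via NRPE on each WN is probably enough; for (2), group quotas need the scratch filesystem mounted with quota support. The thresholds, mount point and limits below are all assumptions:

  # Nagios check on each WN's scratch/tmp area (thresholds are guesses)
  check_disk -w 15% -c 5% -p /tmp

  # group quota on the scratch filesystem (limits in 1k blocks, illustrative only)
  setquota -g biomed 9000000 10000000 0 0 /tmp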

DPM Gridftp Resource Consumption

Durham was suffering from excessive resource consumption caused by "hung" dpm.gsiftp connections from ATLAS transfers. Because of the way that gridftp v1 servers work, huge buffers were being held in memory, leading to resource exhaustion on the machine and a subsequent crash.

Phil and I discussed this, and I noticed that the active network connections were to the RAL FTS server, not to the source SRM, so it looked like it was the control channel which was hung open, not the data channel.

Greig had a look on Glasgow's servers and discovered the same problem, but we were relatively unaffected due to the whopping 8GB of RAM we have in each disk server (and by having 9 disk servers, presumably). Cambridge also reported problems.

The issue is being looked at by the DPM developers, but for the moment Phil's had to write a cron script to kill off the hung ftps to keep gallow's head above water.
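
The cron job is essentially "kill any gridftp daemon that has been alive too long". Something along these lines would do it (a sketch, not Phil's actual script; the process name matched and the six-hour threshold are guesses):

  #!/bin/sh
  # kill gridftp daemon processes older than MAX_HOURS
  # (the /ftpd/ match is an assumption - check what the DPM gridftp daemon
  # is actually called on the disk servers before deploying)
  MAX_HOURS=6
  ps -eo pid,etime,comm | awk -v max="$MAX_HOURS" '
      $3 ~ /ftpd/ {
          n = split($2, t, /[-:]/)          # etime is [[dd-]hh:]mm:ss
          hours = 0
          if (n == 4) hours = t[1] * 24 + t[2]
          else if (n == 3) hours = t[1]
          if (hours >= max) print $1
      }' |
  while read pid; do
      logger -t kill-hung-ftp "killing hung gridftp process $pid"
      kill "$pid"
  done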

Maximum Queuable Jobs

From our two phenogrid DoS attacks, it seems that the maximum number of queued jobs the system can cope with is about 2500. Beyond this the system slides into crisis: it runs out of CPU with too many gatekeeper processes active, a context-switch storm starts, and it seems the system can rarely recover spontaneously.

So, I have set a max_queuable parameter of 1000 on every queue, which seems a reasonable number for any single VO or queue.
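
For the record, this is just a per-queue attribute set through qmgr; a loop like the one below (queue names taken from whatever "print server" reports) is a reasonable sketch of applying it everywhere:

  # cap queued jobs at 1000 on every existing queue
  for q in $(qmgr -c "print server" | awk '/^create queue/ {print $3}'); do
      qmgr -c "set queue $q max_queuable = 1000"
  done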

It seems a limitation of torque that there is no equivalent global cap on queued jobs (at 2500, for instance) - the parameter is only settable per queue.

Monday, August 13, 2007

MonAMI keeps on working...

Gosh, I'm tickled pink at how well MonAMI coped with Glasgow's job storm last Thursday.

To put this in context, we had a CE that was running at a constant 100% CPU usage: 50% in kernel (context-switching) and 50% in user-land (running Perl scripts). The ssh daemon wasn't working properly any more: the ssh client would (almost always) time out because the ssh server was taking too long to fork. The machine's 1-min load average peaked at ~300!

All in all, this was an unhappy computer.

Despite all this, MonAMI just kept on going. As matters got progressively worse, it took longer and longer to gather data, particularly from Maui. From the normal value of less than a second, the Maui acquisition time peaked at around 15 minutes. Torque fared better, peaking at around 30s (still far longer than normal).

Despite this, MonAMI didn't flood Torque or Maui. It only ever issued one request at a time and enforced a 1-minute gap between successive requests. MonAMI also altered its output to Ganglia to compensate for taking 15 times longer than normal. This prevented Ganglia from (mistakenly) purging the metrics.
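
The same idea in a minimal shell sketch (this is not MonAMI itself, just a toy illustration; the metric name, qstat column and dmax scaling are assumptions):

  #!/bin/bash
  # poll torque serially, never more than once a minute, and stretch the
  # ganglia dmax to cover slow polls so the metric isn't purged
  MIN_GAP=60
  while true; do
      start=$(date +%s)
      # column 6 ("Que") of qstat -Q, summed over all queues
      queued=$(qstat -Q | awk 'NR > 2 { sum += $6 } END { print sum + 0 }')
      elapsed=$(( $(date +%s) - start ))
      dmax=$(( (elapsed > MIN_GAP ? elapsed : MIN_GAP) * 4 ))
      gmetric --name=torque_queued_jobs --value="$queued" \
              --type=uint32 --units=jobs --dmax="$dmax"
      sleep $(( elapsed < MIN_GAP ? MIN_GAP - elapsed : 0 ))
  done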

So, although everything was running very, very slowly and it was difficult to log into the machine, the monitoring kept working and we have a record of what was happening to the jobs.

Incidentally, the failing ssh is why most (all?) of the jobs were going into wait-state: the worker node mom daemons couldn't stage in, via scp, the files the jobs needed. This meant the job was never accepted by the WN, causing torque (or maui?) to reschedule the job for some time in the future, putting it into wait-state.

Welcome Back! (sic): Picking Up The Pieces

Fool that I am, I opened my laptop after getting back from Gairloch on Saturday night. As I now have Paul's MonAMI torque plots on my Google homepage, I could see that the number of running jobs was down to almost zero. This was unexpected. A quick review of the SAM pages and monitoring plots showed the pheno job storm on Thursday had killed the CE off, big time. Being thoroughly disinclined to engage in extensive debugging late on Saturday night, and knowing the machine needed a new kernel anyway, I rebooted the CE. Whatever the residual problem was, this cleared it. Within minutes LHCb jobs were coming in and starting properly - indeed, within about 6 hours they had managed to fill the entire cluster again.

The remaining problem was then timeouts on the CE-RM test. This was puzzling, but not critical for the site, so I left things as they were at this point, pending further investigation. Then I recalled that this can happen if a period of intense DPM stress (Greig and Billy have been preparing the CHEP papers) causes threads in DPM to lock up or block, leaving few threads available to service requests. I restarted the DPM daemon and bingo! all was well again. The next time this happens we should look in MySQL for pending requests (which get cleared when DPM restarts); however, at 10pm on Sunday night getting things working quickly was all I cared about.
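
For next time, something along these lines against the DPM MySQL database should show whether requests are piling up - the database and table names (dpm_db, dpm_req) are from memory and the account is a placeholder, so check against the actual schema and DPMCONFIG first:

  # count DPM requests by status (names are assumptions - verify before trusting)
  mysql -u dpmmgr -p dpm_db -e "SELECT status, COUNT(*) FROM dpm_req GROUP BY status;"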

My car broke down yesterday too, so I'm off to Halfords to buy it some new brake pads; but at least that happened after I came back from holiday. If only the site had the good grace to do the same. I think I get to be grumpy about this for at least 2 days.

Thursday, August 09, 2007

Pheno goes bang, take two!

A problem started at about 02:45 this morning. The large number of pheno jobs that had accumulated in the queued state started to fail when run. Once failed, a job would go into waiting state, triggering maui to decide which job to run next.

With the current usage and fairshares, Maui's decision is to run the (apparently) broken pheno jobs. This keeps the server load high and starves the cluster of long-running jobs (there have been 1-min average load spikes of over 600!).

Look familiar? Here's an entry with very similar symptoms.

I'm in the process of trying to get to the bottom of what's actually happening, but I've started deleting the jobs, as they clearly cannot run and are having a detrimental effect on the cluster.
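
For reference, clearing them out is just a qselect/qdel pipeline; the pool account name below is made up:

  # delete every job belonging to the offending pool account
  qselect -u pheno001 | xargs --no-run-if-empty qdel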

Wednesday, August 01, 2007

supernemo.vo.eu-egee.org ok

Gianfranco had tried to run some supernemo test jobs through Glasgow last week and had not succeeded. In addition, I was concerned that some VO members (him included) were not in the grid-mapfile.

Of course, it turns out that Gianfranco and others are also members of other VOs - and as the grid-mapfile can only contain one entry per DN, they were mapped to their other VOs. In addition, Gianfranco had been using a vanilla proxy to submit the job; the gatekeeper then mapped him to ATLAS, and so the job submission to the supernemo queue failed.

I emailed him the correct VOMS client files for supernemo, so that he could generate a VOMS proxy. After he did that, his test jobs ran fine.
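
For anyone else hitting this, the fix is just a vomses entry plus a VOMS proxy; the server host, port and DN below are placeholders rather than the real supernemo VOMS service details:

  # e.g. ~/.glite/vomses/supernemo.vo.eu-egee.org (placeholder server details)
  "supernemo.vo.eu-egee.org" "voms.example.org" "15000" "/C=UK/O=eScience/OU=Example/CN=voms.example.org" "supernemo.vo.eu-egee.org"

  # then generate a VOMS proxy instead of a vanilla one
  voms-proxy-init --voms supernemo.vo.eu-egee.org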

So, our first DNS-style VO, supernemo.vo.eu-egee.org, works.