Friday, September 28, 2007

Biomed Stalled Jobs


Since we came back from the SL4 upgrade I have noticed a very large number of stalled biomed jobs on the cluster.

These were all jobs which had stalled running python ./get_task.py autodock AVIANFLUDC2_T02IAN3J1170 (or something very like it).

As the cluster hadn't actually been full, and I was very busy, I let this situation go for most of the week. However, today I emailed the user (using the CIC portal user look up). I got a very quick response: there is a known problem with an overloaded AMGA server, which was causing these stalls. I was given permission to kill the jobs, which I did.

Although it's a good thing (tm) to get in touch with users, following our stalled jobs guide, it is time consuming, and I wish there were some form of automation we could apply.
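The sort of automation I have in mind is something like the sketch below: walk the torque queue and flag jobs with lots of walltime but almost no CPU time. The thresholds (a day of walltime, under five minutes of CPU) are plucked out of the air, so treat this as an illustration rather than anything we actually run.

#!/bin/bash
# Sketch: list running jobs that have consumed lots of walltime but almost no CPU.
qstat -f | awk '
  /^Job Id:/                 { job=$3 }
  /Job_Owner/                { owner=$3 }
  /resources_used.cput/      { split($3,c,":"); cput=c[1]*3600+c[2]*60+c[3] }
  /resources_used.walltime/  {
      split($3,w,":"); wall=w[1]*3600+w[2]*60+w[3]
      if (wall > 86400 && cput < 300)
          printf "%s %s walltime=%ds cput=%ds\n", job, owner, wall, cput
  }'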

DPM Dies



Our DPM died last night (sad!). It seemed that / got full and this then caused DPM and MySQL to get into a punch-up where all the CPU on the machine was consumed.

Investigating (with help from Paul - thanks!), the culprit seems mostly to be an InnoDB "auto-extending data file" called ibdata1, which has now reached 2.1GB in size.

There is some advice about how to configure InnoDB to control these sizes, but as the default MySQL install on SL3 ships with no my.cnf file at all, we'll have to create a sensible one before we can customise anything.
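The sort of thing I have in mind could be dropped in via cfengine or by hand - a sketch only: the 2000M cap is arbitrary, and note that if you apply innodb_data_file_path to an existing database the declared size has to match the current ibdata1.

cat > /etc/my.cnf <<'EOF'
[mysqld]
datadir = /var/lib/mysql
# Cap the auto-extending InnoDB data file rather than letting it eat /
innodb_data_file_path = ibdata1:10M:autoextend:max:2000M
EOF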

However, after some further investigation, it's now clear that 2.1GB is in fact the size of our DPM database (the gzipped database dumps are now 800MB!). That's with 14TB of data on the SE; scaling up to 100TB the DB will be > 10GB. Having looked at the tables, the obvious candidates to trim are dpm_put_filereq, dpm_get_filereq and dpm_req. These seem to contain historical request data which, without timestamps, is pretty useless. They weigh in at about 233MB, 460MB and 295MB respectively - roughly half the total DPM database size. [1]

The recovery strategy has been to move /var/lib/mysql and /var/log off root onto a larger partition (in our case /disk), with soft links left at the original locations pointing to the new ones.
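For reference, the recovery boiled down to something like the following (a sketch - stop everything that writes to these areas first, and adjust the target partition to taste):

service mysqld stop
mv /var/lib/mysql /disk/mysql
mv /var/log /disk/log
ln -s /disk/mysql /var/lib/mysql
ln -s /disk/log /var/log
service syslog restart
service mysqld start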

I shall put in a ticket to the developers about trimming these tables when the data has aged into uselessness. [2]

The warning for other sites is that /var/lib/mysql really needs to live in a relatively large disk partition.

We'll address this problem properly when we upgrade to the gLite 3.1 version of DPM. In the meantime we urgently need to alarm on disk space usage on all the servers.
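Even a dumb cron job would be better than nothing until a proper disk check is wired into nagios - something along these lines (the threshold and recipient are placeholders):

#!/bin/bash
# Crude disk space alarm: mail a warning if any local filesystem is over 90% full.
FULL=$(df -hlP | awk 'NR>1 { gsub("%","",$5); if ($5+0 > 90) print $6, $5"%" }')
if [ -n "$FULL" ]; then
    echo "$FULL" | mail -s "Disk space warning on $(hostname)" root@localhost
fi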

[1] Try:
mysql> use dpm_db; show table status like "dpm_%";

(Thanks Paul.)

[2] https://gus.fzk.de/ws/ticket_info.php?ticket=27385

Wednesday, September 26, 2007

Shaking down the user issues

The upgrade to gLite 3.1 on the UI has brought a couple of surprises, which we're gradually working our way around.

* First, the version of grid-proxy-init supplied as part of VDT seems to behave rather differently. A proxy initialised with this looks like:
svr020:~$ grid-proxy-info
subject : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=graeme stewart/CN=892101086
issuer : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=graeme stewart
identity : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=graeme stewart
type : Proxy draft (pre-RFC) compliant impersonation proxy
strength : 512 bits
path : /tmp/x509up_u218012
timeleft : 11:59:57
And the lcg-RB does not like this proxy at all.

Using voms-proxy-init (without VOMS extensions) gives a rather more normal proxy:
subject  : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=graeme stewart/CN=proxy
issuer : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=graeme stewart
identity : /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=graeme stewart
type : full legacy globus proxy
strength : 512 bits
path : /tmp/x509up_u218012
timeleft : 11:59:58
This hits ganga particularly hard, as it renews proxies for you but uses grid-proxy-init to do so.

The fix for this was to ensure that /opt/glite/bin appears higher in the path than /opt/globus/bin and then to create a soft link from /opt/glite/bin/voms-proxy-init to grid-proxy-init.
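In shell terms that amounts to something like this (a sketch of the idea rather than the exact commands we ran):

# make sure the gLite tools win over the Globus/VDT ones
export PATH=/opt/glite/bin:$PATH
# and make grid-proxy-init an alias for voms-proxy-init
ln -s /opt/glite/bin/voms-proxy-init /opt/glite/bin/grid-proxy-init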

* Second problem was with the UKQCD client software. Craig reported he was getting errors because of a missing Globus RLS library (libglobus_rls_client_gcc32dbgpthr.so.0). gLite 3.1 is compiled against VDT 1.6 (as opposed to VDT 1.2) and this library is no longer built. The first attempt at fixing it was to copy across the missing library into /opt/globus/lib. This failed, though, because it became clear that the 1.2 and 1.6 VDT libraries are not compatible with one another, so the new libraries were missing symbols the QCD software needed. So, in the end, the old libraries were copied, lock, stock and barrel, into /opt/globus-glite30/lib and, with a suitable LD_LIBRARY_PATH, the QCD application would run.
There is a new version of QCDGrid software being built, so hopefully that will be compatible with the new VDT.
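For the record the workaround looks roughly like this (a sketch - the source path for the old VDT 1.2 libraries is made up):

# park the old VDT 1.2 Globus libraries alongside, not on top of, the new ones
mkdir -p /opt/globus-glite30/lib
cp -a /backup/old-ui/opt/globus/lib/* /opt/globus-glite30/lib/
# then, in the QCD user's environment, prefer the old libraries
export LD_LIBRARY_PATH=/opt/globus-glite30/lib:$LD_LIBRARY_PATH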

* Finally (or at least the last issue to have come to light), Dan is having trouble with g++. The default gcc-c++ (3.4.1) doesn't play well with his version of NLOJET++. He's going to try again with the old 3.2.3 g++ (a.k.a. g++32). Hopefully this will fix things.

RB: "Rather Better"

One thing which went to pot during the upgrade was the way that the higher-UIDed pool accounts cascaded through to the RB. This, unfortunately, meant that any jobs which were running on svr023 were lost (there would have been very few, in fact, which is why we spent more effort on the UI).

However, in our attempts to get the RB back it became clear that the edg-wl-ftpd service (yet another hacked version of GT2 gridftp) cannot handle UIDs > 16bit. This screwed things up for us, as all our new UIDs are in the range 200000+.

In the end I had to re-hack the perl passwd/group/shadow/users.conf generator, lowering all the UIDs specially for the RB. In fact this was not quite as awful as one might think, as the RB supports only a subset of the VOs we run jobs for on the main site. I also scripted up a "generator" for the RB's site-info.def, which strips down the VOS variable to those we support for job submission. In addition, communication between the RB and the UI or the CE is of course mediated by certificate, so having a different pool account or UID on the RB is not a problem.
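The generator is nothing clever - in essence it does something like this (a sketch, with an illustrative VO list and paths):

#!/bin/bash
# Produce an RB-specific site-info.def from the master copy, keeping only
# the VOs we allow to submit through the RB (list shown is illustrative).
RB_VOS="atlas dteam ops"
sed "s/^VOS=.*/VOS=\"$RB_VOS\"/" /opt/glite/yaim/etc/site-info.def \
    > /opt/glite/yaim/etc/site-info-rb.def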

There was a supporting tweak to cfengine to take passwd-rb (etc.) as the source passwd file for the RB.

Then the RB was blown away and rebuilt. It seems to have done it rather a lot of good, as now Steve Lloyd's dteam test jobs run properly (see his RB test page).

Tuesday, September 25, 2007

SL4 x86_64 UI now Available

I reinstalled the site's svr020 UI on Saturday. This involved an incredible amount of pain related to the bizarre inability of SL4 to properly install GRUB on a linux software RAID partition. Although the machine would install absolutely fine, on reboot it would just halt after the GRUB prompt.

In the end, after tearing my hair out several times (I was working from home on Friday) and trying as many tricks as I could (I even DBANed the disks), I had to retreat from running software RAID1 and fall back to running on just one of the SCSI disks.

That finally gave me a base SL4 install I could work with.

After that, the installation of the SL4 32bit UI was easy - running through cfengine (one little caveat was that the gsisshd restart would kill off the normal sshd on port 22, so that has been disabled).

Then I found that job submission didn't work, because it relies on a 32bit python/C module and the default python is 64bit now. The advice on ROLLOUT was to have a 32bit python higher in the path than /usr/bin/python. This seemed rather bad advice to me, as we'd really like to have 64 bit python - it is a 64 bit system after all! So, instead I decided to change the magic bang path to specifically reference /usr/bin/python32. Initially I tried to use cfengine's editfiles facility to do this. However, anything which is not a completely trivial modification is rather horrendous to do in cfengine (it reminded me of ed, actually), so I eventually abandoned this and instead wrote a 3-line perl special in the cfengine script sources, which is called after the RPMs are installed. (In addition to changing the python interpreter it disables the tk graphical interface, for which we don't have any users anyway.)
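A sed equivalent of the bang path rewrite would look something like this (a sketch - here the affected scripts are found by grepping their first lines rather than listing them explicitly):

# point any python-based gLite tools at the 32bit interpreter
for f in /opt/glite/bin/*; do
    head -1 "$f" | grep -q python && \
        sed -i '1s|^#!.*python.*|#!/usr/bin/python32|' "$f"
done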

Finally, I upgraded ganga, and this went fine - ganga runs quite happily with 64 bit python (normally this wouldn't deserve special note, but in the grid world flowers and champagne are in order).

Batch System Goes on Holiday?

When I started to fiddle with the UI and RB on Saturday night, I discovered that the site was failing SAM tests, with the, as usual, marvellously descriptive error "Unspecified gridmanager error".

Further investigation showed that torque and maui servers were not running. When I restarted them the site recovered immediately. The very curious thing was, though, that torque logfile entries were still being written - so there was some part of torque running, but not enough to accept new jobs.

We need a nagios alarm on this. Paul tells me that there is a torque.available metric in the MonAMI sensor, so we should be able to passively monitor this - the MonAMI graph of that metric shows the dropout on Saturday afternoon.
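Until that's in place, even a dumb active check is better than nothing - something along these lines (a sketch; return codes follow the usual nagios plugin convention):

#!/bin/bash
# Minimal nagios-style check: is pbs_server actually answering queries,
# and is the maui scheduler process alive?
if ! qstat -B > /dev/null 2>&1; then
    echo "CRITICAL: pbs_server not responding to qstat -B"
    exit 2
fi
if ! pgrep -x maui > /dev/null; then
    echo "CRITICAL: maui not running"
    exit 2
fi
echo "OK: torque and maui look healthy"
exit 0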

Thursday, September 20, 2007

SL4/5 All bets are ON!

In the wake of Glasgow's upgrade to SL4, Andrew and I were quipping about when we would go to SL5. I jokingly said next year, though I was really thinking more like 18 months.

However, after some discussion, Dr Paul Millar contends that more than 50% of UKI sites will still be running SL4 on the stroke of midnight, 1st January 2010.

Dr Millar - I take that bet. I think that more than 50% will be running something more recent than SL4 on that date.

Further, Paul thinks that there will be at least one UKI site running SL4 in January 2013 (after the end of lifetime for RHEL4). I don't.

In each case the wager is a bottle of Veuve Clicquot Yellow Label.

ATLAS jobs running properly

Chris submitted a sample ATLAS job to the cluster, using ppeui and the RAL RB.

It worked!

So in addition to passing the tests, we can also do real work. Passing tests and doing real work are related, but not exactly the same thing (tests being a necessary, but not sufficient, condition for doing real work in general), so I am very pleased.

He'll now throw in 500 jobs, so we look forward to having the workload ramp up.

Glasgow Upgraded to SL4

The upgrade is done! We started passing ops SAM tests at about 2230 last night, and I brought us out of downtime at 2300. That was 12 hours of total downtime. In addition the queues were closed from about 1600 the day before, so that meant we were unavailable for 31 hours. In the grand scheme of things I think, "not bad," for such a major upgrade.

Preparations for the upgrade were rushed, but certainly thorough enough for us to have a fair degree of confidence in the process. By Tuesday night I was able to reboot, rebuild and run jobs through a worker node successfully. Andrew was close to having the new pool account generator done, even if he had wimped out and used perl.

We had decided the plan was to upgrade the worker nodes and bring us out of downtime ASAP, then work on the UI and other less central services.

Here's my synopsis of what went wrong, or didn't behave quite as we expected:
  1. We initially tried to reboot the worker nodes in batches of 30. This overloaded dhcp or tftp on svr031, so in fact only 4 nodes were successful in that batch. Subsequently we did batches of 12, which worked fine. We could also put a larger stagger on the powernode reboot script (we had only used 1s).
    Analysis: It was always going to be hard to know what level we could do this until we tried. It was easy to work around. Probably our rebuild time for the whole cluster is ~2-3 hours because of this node throughput limitation.
  2. At the last minute I decided to just drop alice and babar to stop us from supporting VOs who just don't, or can't, use us (it's just clutter). However, that change was imperfectly expressed in site-info.def, so on the first batches of nodes YAIM just didn't run.
    Analysis: This was a mistake. Andrew and I should have co-ordinated better and had more time to review the new user information files.
  3. There were a few problems with the user information files: sgm and prd accounts weren't initially in the normal VO group. In addition local Glasgow users were in the wrong group. This was fixed pretty rapidly.
    Analysis: As above. This aspect of the preparation was too close to the critical path - and it didn't work first time.
  4. The new server certificates were botched initially. Although we were in downtime and it was relatively easy to correct, it was a distraction. Analysis: We need to document local procedures for certificate handling better.
  5. We'd been obsessing about the batch worker configurations, with the intention to leave the servers pretty much alone. However, we hadn't twigged that the change to pooled accounts for sgm and prd users would, of course, require the LCMAPS group and grid mapfiles to be updated. As no one on site is an sgm or a prd user this was not picked up during testing. It only came to light once I did a logfile analysis of why ops tests were failing (these are done as an sgm ops user). Later in the evening it became clear that this also had to be done for the DPM disk servers.
    Analysis: If I'd been sharper I would have realised this in advance (but there was a lot on my mind). It would be useful if one of us had a special role to do this testing (the gridpp VO would be ideal). However, it would actually have been a terribly hard thing to test, as the site was "live" during the testing phase and this problem's solution implied reconfiguring the CE as well as the pool accounts. Hopefully writing it down here will make us more cognisant of this next time!
  6. Running YAIM automatically is all well and good, but how do we know it's run successfully? We not only had nodes where YAIM just hadn't run, we also (and this was the last problem to be fixed) had two bad nodes where the directories in /opt ended up in mode 0700, so were unreadable.
    Analysis: We need to develop a test and alarm system which attempts to validate the YAIM run. At the moment we're pretty much flying blind. (A rough sketch of such a check appears after this list.) The two proxies which I ended up using yesterday were:
    1. Look for files generated by YAIM, e.g., /opt/glite/etc/profile.d/grid-env.sh. There should be a nagios alarm or a cfengine warning if this file is absent.
    2. Check permissions on directories such as /opt/glite/etc. If this is not readable to a pool account then something has gone wrong.
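Here's the sort of sanity check I mean, based on the two proxies above (a sketch only - the directory list is illustrative):

#!/bin/bash
# Rough post-YAIM check: did YAIM leave its expected files behind, and are
# the grid software directories actually readable by the pool accounts?
STATUS=0
if [ ! -f /opt/glite/etc/profile.d/grid-env.sh ]; then
    echo "WARNING: grid-env.sh missing - has YAIM run on this node?"
    STATUS=1
fi
for d in /opt/glite/etc /opt/lcg /opt/edg; do
    [ -d $d ] || continue
    perms=$(stat -c %a $d)
    case $perms in
        755|775) ;;
        *) echo "WARNING: $d has mode $perms"; STATUS=1 ;;
    esac
done
exit $STATUS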
Summarising, I think a pretty good job was done yesterday. It was a major upgrade and our first significant downtime since last November. If we can keep these sorts of interventions down to the 1-2 day level then the site will continue to be considered a good one.

However, we're working as a team now, rather than me playing Lone Ranger. This makes co-ordination, documentation and testing even more vital. Once Mike comes properly on board his first major task will be to understand and then document how the cluster is run.

Wednesday, September 19, 2007

Yarr! That's not line noise, it's Perl (me hearties)

Avast Ye Salty Sea-dogs - It be the Glasgow upgrade day today (as well as International Talk Like a Pirate Day). Graeme gave me a minor task to do - come up with a new userid/groupid/passwd/shadow/yaim config generator script for the pool accounts. Simple enough, should only take half a day or so....

Much swearing at Python later, I gave up at 1AM this morning and resorted to trusty Perl. Done in 48 lines (including comments) and 2 hours. I think Python and I are going to take a looooong time to get acquainted properly.

I'm sure G will blog in more detail, but the worker nodes went fairly smoothly, with a few niggles - it seems that about 24 simultaneous installs cause tftp timeouts. We also discovered the Sandbox dirs on the RB needed their ownerships changing. Again another Perl script to the rescue, using the trick I learned from Steve Andrews: just get your script to print the command lines you'd like to run to stdout, then, once you've checked the output looks reasonable, run it again piped through | /bin/sh.
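In plain shell the same trick looks something like this (a sketch - the input file mapping directories to their new owners is invented for illustration):

#!/bin/bash
# pass 1: only print the chown commands, don't run anything
while read dir owner; do
    echo "chown -R $owner: $dir"
done < sandbox-owners.txt
# pass 2: once the printed commands look sane, run it again piped through | /bin/sh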

Friday, September 14, 2007

Glasgow upgrade to SL4 x86_64 next week

Issues have been gathered and a work plan is in place. The intention is still to upgrade next week, hopefully starting (and finishing) on Wednesday.

ECDF for Beginners

Basically, this is proving far more painful than anticipated. Although the MON/LFC box has been configured, the CE is proving seriously problematic. The ECDF team thought that SL3 was not a winner for GPFS, so Sam tried using the gLite 3.0 CE on top of SL4. This didn't work (not unexpectedly). Although we know that the lcg-CE has been built for gLite 3.1, it's not yet even been released to pre-production, so clearly there's nothing we can use for a production site. So Ewan reinstalled the CE with SL3, in order to install the old gLite 3.0 version. However, it then proved to be very difficult to get GPFS working on SL3, so this is still a work in progress. How long it will take to resolve is anyone's guess.

GPFS is necessary for the software area and the pool account home directories. At this point I would just buy a 500GB disk from PC world and run with that for a month while we wait for the gLite 3.1 CE, but we can't do that with machines other people are running.

Getting the site certified for the end of the month now looks challenging.

Hmmm....

Durham News

Phil's procured a 15TB disk server for their DPM. This is now ready to go and we should get it configured and added to the Durham SRM next week.

WLCG Workshop

There's a lot on the agenda page, and Jamie's CHEP summary talk is a useful round up, but here are my highlights:
  • GDB consider services necessary for WLCG operations to basically have been delivered, but not fully deployed (far less extensively tested and battle hardened). My own feeling is that in the integration and mutual interaction (e.g., VOMS and SRM v2.2) of all these services with WLCG production we've still a long way to go.
  • In their pre-acceptance tests, the gLite-CE and the CREAM CE did about as well as one another. Given that CREAM has a simpler internal architecture and is more standards compliant it has been chosen over the gLite-CE (which is "no longer being developed"). It's anticipated that CREAM will be first delivered to production in the New Year, but that in total it's anticipated to take about a year from now before it's fully hardened, sites have experience in running it and the YAIM configuration is fully working - oh! just around data taking time ;-) However... this seems rather sensible to me to have settled on the one CE, rather than the unsatisfactory situation of having 2 on the go. Pity the poor folk who put effort into actually running gLite CE as a service!
  • Operations: We still have trouble sharing knowledge. 2007 was the year of the grid blog, but only a fool would pretend that will solve all our problems. I proposed a wiki plugin, where articles less than a month old would be black on white, with the text gradually fading to paler shades of grey the longer the article was unrevised. When it reaches white the article is expired!
  • SAM is the test framework for everyone. Experiments should publish their tests into SAM so that it's a one-stop-shop for sites' status.
  • Monitoring: We look forward to a proper demo of the SAM/nagios framework at EGEE.
  • SRM v2.2: Confusion still abounds. What spaces do ATLAS want at T2s? Not even they know yet...
  • Dress Rehearsals: Lots going on for every experiment. See their presentations for more details.
  • Common Computing Readiness Challenge (CCRC): There's an urgent need to test the whole of the Accelerator->T0->T1->T2 chain for all experiments before real data flows. This will probably happen in February (limited, because not everything will be ready) and in May (really has to work!). I expect this to be quite a big deal for everyone, especially the May round.
  • ATLAS Session: Good information on Production Dashboard, DDM dashboard and operations (with cool http queries).

The Midnight Blogger

Well, not quite midnight, but there's been so little time to write the blog this week it seems I have to do it late on a Friday night.

Are you sitting comfortably? Then I'll (grab a beer and) begin...

Wednesday, September 12, 2007

Nagios Acknowledgements

We've been successfully using Nagios as one component of our site monitoring. The email and Jabber notifications are great, but when someone acks a problem we have to look at the webpage to see who did it and what comments they added.

One minor tweak to commands.cfg soon fixed that, thanks to the macros $SERVICEACKAUTHOR$, $HOSTACKAUTHOR$, $SERVICEACKCOMMENTS$ and $HOSTACKCOMMENTS$, which are described in more detail in the documentation.
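The gist of the tweak is just to fold those macros into the notification command's message body. The command_line in commands.cfg ends up looking something like this (a sketch based on the stock notify-by-email command rather than our exact configuration; the $...$ tokens are expanded by Nagios before the shell sees them):

/usr/bin/printf "%b" "Service: $SERVICEDESC$\nHost: $HOSTNAME$\nState: $SERVICESTATE$\nInfo: $SERVICEOUTPUT$\n\nAcknowledged by: $SERVICEACKAUTHOR$\nComment: $SERVICEACKCOMMENTS$\n" | /usr/bin/mail -s "$NOTIFICATIONTYPE$: $SERVICEDESC$ on $HOSTNAME$ is $SERVICESTATE$" $CONTACTEMAIL$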

I'll post the snippets onto the HepSysMan wiki soon...

Maximum Queuable Jobs Bites Back

One of our local ATLAS users wanted to submit 2000 jobs onto the system, which I thought would be ok. Unfortunately he hit the 1000 max_queuable limit, and started having jobs fail. Worse, other ATLAS jobs could also not be queued and we failed quite a few of Steve's tests.

Another surprise was that max_queuable seems to apply to running+queued jobs, not just those waiting in the queue.

Reconsidering the issue I have decided to set the max_user_queuable parameter to 1000 on each queue instead.

This will prevent a single user from DOSing their entire VO, and it should prevent accidents from taking out the CE.
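For reference, the change is a one-liner per queue with qmgr (illustrated here with a queue called "atlas", and assuming we do indeed drop the old queue-wide cap):

qmgr -c "set queue atlas max_user_queuable = 1000"
qmgr -c "unset queue atlas max_queuable"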

Tuesday, September 04, 2007

Glasgow goes dark

Glasgow computing service lost power to the main campus routers this morning - and the UPS didn't work. So Glasgow was down completely for 3 or so hours - failing everything. Durham and Edinburgh were failing RM tests because the Glasgow BDII could not be contacted.

Single points of failure, eh! It's long been a Glasgow complaint that only one BDII can be specified, with no fail over.

Sunday, September 02, 2007

Durham Problems

Durham had a 2 day downtime last week as their network was down, with their central services suffering a significant outage. A knock-on effect from this seemed to be DNS issues, which affected the site at the end of the week in a way that was never fully explained to me.

However, by the weekend the site was back up and running, so hopefully everything's ok now!

The Quiet Before CHEP...

Blog posts have been a bit thin on the ground this week, both because of GridPP 19 and the sense that CHEP was looming and the poster had to get done and the paper had to get written.

Well, after a few delays and some severe jet lag I did arrive and with a poster too. If you want a quick preview I put the poster's pdf on the web.

Weather is good, but my phone can't find a network, so don't try calling!

Ambling in Ambleside

GridPP 19 took place last week in Ambleside. The Lake District looked beautiful, and although I couldn't stay for the Thursday afternoon ramble, I managed to walk on Loughrigg Fell on Wednesday morning and a quick hike up Orrest Head before getting the train back to Glasgow.

Helped, I think, by the pleasant environment, the meeting was very relaxed but also very useful, with good discussion sessions and lots of helpful contributions.

My ScotGrid talk was a report on ScotGrid and a slightly humorous review of the life of ScotGrid.