Wednesday, March 28, 2007

ScotGrid meets AstroGrid

I was invited to talk about data movement in GridPP/WLCG to the AstroGrid people today. It was really informal and a chance for GridPP and AstroGrid to learn a bit more about one another's problem domains and look at the solutions that each project has adopted.

AstroGrid have a nice virtual filesystem implementation called VOSpace, which allows astronomers to interact with IVOA databases and resources. They now want to extend the concept of these personal virtual spaces into the astronomy data area. So, a database server at the ROE could host a database for an instrument, with tables for different data, and an astronomer would be able to run queries and store the results in a virtual space within the ROE database (roe/user/graeme/myqueries/hotstars). These results could then be copied out, e.g., into a VOSpace area.

The cunning bit: can you combine query results between databases, i.e., do a join between a database view at ROE and one at Cambridge? This requires a (possibly large) amount of data to be shipped between the two databases.

The question they wanted to ask me was, did I know an efficient way of transferring the data between the databases?

Now, this is quite a different problem from the LCG one. So, instead of saying, "buggered if I know" (with my facetious streak, it was tempting ;-), I described the sorts of data flows associated with LCG, the software components we use and the achievements and limitations of our solutions: FTS is great - it ships PB of data around very reliably 24x7, balances VO and site requirements, etc.; FTS is rubbish - it needs Oracle and only talks to SRMs.

Actually, it became quite an interesting and wide-ranging chat about grids - different methods of working and how to convince users to use the grid. They were quite heartened to see that in HEP the VOs really do now overwhelmingly use the grid for their activities.

In the end it seems they actually want to start this quite small - get the virtualised results spaces working first, and then probably tackle the bulk data shipping later. They were definitely interested in looking at an implementation of VOSpace which used an SRM as a backend store (at the moment they haven't really gone beyond the constraints of a single RAID array). And when they do come to look at bulk data movement, then they will look at FTS and RFT as possible methods for doing it.

Nice to make contact with other communities. They get a good view from the top of Blackford Hill...

Tuesday, March 27, 2007

ATLAS DPM Pool at Glasgow Set "World Writable"

When I was poking around looking for lost DPM files last week I found a problem with our large dedicated ATLAS DPM pool. The problem is that the pool is restricted to one GID only, but different VOMS roles are mapped to different GIDs by DPM. So in fact we have:
mysql> select * from Cns_groupinfo where groupname like 'atlas%';
+-------+------+-----------------------+
| rowid | gid  | groupname             |
+-------+------+-----------------------+
|     2 |  103 | atlas                 |
|    16 |  117 | atlas/Role=lcgadmin   |
|    17 |  118 | atlas/Role=production |
+-------+------+-----------------------+
3 rows in set (0.00 sec)



So only "ordinary" ATLAS users were getting access to the big pool. ATLAS production was being relegated to the general pool! No wonder it had filled up so much:
POOL atlas
CAPACITY 27.22T FREE 23.04T ( 84.6%)
POOL generalPool
CAPACITY 6.81T FREE 2.68T ( 39.4%)
As the ATLAS data ended up split between regular and production use, I decided to make the pool writable by all groups. When DPM 1.6.4 is released (which should allow more than one group per pool) we will want to switch the pool to being writable by all of our ATLAS groups. This clearly leaves a lot of data misplaced, so I have made a feature request to enhance dpm-drain to try and help sort the mess out.

Rare DPM srmPutDone Failures at Glasgow

Just a placeholder really. We failed a SAM RM test this morning, and for the first time I can recall it was a DPM failure, not an information system problem.

The error message was Setting SRM transfer to 'done' failed: Unregistering alias from catalog.+ result=1.

Looking in the SRM logs, it was clear that something had gone badly wrong:
03/27 06:04:25 15848,0 put: request by /C=CH/O=CERN/OU=GRID/CN=Piotr Nyczyk 6217 from node139.beowulf.cluster
03/27 06:04:25 15848,0 put: SRM98 - put 389251 389251
03/27 06:04:25 15848,0 put: SRM98 - put 0 srm://svr018.gla.scotgrid.ac.uk/dpm/gla.scotgrid.ac.uk/home/ops/generated/2007-03-27/file92308f95-2e1e-43ba-b8bf-da8f1c7c416e
03/27 06:04:27 15848,0 getRequestStatus: request by /C=CH/O=CERN/OU=GRID/CN=Piotr Nyczyk 6217 from node139.beowulf.cluster
03/27 06:04:27 15848,0 getRequestStatus: SRM98 - getRequestStatus 389251
03/27 06:04:27 15848,0 getRequestStatus: returns 0
03/27 06:04:28 15848,0 setFileStatus: request by /C=CH/O=CERN/OU=GRID/CN=Piotr Nyczyk 6217 from node139.beowulf.cluster
03/27 06:04:28 15848,0 setFileStatus: SRM98 - setFileStatus 389251 0 Running
03/27 06:04:28 15848,0 setFileStatus: returns 0
03/27 06:04:29 15848,0 setFileStatus: request by /C=CH/O=CERN/OU=GRID/CN=Piotr Nyczyk 6217 from node139.beowulf.cluster
03/27 06:04:33 15848,1 setFileStatus: SRM98 - setFileStatus 389251 0 Done
03/27 06:04:33 15848,1 setFileStatus: dpm_putdone failed: Internal error
03/27 06:04:33 15848,1 setFileStatus: returns 12
03/27 06:04:33 15848,1 srmv1: SRM02 - soap_serve error : Internal error
So this is a bit worrying.

I looked back through the DPM logs and this is not the first time this has happened. The signature is the dpm_putdone failed: Internal error. There have been 10 of these in the last 90 days:

svr018:/var/log/srmv1# zgrep "dpm_putdone failed: Internal" *
log:03/27 06:04:33 15848,1 setFileStatus: dpm_putdone failed: Internal error
log.1:03/26 20:09:59 15848,0 setFileStatus: dpm_putdone failed: Internal error
log.20.gz:03/08 01:00:55 3442,0 setFileStatus: dpm_putdone failed: Internal error
log.21.gz:03/06 15:53:49 3442,1 setFileStatus: dpm_putdone failed: Internal error
log.21.gz:03/06 15:53:49 3442,5 setFileStatus: dpm_putdone failed: Internal error
log.21.gz:03/07 00:08:22 3442,2 setFileStatus: dpm_putdone failed: Internal error
log.53.gz:02/02 09:50:57 3442,0 setFileStatus: dpm_putdone failed: Internal error
log.59.gz:01/27 07:14:37 3442,0 setFileStatus: dpm_putdone failed: Internal error
log.62.gz:01/24 15:03:46 3442,0 setFileStatus: dpm_putdone failed: Internal error
log.83.gz:01/03 06:35:18 3442,1 setFileStatus: dpm_putdone failed: Internal error
The return code is always 12.

I shall raise this with the DPM people...

OK, done. See GGUS #20178. Just to put this in perspective, we've had ~500000 successful srmPutDone calls, so the failure rate is 2x10^-5.

Finalising New VOs on DPM

In the course of trying to debug some proxy problems for Hannah Cumming from Total, I discovered that new VOs were not properly enabled on our DPM.

A bit more investigation revealed another gridmap file which is used by the DPM and the LFC, /opt/lcg/etc/lcgdm-gridmap. It is also necessary to run the YAIM config_mkgridmap function on DPM nodes. I have amended the wiki instructions appropriately.
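
For the record, the single YAIM function can be run on its own rather than re-running the whole node configuration. A minimal sketch, assuming the standard gLite 3.0 YAIM layout (the run_function invocation and the file locations are from memory, so check them against your install):

# Run just the gridmap function on the DPM head node
/opt/glite/yaim/scripts/run_function /opt/glite/yaim/etc/site-info.def config_mkgridmap
# Rough check that the new VO made it into the DPM/LFC gridmap file
grep -c totalep /opt/lcg/etc/lcgdm-gridmap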

I wasted quite a lot of time on this yesterday, because I was worried that there was some issue with pool account names longer than 8 characters. I'm very glad this turns out not to be the case, as it would have been a real pain to redo the new VOs with short pool account names.

A very strange side effect is that lcg-cr now seems to be broken for me - and only when run on svr017. This is the error I get:


svr017:~$ lcg-cr -n 1 --vo dteam -v file://etc/group -d svr018.gla.scotgrid.ac.uk
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Source URL: file://etc/group
File size: 523
VO name: dteam
Destination specified: svr018.gla.scotgrid.ac.uk
Destination URL for copy: gsiftp://disk036.gla.scotgrid.ac.uk/disk036.gla.scotgrid.ac.uk:/gridstore0/dteam/2007-03-27/file0d9b09fa-9efd-4ee8-8c3e-b504f175f231.389985.0
# streams: 1
# set timeout to 0 seconds
Alias registered in Catalog: lfn:/grid/dteam/generated/2007-03-27/file-54f896cf-4e4c-408a-b23e-40cb9196256b
the server sent an error response: 550 550 disk036.gla.scotgrid.ac.uk:/gridstore0/dteam/2007-03-27: Permission denied.

Odd:

  • I have compared my environment with Andrew's and there's no meaningful difference - yet he can lcg-cr fine from svr017.
  • It works when I do the lcg-cr from ppeui.
  • rfio, globus-url-copy and srmcp all work perfectly (for both gridpp and dteam VOs).
  • I have tried using different numbers of streams in lcg-cr, but that makes no difference at all.

I'm relegating this to the curiosity pile...

Monday, March 26, 2007

Gatekeeper Troubles at Durham

Durham went on the blink at about 1am today - suddenly failing JL. The error message was the usual erudite globus effort: Got a job held event, reason: Globus error 3: an I/O operation failed.

Well, it looked straightforward enough - it's an I/O error, right? I found a lot of hints on Google that this was caused by errors in transferring the sandbox. So check gridftp, home directory quotas, etc. Mark and I spent lots and lots of time on this, checking different things, becoming more and more confused (ok, so gridftp of a file works, can I make a directory using edg-gridftp-mkdir? have we restarted the gatekeeper properly? what's bound to ports 2811 and 2119? etc., etc.).

In the end we just could not fathom what had gone wrong, so I suggested to Mark that he email LCG-ROLLOUT and TB-SUPPORT.

Maarten Litmaath pointed us to a GOC Wiki article which also said that this I/O error could occur when the CE was short of memory. I found the culprit code in the l_check_memory function in Helper.pm - it produces a failure if the free memory (swap + physical) on the CE is less than 20% of the total. However, this error is not passed up the stack properly (in fact the code in queue_submit() returns undef), so an entirely misleading error is passed back, which wasted hours of our time. Grrrr.
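
To make the failure mode concrete, here is a rough shell rendering of what that check amounts to (a sketch only - the real test is Perl inside Helper.pm, and the 20% threshold is the one detail taken from it):

# Refuse submission if free memory (physical + swap) is under 20% of the total
free_kb=$(awk '/^(MemFree|SwapFree):/ {sum+=$2} END {print sum}' /proc/meminfo)
total_kb=$(awk '/^(MemTotal|SwapTotal):/ {sum+=$2} END {print sum}' /proc/meminfo)
if [ $((free_kb * 100 / total_kb)) -lt 20 ]; then
    echo "CE short of memory: this is what surfaces as 'Globus error 3'" >&2
fi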

I was reminded of Alice in Wonderland...

'When I use an error message,' Humpty Dumpty said, in rather a scornful tone, 'it means just what I choose it to mean -- neither more nor less.'

'The question is,' said Alice, 'whether you can make error messages mean so many different things.'

'The question is,' said Humpty Dumpty, 'which is to be master -- that's all.'

Alice was too much puzzled to say anything; so after a minute Humpty Dumpty began again. 'They've a temper, some of them -- particularly Globus errors: they're the proudest - batch system errors you can do anything with, but not Globus errors - however, I can manage the whole lot of them! Impenetrability! That's what I say!'

'Would you tell me please,' said Alice, 'what that means?'


I have submitted a bug report - these things won't improve unless they are complained about: https://savannah.cern.ch/bugs/index.php?25048.

Friday, March 23, 2007

Local Top Level BDII now used at Glasgow

Somewhat related to the last post, I switched Glasgow to use our local top level BDII instead of the RAL one on Wednesday.

In fact, RAL now have 3 top level BDIIs, and have been much less problematic recently - so perhaps this isn't really necessary any more? However, after 2 weeks we'll count up the number of BDII failures at Glasgow and compare them to Edinburgh and Durham.
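
The switch itself is just a matter of repointing the information system setting that the data management clients read (done site-wide via the BDII_HOST variable in site-info.def, but equivalent to the following on a WN or UI - the hostname is our svr019 and 2170 is the standard top level BDII port):

# Point clients at the local top-level BDII instead of RAL's
export LCG_GFAL_INFOSYS=svr019.gla.scotgrid.ac.uk:2170
# Quick check that it answers
ldapsearch -x -H ldap://svr019.gla.scotgrid.ac.uk:2170 -b mds-vo-name=local,o=grid | head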

Gridview and Site Reliability


The reliability of Tier-2 sites was a major topic at GridPP 18. This is going to be measured in an automated way from the SAM tests. I was very surprised, though, when I compared the SAM CE tests (the test I look at every day) against GridView. I measured our test pass rate over the last 3 weeks to be 93% (Durham), 96% (Edinburgh) and 97% (Glasgow). However, GridView measured us down as low as 80% (see slide 8 of my talk).

John Gordon gave the availability formula in his talk: it is CE & SE & Site BDII & SRM, so I had to check those tests as well. What I discovered was that failures in the SE and SRM tests were pulling us down (the site BDII tests were 100%). Investigating further, I found that the problem was in fact BDII failures within the SRM and SE tests. More than that, the failures were not of our site-designated BDII: the BDII had been hardcoded to sam-bdii.cern.ch, which is used by both of those tests, and all the failures were from this component.

I have raised a GGUS ticket asking to get this changed to the site defined BDII. There is absolutely nothing a site can do about the failure of a central component at CERN.

If one estimates a failure rate on information system lookups during replica management tests of ~2%, the fact that the information system is used for CE-RM, SE and SRM, i.e., 3 times, means a site just cannot get any better than 94%!
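
A one-line sanity check of that ceiling, assuming the three lookups fail independently at ~2% each:

# Three independent information system lookups, each ~98% reliable
echo "scale=4; 0.98^3" | bc -l    # gives ~0.9411, so ~94% at best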

Clearly this identifies the information system as a major problematic component which needs to be addressed if we have a hope of reaching our 95% targets. See Laurence Field's GDB talk for some ideas in this area - retries and caching look to be essential.

I have one other serious issue, and one quibble, with GridView. The serious thing is the 10% quantisation on the plots, which is a nonsense when our target is 95%. The quibble is the stupid mapping of the GOC site name to a quite unguessable five-letter abbreviation (the old scotgrid-gla site was "GLSGW", the new site is "SCOTG"). Clearly they should learn from the EGEE accounting pages and give us a hierarchical tree view with the correct site names.

Jobmanager Tweaks at Glasgow

I finally got around to applying the Cal Loomis patch to the gatekeeper which helps catch jobs in the torque Completed state. Instead of patching lcgpbs.in and then reconfiguring Globus, I patched the installed Perl module, /opt/globus/lib/perl/Globus/GRAM/JobManager/lcgpbs.pm, directly and then added this to our cfengine configuration.

In addition I patched the pbs jobmanager in the same way.

I have also now enabled the pbs jobmanager in /etc/globus.conf. Change the jobmanagers line to read

jobmanagers="fork lcgpbs pbs"

and then add

[gatekeeper/pbs]
type=pbs
job_manager=globus-job-manager

before restarting the gatekeeper.
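
A quick way to confirm the new jobmanager is alive (the hostname is our CE, and globus-job-run needs a valid grid proxy):

# Submit a trivial job through the newly enabled pbs jobmanager
globus-job-run svr016.gla.scotgrid.ac.uk/jobmanager-pbs /bin/hostname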

At first I thought that it was broken as I was getting GASS cache errors, but in fact this turns out to be because the pbs jobmanager cannot deal with non-shared home directories (the lcgpbs one can). We don't have shared directories for EGEE VO users - however, we do use shared homes for NGS and local users, so David reports that it works for him (he's in NGS).

In addition we now pass 14/16 of the NGS GITS tests. The two we don't pass are gsissh and gsiscp, because GITS naively assumes these are on the gatekeeper host. For us they are not as we maintain a gsi login host for NGS and local users on svr020 instead.

Good progress, though.

Contact with VOs


In my GridPP 18 talk I said we still felt we didn't have enough contact with VOs in general. A specific example of this is biomed. When I added the new VOs on Monday I found there was a problem with the biomed pool accounts, which seemed to have the wrong stub names (boimNNN instead of biomedNNN). As soon as I fixed this we got biomed jobs into the cluster.

So, we were a large site, with a biomed queue enabled, where things were broken from November until March - and we never got a ticket.

Of course, all the VOs are sorting out their operational procedures, and I'm sure that many other VOs would have been no different - but that's the point isn't it? Our contact with non-LHC VOs is poor right now.

Thursday, March 22, 2007

ScotGrid Hosts GridPP 18


GridPP 18 was hosted at Glasgow this week. The weather was good, the wireless network more or less survived and lots of people came to see the machine room. I had a good time - it was nice to play host.

You can read my presentation on the Tier-2 in case you slept in or weren't at the meeting.

Monday, March 19, 2007

New VOs at Glasgow: camont, gridpp, totalep

I've enabled the camont, gridpp and totalep VOs on the Glasgow cluster.

A complete description of how to do this is on the wiki.

The stupidest task must have been writing a script to reverse engineer a YAIM users.conf file from our group/passwd files, so that the YAIM utility functions like users_getvogroup work. There's surely an easier way of doing that?
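
For what it's worth, the guts of such a script boil down to something like the sketch below; the users.conf field layout (UID:LOGIN:GID:GROUP:VO:FLAG:) is written from memory, so treat it as an assumption rather than the YAIM specification:

# Rebuild users.conf-style lines for one VO's pool accounts from passwd/group
VO=gridpp
getent passwd | awk -F: -v vo="$VO" '$1 ~ "^"vo"[0-9]" {
    cmd = "getent group " $4
    cmd | getline grp
    close(cmd)
    split(grp, g, ":")
    print $3":"$1":"$4":"g[1]":"vo"::"
}'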

This afternoon I'll redo the RB and try and run some jobs through as a gridpp VO member.

DPM Disk Server Controlled by cfengine

Working on enabling the gridpp, camont and totalep VOs, I realised that the DPM disk servers had never been brought under cfengine control. This was a little trickier as these were the first SLC4 servers to be done, but in 30 minutes they were all ticking along happily, automatically updating authentication and other files.

There's still something rotten on disk035, which consistently refuses to yum update, complaining that the fetch-crl package is required and conflicts with glite-SE_dpm_disk - despite being installed! Perhaps its RPM database is corrupted? The workaround is to forcibly uninstall fetch-crl and then yum update.
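
In concrete terms the workaround was along these lines (a sketch of the steps, not a recommendation - a --nodeps removal is exactly the kind of thing that hints the box wants a reinstall):

# Forcibly remove the package that confuses the dependency check, then update
rpm -e --nodeps fetch-crl
yum -y update
# Put fetch-crl back afterwards if the update doesn't pull it in itself
rpm -q fetch-crl || yum -y install fetch-crl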

However, at some point we should probably drain this machine and re-install it.

Friday, March 16, 2007

Resource Broker (Beta) for ScotGrid

Through the wonders of YAIM and cfengine, I was able to set up an lcg-RB on svr023 in two cfengine lines: download the metapackage, run configure_node.

And it works! I got output back from my first job:

ppepc62:~/jobs$ edg-job-status https://svr023.gla.scotgrid.ac.uk:9000/5hT0x7GZluDFMWJ6qT0KLQ


*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://svr023.gla.scotgrid.ac.uk:9000/5hT0x7GZluDFMWJ6qT0KLQ
Current Status: Done (Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: svr016.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-dteam
reached on: Fri Mar 16 12:24:38 2007
*************************************************************

ppepc62:~/jobs$ edg-job-get-output https://svr023.gla.scotgrid.ac.uk:9000/5hT0x7GZluDFMWJ6qT0KLQ

Retrieving files from host: svr023.gla.scotgrid.ac.uk ( for https://svr023.gla.scotgrid.ac.uk:9000/5hT0x7GZluDFMWJ6qT0KLQ )

*********************************************************************************
JOB GET OUTPUT OUTCOME

Output sandbox files for the job:
- https://svr023.gla.scotgrid.ac.uk:9000/5hT0x7GZluDFMWJ6qT0KLQ
have been successfully retrieved and stored in the directory:
/tmp/jobOutput/graeme_5hT0x7GZluDFMWJ6qT0KLQ

*********************************************************************************


The jobs did, however, run really slowly, as R-GMA managed to lock up twice on me and needed to be restarted. I'm really fed up with this, so I have started an Ops Logbook for the site, to at least log these issues in a consistent way.

If anyone has a nagios/cfengine recipe for restarting R-GMA I'd be glad to use it.
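
In lieu of a proper recipe, something like the crude watchdog below would at least catch the hang; the servlet URL and the tomcat init script name are assumptions about our MON box, and it is untested:

# Poll the R-GMA servlet; if it doesn't answer within 30s, bounce tomcat
URL=https://localhost:8443/R-GMA/
if ! curl -k -s -m 30 -o /dev/null "$URL"; then
    /etc/init.d/tomcat5 restart
fi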

Top Level BDII for ScotGrid

I set up a top level BDII for ScotGrid this morning. I have put it onto svr019 (the MON box), as machines of this type are perfectly capable of running multiple services (and svr019 might as well do something useful ;-).

I'll add a top level alias, bdii.scotgrid.ac.uk, and as load increases more machines can be added behind it.

Before I actively use it for ScotGrid, though, I will let it run for a week or so - I'm naturally nervous about MCE errors these days and although svr019 has not suffered any thus far, our experience suggests that loading up a machine can cause things to be more fragile in this respect.

Although the BDII is a simple node to install with YAIM, I'm really pleased how much value we get from running cfengine these days - just telling it the machine is in the grid class now does so much. So I only had to add two lines to cfengine to handle the whole install: one in packages, to get glite-BDII installed, one in shell commands to invoke the YAIM configure_node. So easy!
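
Concretely, the two cfengine-driven steps boil down to something like this (the configure_node path and the BDII node type name are from memory of our gLite 3.0 YAIM layout, so verify before reuse):

# What the two cfengine lines amount to on svr019
yum -y install glite-BDII
/opt/glite/yaim/scripts/configure_node /opt/glite/yaim/etc/site-info.def BDII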

Thursday, March 15, 2007

New Installer Christened: YPF is Born

Working on the new installer again today. With dnsmasq installed on svr031 the installer started to work fine. It is still using the text classes.conf and the move to SQLite might have to wait until svr031 is reinstalled - it's not clear it will work right now. (And today I was very pressed for time - just need to get things working!)

I felt we needed a name for the installer. As it's written in Python, a Flying Circus name seemed appropriate, and Andrew suggested "YAIM People's Front". Brilliant.

In the afternoon I added a new host to the cluster, using 2 cheapo Netgear switches to connect to the internal and external cluster networks. The tools for manipulating the cluster database are very primitive right now - clearly there's a lot of work to be done here - but a minimal interface allowing an arbitrary SQL command to be issued suffices for now.

I have written some documentation in the wiki.

There's no real intention to make YPF a general installer project at the moment, but other sites might find aspects of it useful.

Wednesday, March 14, 2007

Minor Panic Over DPM Upgrade and GIP


I had a minor panic over the DPM GIP plugin today when I looked over Glasgow's storage stats on the storage accounting pages. I was worried that the DPM schema changes had made the new plugin fail. But when I looked in detail I saw that what had actually happened was that running YAIM had replaced the new plugin with the old one - hence the publishing problems.

The fix was just to re-enable the new plugin, after which things recovered right away.

svr031 Installer Almost Working

Spent more time on the new infrastructure for svr031 again today. I wrote the script to get a dhcpd.conf file out of the cluster database, which was quite straightforward to do. Then dhcpd could function again.
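
The script itself is nothing clever; a stripped-down sketch of the idea is below (in shell rather than the installer's Python, and with the database path and table/column names invented for illustration rather than taken from our actual schema):

#!/bin/sh
# Emit dhcpd host stanzas from the cluster SQLite database
sqlite3 /var/lib/cluster/cluster.db "SELECT hostname, mac, ip FROM hosts;" |
while IFS='|' read -r host mac ip; do
    echo "host $host {"
    echo "    hardware ethernet $mac;"
    echo "    fixed-address $ip;"
    echo "}"
done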

I tried a test install and found that tftp was, amazingly, still working even though xinetd had lost all of its configuration.

Then the install fell over slightly - the kickstart installer was using the old internal hostname of the svr031 server, master.beowulf.cluster, but setting the name server resolution to the university DNS servers. As an installing machine does not yet have the big cluster /etc/hosts file, it could not resolve this name. So I changed the kickstart setup to code in only the IP address of svr031, so no name resolution was needed. This allowed the install to proceed; however, I then discovered some cleverness in the kickstart post-install script which ensures that even on first boot the machine sets its hostname correctly (in particular, to the routed hostname for grid and disk servers) - and this relies on DNS to function.

So, I either have to rewrite the clever code, or go back to running DNS on svr031.

I decided that running DNS on svr031 was no bad thing. There's a lovely lightweight DNS server called dnsmasq - it serves DNS from the contents of /etc/hosts, which is already generated from the cluster database, and can hand out name resolution during the install process.

After a machine has first booted, cfengine will copy in the global cluster /etc/hosts and set the DNS to the university servers to remove the single point of failure. But having internal DNS for the installs is nice.

Even better I found dnsmasq is built for RHEL4 x86_64 in the DAG repository, so installing it was a cinch.

Finally, dnsmasq also includes a dhcp server and a tftp server, so I will have a look at running these and possibly simplifying life even more.
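
For reference, the dnsmasq setup really is minimal; a sketch of the sort of thing described, where the internal interface name is an assumption and DHCP/TFTP are left to the existing services for now (dnsmasq serves DNS straight from /etc/hosts by default):

# Install from DAG and serve DNS on the internal cluster interface only
yum -y install dnsmasq
cat >> /etc/dnsmasq.conf <<EOF
interface=eth1
no-dhcp-interface=eth1
EOF
service dnsmasq start
chkconfig dnsmasq on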

Installs should be back up and running tomorrow.

Tuesday, March 13, 2007

Webcache Exemption For New Cluster

Finally, after months of asking, we've been exempted from the University's webcache. There are still some open networking issues, but this is a big step forwards.

New Fabric Management Schema Designed

I spent most of the day with svr031, designing and implementing the new fabric management database system. Paul and Andrew reviewed my initial schema and suggested some improvements - the current schema is now described in the wiki.

Then I had to write a couple of scripts to take the extant information from the CVOS configuration files, and to add extra information for routed hosts, and then populate the database.

After that was done, I was able to write the first utility script, to extract an /etc/hosts file from the database. This is an improvement over the version recovered from DNS, so I gave it to cfengine to distribute across the cluster.

Next stop, getting a working dhcpd.conf file, so that we can begin new installations again.

This is useful and essential work, but bloody tedious at some level...

Torque/Maui Upgrade Lost Jobs?

Was the torque upgrade not as smooth as I'd hoped?

It should have been just a minor upgrade, and thus pretty transparent, but some issues have arisen.

Firstly one of our local engineering users reported that the gatekeeper lost contact with all of his jobs - the jobs were still running in the batch queue, but globus-job-status reported them all done, and he couldn't get any output back.

Then I noticed that on the ATLAS production monitor page our 24 hour efficiency dropped to its lowest ever level, 24%. This makes me rather worried that we lost all of our ATLAS jobs if the gatekeeper had a brain haemorrhage.

On the other hand, efficiency in the UK seems generally very low right now (ce02.tier2.hep.manchester.ac.uk, 26%; ce1.pp.rhul.ac.uk, 24%; fal-pygrid-18.lancs.ac.uk, 17%; lcgce01.gridpp.rl.ac.uk, 14%), so perhaps this is just a coincidence?

Doesn't explain the globus issues seen by our local user though.

Does anyone know the magic for getting into the guts of the gatekeeper and seeing which torque jobs it's connected to?

Torque Default Queue

Problem - The NGS GITS tests do not easily support naming the torque queue to be used.

Solution - Make the default queue a routing queue ("queue_type = route") and give it all the VO-based execution queues ("queue_type = execution") as routing destinations, e.g.:
set queue route2all route_destinations += dteam
set queue route2all route_destinations += atlas
set queue route2all route_destinations += alice
set queue route2all route_destinations += cms
set queue route2all route_destinations += lhcb
set queue route2all route_destinations += biom
.......... etc
It would seem that torque then tries the destination queues in turn, and the first that matches the job accepts it. Not only does this solve the NGS GITS problem, it also seems a more generally elegant scheme for the default queue.
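
For completeness, the full set of qmgr commands involved is roughly the following (a sketch using standard Torque attribute names; check them against the output of qmgr -c "print server" on your own system):

# Create the routing queue, point it at the execution queues, make it the default
qmgr -c "create queue route2all queue_type=route"
qmgr -c "set queue route2all route_destinations = dteam"
qmgr -c "set queue route2all route_destinations += atlas"
qmgr -c "set queue route2all enabled = true"
qmgr -c "set queue route2all started = true"
qmgr -c "set server default_queue = route2all"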

Monday, March 12, 2007

svr031 recovery ongoing

There's been sadly little time to work on svr031. However, work with our nuclear physics community now seems bound to involve installing new machines on the cluster, so things have to begin in earnest.

I have recovered the lftp mirror scripts, so that's a start here.

Next I have to look at making a proper SQLite database of machines and extracting a dhcpd.conf from that. Fortunately the tftp root was not lost, so that part of the boot process should be ok.

Blog Layout Updated

I've updated to the new Blogger layout. This provides a much better archive option, using a collapsible tree view (so it's easier to find past posts), and <pre> text now wraps, instead of spilling into the sidebar.

The only thing I don't like is that the new default text style for posts is a little cramped - it needs a little more line spacing IMO. If I knew about CSS I could probably fiddle with this in the template (but life's too short...)

Finally, Safari doesn't quite work with the new layout editor, but Firefox is fine.

Glasgow Update to gLite 3.0r16

This was the first gLite update with significant component changes:

  • DPM was upgraded to 1.6.3, with a schema change and a new SRM v2.2 daemon.
  • Torque and Maui were upgraded to v2.1.6 and v3.2.6, respectively, from the previously ancient LCG versions.

The DPM upgrade I tackled first. This was fine and I blogged about it on the storage blog. Be careful to take a database dump before you try it, just in case things go wrong. There's also a strong warning against running automatic updates on gLite server nodes - this is not supported, and some people are reporting DPM database corruption on LCG-ROLLOUT.

The torque/maui upgrade I was a little nervous about, as I don't feel I greatly understand these components. However, Steve T had said that minor upgrades are OK (we'd been using the Steve T build for the cluster since the start, so we were already on torque v2), so I took the plunge. First I did a single worker node and restarted pbs_mom, to make sure the 2.1.6 mom didn't have trouble talking to the 2.1.5 server - and it didn't. So then I updated all the WNs, before turning my attention to the server.

Here, I did the usual yum -y update first. Then I restarted pbs_server and maui. pbs_server didn't restart cleanly, claiming something was bound to the port. I had a look, but by the time I did there was nothing - I think it was the server being sluggish to exit. So pbs_server then (re)started fine, and I did an extra maui restart to be on the safe side.

A basic check of the batch system (pbsnodes, qstat, diagnose) looked ok.
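
Boiled down, the whole sequence was something like this (a sketch: the init script names are the standard Torque/Maui ones on our servers, and the yum update on the WNs in practice pulled the whole gLite update rather than just torque):

# On one WN first, then the rest:
yum -y update
service pbs_mom restart
# Then on the server:
yum -y update
service pbs_server restart   # needed a moment for the old server to let go of its port
service maui restart
# Sanity checks
pbsnodes -a | grep -c 'state = free'
qstat -q
diagnose -n                  # maui node diagnostics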

I have commented Steve's repository out of my yum.conf - we'll now use the "official" gLite build, on Steve's advice.

N.B. I still intend to manage the batch system using cfengine, not YAIM - it's a lot more flexible for us to do this, e.g., the new routing queue being the default one.

Friday, March 09, 2007

Durham Enables New VOs

Durham has enabled the following VOs:

  • cedar
  • ngs
  • gridpp
  • minos
  • totalep
  • camont
  • mice
This brings the number of supported VOs at Durham up to 28.

Network Asymmetry

Any transfer tests including Glasgow show up an asymmetry in the network: we can take data in twice as fast as we can spit it out. Note to self: follow up with Compserv networks.

Tuesday, March 06, 2007

R-GMA is so...

R-GMA just keeps falling over on the new site. Inexplicably, tomcat will lock up and no response can be got from the R-GMA servlet.

This is very frustrating - there's nothing interesting in the logs either. It just dies.

Today when I was trying to restart it, tomcat restarted OK, but the R-GMA servlet would not startup, announcing cryptically:

2007-03-06 16:49:21,601 [http-8443-Processor23] WARN org.edg.info.ServletBase - Error processing servlet request.
org.glite.rgma.system.UnknownResourceException: Object has been closed: 77151190
at org.edg.info.InstanceTracker.get(InstanceTracker.java:155)
at org.glite.rgma.services.ResourceService.getInstance(ResourceService.java:179)
[...]
at java.lang.Thread.run(Thread.java:534)

What on earth's going on there?

Paul saved the day, using the tomcat manager interface to stop and then restart R-GMA. After that it was a bit happier.

I'll have to work out some nagios alarm or cfengine test for tomcat though - as it just hangs, simple checks on the daemon's presence won't work. Grrr...

Monday, March 05, 2007

Progress Towards Enabling Total at Glasgow

I've spent all afternoon working out how to enable VOs on the new Glasgow cluster, using the new "totalep" VO as my template VO.

As this is the first time we've added a new VO since the cluster was setup, naturally this needs to be done rather carefully to ensure that all the bits go into the right place.

My modus operandi is to try and run YAIM as little as possible - so if a complex YAIM function does lots and lots of things, and runs 2 commands per VO, then I will try and extract those two things and give "by hand" instructions. Hopefully the only thing I will need to re-run YAIM for is the information system, which is damn fiddly to get right by hand.

Progress so far is on the wiki.

Once this has been done once and documented it should be much faster for any subsequent VOs.

ALICE Queue Disabled at Glasgow

I have been regularly pruning the orphaned ALICE processes off the Glasgow CE. This morning I had to kill 371. We've had very little response from the user concerned, and I can't see how they will be motivated to fix the situation as ALICE have not, and at the moment will not, run any jobs at Glasgow. (We have a 5% share for ALICE as this is GridPP policy and, of course, we generally think grids work better if lots of VOs are enabled.)
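
The pruning itself is just a loop over the ALICE pool accounts killing their leftover job manager processes; a sketch, where the account prefix and the process name pattern are assumptions about our mapping rather than anything ALICE-specific:

# Kill orphaned globus job manager processes for every alice pool account
for u in $(getent passwd | awk -F: '/^alice/ {print $1}'); do
    pkill -u "$u" -f globus-job-manager
done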

I did check the local ALICE queue and it functions correctly.

However, I've now closed the queue in the hope that this will prevent the globus processes from spawning. We know from experience that if these processes are allowed to accumulate too much, they affect the performance of the CE and thus all users of the site.