Friday, April 25, 2008

#include <documentation.h>

I quote from the DPM Developer Documentation:

LFC/DPM Database schema
TO DO : describe non straight forward tables/fields....


So, with that in mind, I set about pulling out the number of SRM 2.2 requests vs the number of SRMv1 requests at the site - v1 usage should be roughly constant, while SRM 2.2 should show a rapid increase since we enabled it (what with all the new users coming onboard). Well, it's not easy to grep from the logs, so I thought I'd poke the DB. First off, in dpm_db.dpm_req the r_type field (a char(1)) normally holds 'g' (get?) and 'p' (put?), but we have just over 1500 rows where the type is 'B' (broken?). Hmm - all from Flavia's DN and a clienthost of lxdev25.
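
Something along these lines gives the breakdown by request type (the dpmmgr user is an assumption - use whatever account your DPM uses for MySQL):

mysql -u dpmmgr -p dpm_db -e "SELECT r_type, COUNT(*) FROM dpm_req GROUP BY r_type"

Grouping by the request timestamp as well would give the trend over time, once I work out which column that actually is.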

My plots of the DPM usage are far too spiky to make sense of at the moment, but I'll work on presenting the info more clearly.
In the meantime I discovered that it's pretty obvious from the plots when we set torque to fill the job slots in host order (it made it easier to drain nodes) and when we send nodes away to vendors.


Thursday, April 24, 2008

assimilation



I'd noticed that over the last month the load on our DPM headnode had been higher than before we switched on MonAMI checking of the DPM. Of course I instantly blamed the developer of said product. However, when I disabled MonAMI to prove the point, lo... the load didn't change. Hmm.

I then started working out how to optimise the MySQL memory usage - a mysqldump -A produces about a 1.8G ASCII file, yet the InnoDB tablespace takes up a whopping 4.4G on disk, with tiny, constantly rolling 5M transaction logs.
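
For the record, the sort of knobs I'm looking at are along these lines in /etc/my.cnf - the values are illustrative guesses, not what we've settled on:

[mysqld]
# a decent InnoDB buffer pool rather than the tiny default
innodb_buffer_pool_size = 1G
# bigger redo logs than the 5M default (needs a clean shutdown and the old
# ib_logfile* files moved aside before restarting mysqld)
innodb_log_file_size = 64M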

As Paul was here at CERN (it's the WLCG Workshop this week) we got together to hammer out some changes to our implementation. When we logged onto svr018 (the DPM headnode) I noticed that MonAMI was running again. It turns out that cfengine was "helpfully" restarting the process for us. Grr.

So, an evening of infrastructure management changes:
- we had i386 monami rpms installed on x86_64 nodes - we'd hard-coded the repo path rather than using the $basearch variable for our local mirror.
- we had to make sure we used backup=false in cfengine - where an application reads a whole config directory (such as /etc/monami.d and /etc/nrpe.d) it would otherwise pick up both someconfigfile.cfg and the someconfigfile.cfg.cfsaved backup copy - ditto cron.d etc.
- we were sometimes trying to run 64-bit binaries on 32-bit machines as we'd copied them straight out with cfagent (normally nrpe monitors). We're now using $(ostype) in cfagent, which expands to linux_x86_64 or linux_i686 on our machines (cfengine does set hard 32_bit and 64_bit classes, but you can't use those in a variable) - see the sketch after this list.
- we now have the 'core' nrpe monitors (disk space, load, some process checks) installed on ALL servers, not just the worker nodes. Ahem. We thought we'd implemented that before.
- we've upgraded to the latest CVS rpm of MonAMI on some nodes and now have groovy MySQL monitoring - oh, and the load's gone down too.
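
For reference, the copy stanzas now look roughly like this (check_foo and the skel paths are illustrative; the important bits are backup=false and $(ostype)):

copy:
  any::
    $(skel)/nrpe/check_foo.$(ostype) dest=/usr/lib/nagios/plugins/check_foo mode=0755 backup=false type=sum
    $(skel)/monami.d dest=/etc/monami.d recurse=inf mode=0644 backup=false type=sum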

Tuesday, April 15, 2008

Oh no, not again...

We went through a little rash of SAM test failures last night. This turned out to be an LHCb user who was submitting jobs which filled up the scratch area on the worker nodes and turned them into black holes.

Obligatory GGUS ticket was raised.

We do alarm on disk space filling up on the worker nodes, but it was still four hours before anyone took action and set the nodes offline for cleaning. In that time an awful lot of jobs were destroyed. Makes me think we might want to automate the offlining of nodes which run out of disk space, pending investigation.
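
A minimal sketch of what that automation might look like, run from the torque server - the 95% threshold and the use of /tmp as the scratch area are assumptions for illustration:

#!/bin/sh
# offline any worker node whose scratch area is nearly full, pending a clean-up
for node in $(seq -f "node%03g" 1 140); do
    used=$(ssh "$node" "df -P /tmp" | awk 'NR==2 {gsub(/%/,""); print $5}')
    [ -n "$used" ] && [ "$used" -ge 95 ] && pbsnodes -o "$node"
done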

Saturday, April 12, 2008

Splunk / nagios / logrotate

Well, I upgraded the cluster to Nagios 3 this evening and noticed it has a new enable_splunk_integration option in cgi.cfg. I'd looked at Splunk before and thought 'hmm, nice idea, not sure it'll work with the grid stuff', but decided to give it a whirl.

First up, the Nagios gotchas. We had the DAG rpm installed, which hasn't been updated to the 3.0 release, let alone 3.0.1, so I went for the manual compile option. I discovered that the (gd|libjpeg|libpng)-devel packages weren't installed - quickly fixed with yum.

I took the ./configure line from the spec file as a guide - however it managed to splat the CGIs into /usr/sbin rather than /usr/lib64/nagios/cgi - thanks :-( I soon found them and moved them around. It seems to be working OK - I've not installed the newer WLCG monitors yet; that's the next task.
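
For anyone repeating this, the trick is to tell configure where the CGIs should live (they go to --sbindir) - something along these lines, with the exact paths double-checked against the spec file:

./configure --prefix=/usr \
            --sysconfdir=/etc/nagios \
            --localstatedir=/var/log/nagios \
            --sbindir=/usr/lib64/nagios/cgi \
            --libexecdir=/usr/lib64/nagios/plugins \
            --with-nagios-user=nagios --with-nagios-group=nagios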

Splunk - looks flash, but is it any good? There's no sign of any educational pricing on their website and the 'free' version has one HUGE weakness - no user authorisation / login. A temporary workaround of some iptables rules reduced the risk, and I had a play: I defined /var/log on our central syslog server as a data source and watched it go.

Well, sort of... it promptly filled /opt/splunk, as it makes an indexed copy of anything it finds - I think for a real install we'd need some new disk space. Secondly, it quickly swallowed more than its 500M/day 'free' allowance. I grabbed a 30-day trial licence of the enterprise version and lo, it now complains that I've had 2 licence violations of over 5G/day indexed. Harumph.

I'm not sure if this would settle down once it gets through the backlog of archived logfiles - perhaps if I fed it only a syslog FIFO it'd be happier. Also, we have the 'traditional' logrotate style of .1 .2 .3 etc. rather than the more dirvish-friendly dateext option - we should really swap... if the RHEL logrotate supports it :-/

"rpm -q logrotate --changelog" doesnt mention it although its fixed in fedora

The other issue is that Splunk thrashes the box as it indexes, and it's just stopped because it's filled the disk again. Ho hum.

Wednesday, April 02, 2008

A long time coming: UKI-SCOTGRID-ECDF on APEL

So, yes, it's probably taken a little longer than it might have, but UKI-SCOTGRID-ECDF is now publishing all its accounting data back to early January.
Here

Of course, ops has a disturbingly high share of the Grid usage at the moment, but hopefully we will start to get ATLAS (and maybe even LHCb) jobs filtering in in the near future...

ECDF running for ATLAS

ECDF have now passed ATLAS production validation. The last link in the chain was ensuring that their SRMv1 endpoint was enabled on the UK's DQ2 server at CERN - this allows the ATLAS data management infrastructure to move input data to ECDF.

After that problem was corrected this morning, the input data was moved from RAL, a production pilot picked up the job and ran it, and the output data was moved back to RAL.

I have asked the ATLAS production people to enable ECDF in the production system and I have turned up the pilot rate to pull in more jobs.

We had a problem with the software area not being group writable (for some reason Alessandro's account mapping had changed), but this has now been corrected and an install of 14.0.0 has been started.

It's wonderful to now have the prospect of running significant amounts of grid work on the ECDF resource. Well done to everyone!

Only one bite at the cherry...

I have modified the default RetryCount on our UIs to zero retries. Automatic retries actually worked quite well for us when we were losing a lot of nodes to MCE errors (in the days before the upgrade to SL4, x86_64) - users' jobs would automatically rerun if they got lost and there was no need for them to worry about failures. However, recently we've seen users submitting more problematic jobs to the cluster - some fail to start at all, some run off into wallclock limits, others stall half way through. Often we have to gut the batch system with our special spoon, and in that case having to do it four times because the RB/WMS keeps resubmitting the job is less than helpful.

For once cfengine's editfiles stanza was useful and a simple:

editfiles:
  ui::
    { /opt/glite/etc/glite_wmsui_cmd_var.conf
      ReplaceFirst "RetryCount\s+=\s+[1-9];" With "RetryCount = 0;"
    }
    { /opt/edg/etc/edg_wl_ui_cmd_var.conf
      ReplaceFirst "RetryCount\s+=\s+[1-9];" With "RetryCount = 0;"
    }

got the job done.

Tuesday, April 01, 2008

Emergency Outage

Due to a vulnerability to the security flag in the ipv4 header, we will be taking uki-scotgrid-glasgow offline today for an upgrade. In order to minimise downtime we shall be rebooting all worker nodes simultaneously rather than draining queues.

We aim to have this work completed by 12:00 midday today.

Please see RFC 3514 for more details. We advise other sites to perform this upgrade asap.

Friday, March 28, 2008

brain dead batch systems

Why oh why are some of the batch utilities so brain dead? A simple 'qstat -r' should show who's running jobs, right? Wrong - it truncates usernames to a fixed 8-character width. Doh. So 'prdatlas' and 'biomed06' seem to be busy. Well, not quite: if I do qstat -f | egrep " e(group|user) " | sort -u I see that it's actually prdatlas028 and several biomed06? users. Grr...

I may install Job Monarch from SARA, but in the meantime it'll be some hacky PHP to parse the outputs a bit more cleanly.
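
In the meantime, a quick one-liner gets the full usernames and a per-user job count straight out of torque (an untested sketch):

qstat -f | awk '$1 == "euser" {print $3}' | sort | uniq -c | sort -rn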

Also, despite having 493 running jobs at the moment (we're down on capacity as I'm still doing a rolling upgrade to SL4.6 and a new kernel), there is only a very small number of users on the system:

svr031:~# qstat -f | grep euser | sort -u | wc -l
14


Not good, especially if they all decide to take a break.

Thursday, March 27, 2008

p p p pick up a pakiti


We've been using pakiti at Glasgow for some time now to keep an eye on which nodes are out of date. One minor niggle is that it doesn't keep track of the grub default kernel (i.e. what should come up on reboot) compared to the running kernel.

We already had a very simple shell script that did that (sketched below the output):

pdsh -w node[001-140] chkkernel.sh | dshbak -c
----------------
node[001,005,007,014,016-020,022-023,025,028,031-061,063-085,087-090,092,095-096,098-101,103-104,106-107,109-110,113,115,118-120]
----------------
Running: 2.6.9-67.0.7.ELsmp, Grub: 2.6.9-67.0.7.ELsmp, Status OK
----------------
node[062,091,093-094,097,102,105,108,111-112,114,116-117,121-127,129,131,133-134,136-139]
----------------
Running: 2.6.9-55.0.9.ELsmp, Grub: 2.6.9-67.0.4.ELsmp, Status error
----------------
node[003,009,011,013,015,021,027,029]
----------------
Running: 2.6.9-55.0.12.ELsmp, Grub: 2.6.9-67.0.7.ELsmp, Status error
----------------
node[128,130,132]
----------------
Running: 2.6.9-55.0.12.ELsmp, Grub: 2.6.9-67.0.4.ELsmp, Status error
----------------
node[002,004,006,010,012,024,026,030,140]
----------------
Running: 2.6.9-55.0.9.ELsmp, Grub: 2.6.9-67.0.7.ELsmp, Status error
----------------
node[086,135]
----------------
Running: 2.6.9-67.0.4.ELsmp, Grub: 2.6.9-67.0.4.ELsmp, Status OK
----------------
node008
----------------
Running: 2.6.9-67.0.4.ELsmp, Grub: 2.6.9-67.0.7.ELsmp, Status error
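
For the curious, chkkernel.sh boils down to something like the following - this is a sketch rather than the exact script, and the grub.conf parsing in particular is an assumption:

#!/bin/sh
# compare the running kernel against the grub default entry
RUNNING=$(uname -r)
IDX=$(awk -F= '/^default/ {print $2}' /boot/grub/grub.conf)
GRUB=$(awk '/^[[:space:]]*kernel/ {print $2}' /boot/grub/grub.conf \
       | sed -n "$((IDX + 1))p" | sed 's|.*vmlinuz-||')
if [ "$RUNNING" = "$GRUB" ]; then
    echo "Running: $RUNNING, Grub: $GRUB, Status OK"
else
    echo "Running: $RUNNING, Grub: $GRUB, Status error"
fi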



But I finally got this check integrated into pakiti with some patching - see http://www.scotgrid.ac.uk/wiki/index.php/Pakiti

The result: pretty green/red status in the "default kernel" column.

The patches have been emailed to Romain, so they may well appear upstream eventually.

Wednesday, March 26, 2008

Edinburgh as an LHCb Tier-1?


I've just been accused (jokingly, I hope) of trying to turn Edinburgh into LHCb's 7th Tier-1. The attached plot shows the recent data transfers that I have been running into our dCache. The rates are good (~35MB/s), but not particularly special. However, against a background of zero, it certainly made LHCb jump up and take notice ;) Maybe this will convince them that Tier-2s really can be used for analysis jobs...

I should note that during these transfers one of the dCache pools was about to melt (see below). I've since reduced the maximum number of movers on each pool to something more reasonable. For the tests, I created a small application that spawned ~50 simultaneous lcg-cp's, all transferring files from CERN CASTOR to Edinburgh. Who needs FTS when you've got DIRAC and lcg_utils? Now all I need is someone else's proxy and I'll never be caught... ;) But, on a serious note, I suppose this does show that people can create tools to abuse the system and get round the official FTS channels, which could impact the service for other users.
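
Conceptually the 'small application' was nothing more than a loop like this (sketched in shell; the file list and the destination SURL are placeholders, not the real tool):

#!/bin/sh
# fire off ~50 lcg-cp transfers in parallel from CASTOR at CERN to the Edinburgh dCache
DEST='<SURL prefix on the Edinburgh dCache>'   # placeholder
for surl in $(head -50 castor-filelist.txt); do
    lcg-cp -v --vo lhcb "$surl" "$DEST/$(basename "$surl")" &
done
wait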

The curse of 79...

Since the dawn of the Glasgow cluster we have been cursed with a low level of globus 79 errors. We did not understand these well, but always believed that they were caused by a confusion in the gatekeeper, where the X509 authentication seemed to suffer a race condition and get muddled between users.

However, since upgrading to an SL4 CE and installing it on a different machine we still get these cropping up (an example).

The GOC Wiki suggests this can be caused by firewall trouble or an incorrect GLOBUS_TCP_PORT_RANGE. Now, this is (and was) correctly defined on both machines to be the standard 20000-25000. However, I have decided to change it to 50000-55000 in case we are tripping some generic nasty filter somewhere else on campus.
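
The other half of that change is making sure the new range is actually open on the relevant firewalls - something like:

iptables -A INPUT -p tcp --dport 50000:55000 -m state --state NEW -j ACCEPT

on the CE itself.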

Since I did that last night we haven't had a 79 error - however this proves nothing so far, as we can easily go a week without one of these happening.

I also contacted the campus networking people to ask if there were any known port blocks in this range.

Data Management and MICE

I had a chat to one of our MICE PhD students a couple of weeks ago and I was explaining how to use EGEE data management (SRMs, LFCs, FTS, lcg utils, etc.). His comment afterwards was "I didn't know I was going to do a PhD in data management...".

The problem is that all these tools are very low level, so any user community has to build a higher level infrastructure on top of this. Obviously the LHC experiments have done this extensively, but it is frustrating that there is no simple generic data management service for smaller VOs who lack the resources of the larger VOs.

I wonder if this accounts for the popularity of SRB in other communities? It may have some limitations, but it clearly offers a higher level data storage, cataloging and metadata service which must be attractive for smaller user communities. Surely there is a potential project to try and tie all of the EGEE components into a sensible data management system?

Saturday, March 22, 2008

Durham - SL4 Install Success!



Durham took the plunge earlier this week to upgrade the CE, SE and all nodes to SL4.6... with success! After our preparation was delayed slightly due to a small UPS failure, we set about installing cfengine to handle the fabric management. This took a little longer than expected but our patience has paid off and it eases the pain of setup and config of clusters. Using the normal RedHat Kickstart to get a base install of SL4.6, we then hand the rest of the setup to cfengine to work its magic (install extra packages, setup config files, run YAIM etc).

Firstly, installing a worker node was relatively straightforward. Then came the CE, along with torque/PBS and the site BDII setup. Thanks to Graeme for helping to check that our site was working and publishing as expected.

We unexpectedly hit a firewall issue as I had renamed the CE from the old "helmsley.dur.scotgrid.ac.uk" to "ce01.dur.scotgrid.ac.uk"... though I had preserved the IP address. Not what I expected but our network guys were able to fix the rules and we were operational again.

Then the SE followed very quickly afterwards, cfengine and YAIM working their magic very successfully. The procedure was as simple as 1) dump of the database, 2) install SL4.6, 3) Let cfengine do its stuff for a base install, 4) restore the database, 5) Run YAIM. Simple!

Just one gotcha was trying to change the NFS mounted home directories to be local to the nodes. This fails with an error trying to copy the globus-cache-export files. Due to time constraints we have re-enabled the NFS home dirs... but I'm sure this will be simple to fix and I'll look at it next week.

Fair shares and queue times will need reviewing, but all in all a busy and successful few days. We're passing SAM tests and I've seen Phenogrid, Atlas and Biomed running jobs. Still the UI and a disk server to do, but with cfengine in place this should be relatively straightforward and will require no downtime.

Wednesday, March 12, 2008

Another ECDF/Grid requirement mismatch.

While ECDF is, in principle, functional and capable of running jobs, this is a bit useless if no-one can see if you're doing it. So, in the face of APEL accounting still not working for the cluster, I had another look.

There were two problems:
Firstly, the SGE accounting parser was looking in the wrong directory for the SGE accounting logs - this fails silently with "no new records found", so I hadn't noticed before. The configured location was actually correct when I set the thing up, but the mount point had been moved since (as the CE is not on the same box as the SGE master, we export the SGE accounting directory over NFS to the CE so it can parse the logs), with no indication that anything was wrong.

Secondly, after I fixed this...
It turns out that the APEL java.lang.OutOfMemoryError strikes again for ECDF.
The ECDF systems team configure the SGE accounting to roll over accounting logs on a monthly basis. Unfortunately, this leads to rather large accounting files:
# ls --size /opt/sge/default/common/acc* --block-size=1M
1543 /opt/sge/default/common/accounting

(yes, gentlemen and ladies, that is a one and a half gig accounting file...and we're only half-way through the month. The archived accounting logs tip the scales at around a quarter to half a gig compressed, but they compress rather efficiently so the "true" size is much larger - up to 10x larger, in fact.)

I suspect the next step is to arrange some way of chopping the accounting file into bite-sized chunks that the APEL log parser is capable of swallowing.
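
Something as dumb as split would probably do - carve the file up and feed the chunks to the parser one at a time (the chunk size is a guess at what APEL will swallow):

split -l 200000 /opt/sge/default/common/accounting accounting.chunk.
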
The irony is that we already parse the accounting logs internally using a thing called ARCo - I've not seen any indication that it would be easy to get APEL to understand the resulting database, though.

Monday, February 25, 2008

Dem info system blues

I fixed a problem on the CE information system tonight. YAIM had gone a little screwy and incorrectly written the lcg-info-dynamic-scheduler.conf file, so I had added the lrms_backend_cmd parameter myself as:

lrms_backend_cmd: /opt/lcg/libexec/lrmsinfo-pbs -h svr016.gla.scotgrid.ac.uk

Adding the host seemed sensible, as the CE and the batch system don't run on the same node, right? Wrong! The host parameter ends up being passed down to "qstat -f HOST", which is a broken command - we ended up with zeros everywhere for queued and running jobs and, consequently, a large stack of biomed jobs we are unlikely ever to run.
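
In other words, the line just wants to be the plain command, with the torque client's own server configuration pointing qstat at svr016:

lrms_backend_cmd: /opt/lcg/libexec/lrmsinfo-pbs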

I raised the obligatory GGUS ticket: https://gus.fzk.de/pages/ticket_details.php?ticket=33313

To VOMS or not to VOMS? That is the question (for LCMAPS...)

Our advice to local users of the cluster has traditionally been not to use VOMS credentials. This is to ensure that they are mapped in the batch system to their local account, rather than to a pool account derived from their VOMS attributes (mappings to local accounts are maintained in the grid-mapfile-local file). In the default configuration of LCMAPS, VOMS pool account mappings are made before the grid-mapfile, which is now just a fallback.

However, I could not simply reverse the order of the LCMAPS plugins as this would undo all the good which VOMS brings and move everyone back to a single fixed or pool account mapping no matter what their VOMS credentials (this would probably have affected me worse than anyone as I flit between atlas, atlas/Role=production, vo.scotgrid.ac.uk and dteam!).

So, for local users grid-proxy-init seemed to be the way to go, even if I knew this would come back and be a problem later. However, later became earlier as soon as I started to test the gLite-WMS - here it turns out you must use a VOMS proxy. Simple grid proxies just don't work anymore.

Finally, after puzzling over the very poor LCMAPS documentation and staring at the configuration script, I managed to solve the problem by:

  1. First running a local account plugin against a grid-mapfile which only contains our local user accounts.
  2. Then running the VOMS plugins as usual.
  3. Finally, running the grid-mapfile plugin, against the usual /etc/grid-security/grid-mapfile.
This was almost too easy to be true - and indeed it turns out not to be quite that simple, as you hit a bug in LCMAPS whereby you cannot use the same module twice - so listing lcmaps_localaccount.mod twice is not possible. However, it turns out that you can do it if the module is physically copied and renamed. This works, so we now have an lcmaps_localaccount.mod and an lcmaps_localuseraccount.mod - exactly the same bytes, different names! (To be strictly accurate we have two copies of liblcmaps_localaccount.so.0.0.0, to which these links point.)
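
For reference, the resulting lcmaps.db looks conceptually like this - the module options and paths are stripped out and the names abbreviated, so treat it as a sketch of the ordering rather than a drop-in config:

# lcmaps_localuseraccount.mod is the renamed copy, pointed at the
# local-users-only grid-mapfile; poolaccount uses the usual grid-mapfile
localuseraccount = "lcmaps_localuseraccount.mod ..."
vomslocalgroup   = "lcmaps_voms_localgroup.mod ..."
vomspoolaccount  = "lcmaps_voms_poolaccount.mod ..."
poolaccount      = "lcmaps_poolaccount.mod ..."

# policy: local users first, then VOMS, then the ordinary grid-mapfile fallback
standard:
localuseraccount -> good | vomslocalgroup
vomslocalgroup -> vomspoolaccount | poolaccount
vomspoolaccount -> good | poolaccount
poolaccount -> good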

And, in the end, I was able to keep myself out of the local user grid-mapfile, so I have the full array of VOMS roles for myself, while the local users are cosily tucked up in their local account areas.

Upgrade to gLite 3.1 - Notebook

It was well over a year since we'd done a "from the top" install of the CE, so a few things were different:
  • The information system has been re-branded. It's now configured in /opt/glite/etc/gip, although many of the plugins are still running from /opt/lcg.
  • The CE information system is upgraded to use the BDII (on 2170).
  • The site BDII also now uses a wrapper script to get all information, rather than coding the list of GRISs/BDIIs to query (GIP file:///opt/glite/libexec/glite-info-wrapper).
  • LCAS and LCMAPS now also run out of /opt/glite.
  • Pool account mappings are now done to a random pool account, rather than the "next" free one. In addition the hardlink syntax used for assigning a pool account to a DN has changed slightly (using colons to indicate VOMS attributes after the DN).

Funeral March for the Lost CE



So, here's the post-mortem on the CE hard crash last Wednesday. At about 1700 the load on the CE started to ramp up, quickly rising to almost 100. I could see this happening just as I was about to go home (typical!), so I indulged in a frantic bout of process killing to reduce the load and bring the CE back under control. However, despite my best efforts, the CE crashed hard at 1800 (hence the gap in the ganglia plot).

When the machine rebooted, the gatekeeper restarted and the load began to rise again. I then spent a frantic couple of hours doing everything I could to reduce the load and get the CE back on an even keel - made very hard by the fact that, with load averages quickly rising to 60+, the machine was extremely sluggish.

I shut down R-GMA and turned off the mail server, to no avail. I killed off queued jobs in the batch system, even got as far as disabling VOs and banning users whose jobs I had cancelled. I even got so desperate as to firewall the gatekeeper off from all but the ScotGrid RB! Although I could slow the load increase by doing this, by 10pm it became clear that something dreadful had happened to the gatekeeper. Every gatekeeper process which was forked stalled, consuming CPU and managing to do absolutely nothing. Getting no response, the RB then contacted the CE again, forking off another gatekeeper, and the march to death continued. Reducing the number of users able to contact the CE slowed the rate of resource exhaustion, but could not stop it. Clearly something utterly evil had happened to the gatekeeper state.

At this point I became convinced that nothing could be done to save the remaining queued or running jobs and that the site was going down. I started to think instead about moving our March downtime forwards, to do the SL4 upgrades, and to prise the CE and the batch system apart. And of course, that is just what we did at the end of last week.

Friday, February 22, 2008

Acrobat ate my disk servers!

Glasgow is finally out of downtime. GS worked his grid-fu and managed to upgrade lots to SL4 - admittedly some things (R-GMA) weren't a goer. APEL accounting could be broken for a while as we've now split the CE (new home: svr021) from the Torque server (still on svr016). My 'simple' job was to take care of the DPM servers...

Simple enough: we hacked the YAIM site-info.def stuff apart into services/ and vo.d/ - easy. A few gotchas, as cfengine was once again reluctant to create the symlinks on the target nodes (though creating the symlinks on the master and replicating those works fine), which we thought might be fixed by an upgrade of cfengine from 2.1.22 to 2.2.3. Big mistake - it broke cfengine's HostRange function.

so we have
 dpmdisk = ( HostRange(disk,032-036) HostRange(disk,038-041) ) 

but cfengine complained that

SRDEBUG FuzzyHostParse(disk,032-041) succeeded for disk033.gla.scotgrid.ac.uk
SRDEBUG FuzzyHostMatch: split refhost=disk033.gla.scotgrid.ac.uk into refbase=disk033.gla.scotgrid.ac.uk and cmp=-1
SRDEBUG FuzzyHostMatch(disk,032-041,disk033.gla.scotgrid.ac.uk) failed


Now, I'm not sure if this is down to the short hostname vs FQDN problem - I've hit a similar issue when copying iptables configs out:
$(skel)/nat/etc/iptables.$(host) mode=0600 dest=/etc/iptables define=newiptables type=sum 
needs iptables.host.gla.scotgrid.ac.uk rather than just iptables.host in the master repo.

Anyway, this all seems trivial compared to the hassle with the latest SLC 4X updates that got mirrored up to the servers overnight (the disk servers run SLC4 rather than SL4, as the Areca RAID card drivers are compiled in) - dpm-qryconf kept failing with
send2nsd: NS002 - send error : No valid credential found
and yet the certificates were there and valid - openssl verify ... returned OK, the dates were valid, NTP was installed, etc. The DPM log showed
dpm_serv: Could not establish security context: _Csec_recv_token: Connection dropped by remote end ! 


The really frustrating thing was that the server that I installed from home while munching breakfast (all hail laptops and broadband) worked fine, but those I installed (and reinstalled) later in the office were broken. [hmm. is this a sign that I should stay at home in the mornings and have a leisurely breakfast?]

Puzzlingly, the broken servers had more rpms installed than the working ones. I eventually resorted to installing strace on both boxes and diffing the output of 'strace dpm-qryconf'.
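
i.e. roughly:

strace -o qryconf.trace dpm-qryconf      # on both the working and the broken box
diff qryconf.trace.good qryconf.trace.bad | less   # after renaming and copying one across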

The failing one had a big chunk of:

open("/lib/tls/i686/sse2/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/tls/i686/sse2", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/lib/tls/i686/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/tls/i686", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/lib/tls/sse2/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/tls/sse2", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/lib/tls/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/tls", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/lib/i686/sse2/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/i686/sse2", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/lib/i686/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/i686", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/lib/sse2/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/sse2", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/lib/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/lib/tls/i686/sse2/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/tls/i686/sse2", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/usr/lib/tls/i686/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/tls/i686", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/usr/lib/tls/sse2/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/tls/sse2", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/usr/lib/tls/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/tls", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/usr/lib/i686/sse2/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/i686/sse2", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/usr/lib/i686/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/i686", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/usr/lib/sse2/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/sse2", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/usr/lib/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0

whereas the working one didn't call this at all.

I was also bemused as to why acroread had been installed on the server and, more annoyingly, why I couldn't uninstall it.

Yep - someone (step up to the podium Jan Iven) had mispackaged the SLC acroread 8.1.2 update...

rpm -qp ./acroread-8.1.2-1.slc4.i386.rpm --provides
warning: ./acroread-8.1.2-1.slc4.i386.rpm: V3 DSA signature: NOKEY, key ID 1d1e034b
2d.x3d
3difr.x3d
ADMPlugin.apl
Accessibility.api
AcroForm.api
Annots.api
DVA.api
DigSig.api
EFS.api
EScript.api
HLS.api
MakeAccessible.api
Multimedia.api
PDDom.api
PPKLite.api
ReadOutLoud.api
Real.mpp
SaveAsRTF.api
SearchFind.api
SendMail.api
Spelling.api
acroread-plugin = 8.1.2-1.slc4
checkers.api
drvOpenGL.x3d
drvSOFT.x3d
ewh.api
libACE.so
libACE.so(VERSION)
libACE.so.2.10
libACE.so.2.10(VERSION)
libAGM.so
libAGM.so(VERSION)
libAGM.so.4.16
libAGM.so.4.16(VERSION)
libAXE8SharedExpat.so
libAXE8SharedExpat.so
libAXE8SharedExpat.so(VERSION)
libAXSLE.so
libAXSLE.so
libAXSLE.so(VERSION)
libAXSLE.so(VERSION)
libAdobeXMP.so
libAdobeXMP.so
libAdobeXMP.so(VERSION)
libAdobeXMP.so(VERSION)
libBIB.so
libBIB.so(VERSION)
libBIB.so.1.2
libBIB.so.1.2(VERSION)
libBIBUtils.so
libBIBUtils.so(VERSION)
libBIBUtils.so.1.1
libBIBUtils.so.1.1(VERSION)
libCoolType.so
libCoolType.so(VERSION)
libCoolType.so.5.03
libCoolType.so.5.03(VERSION)
libJP2K.so
libJP2K.so
libJP2K.so(VERSION)
libResAccess.so
libResAccess.so(VERSION)
libResAccess.so.0.1
libWRServices.so
libWRServices.so(VERSION)
libWRServices.so.2.1
libadobelinguistic.so
libadobelinguistic.so
libadobelinguistic.so(VERSION)
libahclient.so
libahclient.so
libahclient.so(VERSION)
libcrypto.so.0.9.7
libcrypto.so.0.9.7
libcurl.so.3
libdatamatrixpmp.pmp
libextendscript.so
libextendscript.so
libgcc_s.so.1
libgcc_s.so.1(GCC_3.0)
libgcc_s.so.1(GCC_3.3)
libgcc_s.so.1(GCC_3.3.1)
libgcc_s.so.1(GCC_3.4)
libgcc_s.so.1(GCC_3.4.2)
libgcc_s.so.1(GCC_4.0.0)
libgcc_s.so.1(GLIBC_2.0)
libicudata.so.34
libicudata.so.34
libicui18n.so.34
libicuuc.so.34
libicuuc.so.34
libpdf417pmp.pmp
libqrcodepmp.pmp
librt3d.so
libsccore.so
libsccore.so
libssl.so.0.9.7
libssl.so.0.9.7
libstdc++.so.6
libstdc++.so.6(CXXABI_1.3)
libstdc++.so.6(CXXABI_1.3.1)
libstdc++.so.6(GLIBCXX_3.4)
libstdc++.so.6(GLIBCXX_3.4.1)
libstdc++.so.6(GLIBCXX_3.4.2)
libstdc++.so.6(GLIBCXX_3.4.3)
libstdc++.so.6(GLIBCXX_3.4.4)
libstdc++.so.6(GLIBCXX_3.4.5)
libstdc++.so.6(GLIBCXX_3.4.6)
libstdc++.so.6(GLIBCXX_3.4.7)
nppdf.so
prcr.x3d
tesselate.x3d
wwwlink.api
acroread = 8.1.2-1.slc4


Yep, that's right - RPM had decided that acroread was a dependency, because the mispackaged rpm "provides" all those library sonames (libstdc++.so.6 included). Grr. The workaround: re-mirror the CERN SLC repo (no, they hadn't fixed it upstream yet), manually remove the offending rpm, and rebuild the metadata with 'createrepo'.
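
i.e. on the mirror host, something along these lines (the mirror path is a placeholder for our local layout):

# drop the broken package from the local mirror and regenerate the repo metadata
rm /var/www/html/mirror/cern/slc4X/i386/RPMS/acroread-8.1.2-1.slc4.i386.rpm
createrepo /var/www/html/mirror/cern/slc4X/i386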

Then we made sure that the rebuilt nodes only ever looked at our local repository rather than the primary CERN/DAG ones (thanks to kickstart %post and cfengine).

Finally we got YAIM to *almost* run - it was failing on LCMAPS and grid-mapfile creation (fixed by dropping support for a certain VO). An easy fix in comparison.
Anyway - the DPM is up and running and seems OK. Roll on the next SAM tests... (or real users)

Phew.

Grid Middleware should not be this hard to install!