Monday, February 25, 2008

Dem info system blues

I fixed a problem on the CE information system tonight. YAIM had gone a little screwy and incorrectly written the lcg-info-dynamic-scheduler.conf file, so I had added the lrms_backend_cmd parameter myself as:

lrms_backend_cmd: /opt/lcg/libexec/lrmsinfo-pbs -h svr016.gla.scotgrid.ac.uk

Adding the host seemed sensible as the CE and the batch system don't run on the same node, right? Wrong! The host parameter ends up being passed down to "qstat -f HOST", which is a broken command - we ended up with zeros everywhere for queued and running jobs and, consequently, a large stack of biomed jobs we are unlikely ever to run.
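Presumably the right fix is simply to drop the host flag and let the Torque client on the CE work out where the server lives. A minimal sketch (the server_name path varies between Torque builds, so treat it as illustrative):

lrms_backend_cmd: /opt/lcg/libexec/lrmsinfo-pbs
# and on the CE, point the Torque client at the batch server:
echo "svr016.gla.scotgrid.ac.uk" > /var/spool/pbs/server_name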

I raised the obligatory GGUS ticket: https://gus.fzk.de/pages/ticket_details.php?ticket=33313

To VOMS or not to VOMS? That is the question (for LCMAPS...)

Our advice to local users of the cluster has traditionally been not to use VOMS credentials. This is to ensure that they are mapped in the batch system to their local account, rather than to a pool account derived from their VOMS attributes (mappings to local accounts are maintained in the grid-mapfile-local file). In the default configuration of LCMAPS, VOMS pool account mappings are made before the grid-mapfile, which is now just a fallback.

However, I could not simply reverse the order of the LCMAPS plugins as this would undo all the good which VOMS brings and move everyone back to a single fixed or pool account mapping no matter what their VOMS credentials (this would probably have affected me worse than anyone as I flit between atlas, atlas/Role=production, vo.scotgrid.ac.uk and dteam!).

So, for local users grid-proxy-init seemed to be the way to go, even if I knew this would come back and be a problem later. However, later became earlier as soon as I started to test the gLite-WMS - here it turns out you must use a VOMS proxy. Simple grid proxies just don't work anymore.

Finally, puzzling over the very poor LCMAPS documentation and staring at the configuration script, I managed to solve the problem by:

  1. First running a local account plugin against a grid-mapfile which only contains our local user accounts.
  2. Then running the VOMS plugins as usual.
  3. Finally, running the grid-mapfile plugin against the usual /etc/grid-security/grid-mapfile.

This was almost too easy to be true - and indeed it turns out not to be quite that simple, as you hit a bug in LCMAPS which means you cannot use the same module twice - so having lcmaps_localaccount.mod twice is not possible. However, it turns out that one can do it if the module is renamed and physically copied. This works, so we now have an lcmaps_localaccount.mod and an lcmaps_localuseraccount.mod - exactly the same bytes, different names! (To be strictly accurate we have two copies of liblcmaps_localaccount.so.0.0.0, to which these links point.)
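The module duplication itself is just a copy and a symlink (a sketch - the modules directory depends on your architecture and gLite version, so check your own install):

cd /opt/glite/lib/modules   # or lib64/modules on x86_64
cp liblcmaps_localaccount.so.0.0.0 liblcmaps_localuseraccount.so.0.0.0
ln -sf liblcmaps_localuseraccount.so.0.0.0 lcmaps_localuseraccount.mod

The new lcmaps_localuseraccount.mod then goes first in the lcmaps.db policy, with the local-users-only grid-mapfile as its gridmapfile argument, followed by the VOMS plugins and finally the ordinary grid-mapfile plugin.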

And, in the end, I was able to keep myself out of the local user grid-mapfile, so I have the full array of VOMS roles for myself, while the local users are cosily tucked up in their local account areas.

Upgrade to gLite 3.1 - Notebook

It was well over a year since we'd done a "from the top" install of the CE, so a few things were different:
  • The information system has been re-branded. It's now configured in /opt/glite/etc/gip, although many of the plugins are still running from /opt/lcg.
  • The CE information system is upgraded to use the BDII (on 2170).
  • The site BDII also now uses a wrapper script to get all information, rather than hard-coding the list of GRISs/BDIIs to query (GIP file:///opt/glite/libexec/glite-info-wrapper).
  • LCAS and LCMAPS now also run out of /opt/glite.
  • Pool account mappings are now done to a random pool account, rather than the "next" free one. In addition, the hardlink syntax used for assigning a pool account to a DN has changed slightly (using colons to indicate VOMS attributes after the DN) - a quick way to inspect the leases is sketched below.
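For checking who is leased to what under the new scheme (assuming the standard gridmapdir location):

# pool account leases are hardlinks: files sharing an inode number are the
# (URL-encoded) DN/FQAN lease file and the pool account it is bound to
ls -li /etc/grid-security/gridmapdir | sort -n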

Funeral March for the Lost CE



So, here's the post mortem on the CE hard crash on Wednesday last. About 1700 the load on the CE started to ramp up and it quickly rose to almost 100. I could see this happening just as I was about to go home (typical!) so I started to indulge in a frantic bout of process killing to reduce load and bring the CE back under control. However, despite my best efforts, the CE crashed hard at 1800 (gap in the ganglia plot).

When the machine rebooted, the gatekeeper restarted and again the load began to rise. I then went through a frantic couple of hours trying to do everything I could to reduce the load and try and get the CE back on an even keel - this was made very hard by the fact that with load averages quickly rising to 60+ the machine was extremely sluggish.

I shut down R-GMA and turned off the mail server, to no avail. I killed off queued jobs in the batch system, even got as far as disabling VOs and banning users whose jobs I had cancelled. I even got so desperate as to firewall the gatekeeper from all but the ScotGrid RB! But although I could slow down the load increase by doing this, by 10pm it became clear that something dreadful had happened to the gatekeeper. Every gatekeeper process which was forked stalled, consuming CPU and managing to do absolutely nothing. As there was no response, the RB then contacted the CE again, forking off another gatekeeper, and the march to death continued. If I reduced the number of users able to contact the CE this slowed down the rate of resource exhaustion, but could not stop it. Clearly something utterly evil had happened to the gatekeeper state.
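The firewalling itself needs nothing cleverer than a couple of rules on the gatekeeper port (a sketch - the RB hostname here is a placeholder, not our actual RB):

# allow only the RB to reach the gatekeeper, drop everyone else
# (-I inserts at the top, so the ACCEPT ends up above the DROP)
iptables -I INPUT -p tcp --dport 2119 -j DROP
iptables -I INPUT -p tcp --dport 2119 -s rb.example.ac.uk -j ACCEPT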

At this point I became convinced that nothing could be done to save the remaining queued or running jobs and that the site was going down. I started to think instead about moving our March downtime forwards, to do the SL4 upgrades, and to prise the CE and the batch system apart. And of course, that is just what we did at the end of last week.

Friday, February 22, 2008

Acrobat ate my disk servers!

Glasgow is finally out of downtime. GS worked his grid-fu and managed to upgrade lots to SL4 - admittedly some things (R-GMA) weren't a goer. APEL accounting could be broken for a while as we've now split the CE (new home = svr021) and the Torque server (still on svr016). My 'simple' job was to take care of the DPM servers...

Simple enough: we hacked the YAIM site-info.def stuff apart into services/ and vo.d/ - easy. A few gotchas, as cfengine was once again reluctant to create the symlinks on the target nodes (although creating the symlinks on the master and replicating those works fine), which we thought might be fixed by an upgrade of cfengine from 2.1.22 to 2.2.3. Big mistake. It broke cfengine's HostRange function.

So we have
 dpmdisk = ( HostRange(disk,032-036) HostRange(disk,038-041) ) 

but cfengine complained that

SRDEBUG FuzzyHostParse(disk,032-041) succeeded for disk033.gla.scotgrid.ac.uk
SRDEBUG FuzzyHostMatch: split refhost=disk033.gla.scotgrid.ac.uk into refbase=disk033.gla.scotgrid.ac.uk and cmp=-1
SRDEBUG FuzzyHostMatch(disk,032-041,disk033.gla.scotgrid.ac.uk) failed


Now, I'm not sure if this is due to the short hostname vs FQDN problem - I've hit a similar issue when I want to copy iptables configs off:
$(skel)/nat/etc/iptables.$(host) mode=0600 dest=/etc/iptables define=newiptables type=sum 
needs iptables.host.gla.scotgrid.ac.uk not just iptables.host on the master repo.
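If the short hostname vs FQDN theory holds, one unglamorous fallback would be to drop HostRange and simply enumerate the hosts in the group (a sketch, using the same disk servers as above):

 dpmdisk = ( disk032 disk033 disk034 disk035 disk036 disk038 disk039 disk040 disk041 )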

Anyway, this all seems trivial compared to the hassle with the latest SLC 4X that got mirrored up to the servers overnight (the disk servers run SLC4 rather than SL4 as the Areca RAID card drivers are compiled in) - dpm-qryconf kept failing with
send2nsd: NS002 - send error : No valid credential found
and yet the certificates were there and valid - openssl verify ... returned OK, dates were valid, NTP installed etc. The DPM log showed
dpm_serv: Could not establish security context: _Csec_recv_token: Connection dropped by remote end ! 


The really frustrating thing was that the server that I installed from home while munching breakfast (all hail laptops and broadband) worked fine, but those I installed (and reinstalled) later in the office were broken. [hmm. is this a sign that I should stay at home in the mornings and have a leisurely breakfast?]

Puzzling was the fact that the broken servers had more rpms installed than the working ones, so I eventually resorted to installing strace on both boxes and diffing the output of 'strace dpm-qryconf'.
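Roughly what that looks like (a sketch; the trace file names are illustrative):

# on each box:
strace -f -o /tmp/qryconf.trace dpm-qryconf
# then copy the traces to one place and diff them:
diff qryconf.good.trace qryconf.bad.trace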

The failing one had a big chunk of:

open("/lib/tls/i686/sse2/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/tls/i686/sse2", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/lib/tls/i686/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/tls/i686", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/lib/tls/sse2/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/tls/sse2", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/lib/tls/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/tls", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/lib/i686/sse2/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/i686/sse2", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/lib/i686/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/i686", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/lib/sse2/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib/sse2", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/lib/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/usr/lib/tls/i686/sse2/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/tls/i686/sse2", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/usr/lib/tls/i686/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/tls/i686", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/usr/lib/tls/sse2/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/tls/sse2", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/usr/lib/tls/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/tls", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/usr/lib/i686/sse2/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/i686/sse2", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/usr/lib/i686/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/i686", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/usr/lib/sse2/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/sse2", 0xffff9028) = -1 ENOENT (No such file or directory)
open("/usr/lib/libstdc++.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0

whereas the working one didn't call this at all. This is the dynamic linker hunting for (and failing to find) a 32-bit libstdc++.so.6 - which turned out to be the clue.

I was also bemused as to why acroread had been installed on the server and, more annoyingly, why I couldn't uninstall it.

Yep - someone (step up to the podium Jan Iven) had mispackaged the SLC acroread 8.1.2 update...

rpm -qp ./acroread-8.1.2-1.slc4.i386.rpm --provides
warning: ./acroread-8.1.2-1.slc4.i386.rpm: V3 DSA signature: NOKEY, key ID 1d1e034b
2d.x3d
3difr.x3d
ADMPlugin.apl
Accessibility.api
AcroForm.api
Annots.api
DVA.api
DigSig.api
EFS.api
EScript.api
HLS.api
MakeAccessible.api
Multimedia.api
PDDom.api
PPKLite.api
ReadOutLoud.api
Real.mpp
SaveAsRTF.api
SearchFind.api
SendMail.api
Spelling.api
acroread-plugin = 8.1.2-1.slc4
checkers.api
drvOpenGL.x3d
drvSOFT.x3d
ewh.api
libACE.so
libACE.so(VERSION)
libACE.so.2.10
libACE.so.2.10(VERSION)
libAGM.so
libAGM.so(VERSION)
libAGM.so.4.16
libAGM.so.4.16(VERSION)
libAXE8SharedExpat.so
libAXE8SharedExpat.so
libAXE8SharedExpat.so(VERSION)
libAXSLE.so
libAXSLE.so
libAXSLE.so(VERSION)
libAXSLE.so(VERSION)
libAdobeXMP.so
libAdobeXMP.so
libAdobeXMP.so(VERSION)
libAdobeXMP.so(VERSION)
libBIB.so
libBIB.so(VERSION)
libBIB.so.1.2
libBIB.so.1.2(VERSION)
libBIBUtils.so
libBIBUtils.so(VERSION)
libBIBUtils.so.1.1
libBIBUtils.so.1.1(VERSION)
libCoolType.so
libCoolType.so(VERSION)
libCoolType.so.5.03
libCoolType.so.5.03(VERSION)
libJP2K.so
libJP2K.so
libJP2K.so(VERSION)
libResAccess.so
libResAccess.so(VERSION)
libResAccess.so.0.1
libWRServices.so
libWRServices.so(VERSION)
libWRServices.so.2.1
libadobelinguistic.so
libadobelinguistic.so
libadobelinguistic.so(VERSION)
libahclient.so
libahclient.so
libahclient.so(VERSION)
libcrypto.so.0.9.7
libcrypto.so.0.9.7
libcurl.so.3
libdatamatrixpmp.pmp
libextendscript.so
libextendscript.so
libgcc_s.so.1
libgcc_s.so.1(GCC_3.0)
libgcc_s.so.1(GCC_3.3)
libgcc_s.so.1(GCC_3.3.1)
libgcc_s.so.1(GCC_3.4)
libgcc_s.so.1(GCC_3.4.2)
libgcc_s.so.1(GCC_4.0.0)
libgcc_s.so.1(GLIBC_2.0)
libicudata.so.34
libicudata.so.34
libicui18n.so.34
libicuuc.so.34
libicuuc.so.34
libpdf417pmp.pmp
libqrcodepmp.pmp
librt3d.so
libsccore.so
libsccore.so
libssl.so.0.9.7
libssl.so.0.9.7
libstdc++.so.6
libstdc++.so.6(CXXABI_1.3)
libstdc++.so.6(CXXABI_1.3.1)
libstdc++.so.6(GLIBCXX_3.4)
libstdc++.so.6(GLIBCXX_3.4.1)
libstdc++.so.6(GLIBCXX_3.4.2)
libstdc++.so.6(GLIBCXX_3.4.3)
libstdc++.so.6(GLIBCXX_3.4.4)
libstdc++.so.6(GLIBCXX_3.4.5)
libstdc++.so.6(GLIBCXX_3.4.6)
libstdc++.so.6(GLIBCXX_3.4.7)
nppdf.so
prcr.x3d
tesselate.x3d
wwwlink.api
acroread = 8.1.2-1.slc4


Yep, that's right - RPM had decided that acroread was a dependency (the mispackaged rpm "provides" libstdc++.so.6 and friends, as the list above shows). Grr. Workaround - remirror the CERN SLC repo (no, they hadn't updated since), manually remove the offending rpm, and rebuild the metadata with 'createrepo'.
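Roughly, the repo surgery amounts to this (the local mirror path is illustrative):

rm /var/local/mirror/slc4X/i386/acroread-8.1.2-1.slc4.i386.rpm
createrepo /var/local/mirror/slc4X/i386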

Then we made sure that the nodes were rebuilt and only ever looked to our local repository rather than the primary CERN/DAG ones (thanks to kickstart %post and cfengine).

Finally, we got YAIM to *almost* run - it was failing on LCMAPS and grid-mapfile creation (fixed by unsupporting a certain VO). An easy fix in comparison. Anyway - the DPM is up and running and seems OK. Roll on the next SAM tests... (or real users).

Phew.

Grid Middleware should not be this hard to install!

Durham - DPM v1.6.7, Space Tokens etc

Well, this is the first of hopefully many posts from Durham. So firstly a quick update: Durham seems to have been ticking along nicely - with the exception of a few network and power outages over the last few months. SAM tests are passing and ATLAS, Pheno and many other VOs are successfully running jobs. Versions of the LCG software are a little out of date in places - but this is a work in progress.

So with a little encouragement and help from Greig I finally took the plunge and upgraded our SE to DPM v1.6.7. After getting the yum repositories correct, it was a case of stopping the daemons, running yum update, making the DPM schema changes (we were upgrading from an old version of DPM), and then restarting the daemons... done... or so we thought!

Everything was working, file copies in and out of our SE, reserving space tokens, etc... the only gotcha was that we were publishing "GlueSAStateAvailableSpace: 0"... which wasn't true. After a little investigation, and with the help of Greig, we noticed that /opt/lcg/var/gip/plugin/lcg-info-dynamic-se was pointing to a beta version of lcg-info-dynamic-dpm. Changed this to remove the beta and bingo... all working correctly.
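In other words, something like this on the DPM head node (a sketch - the exact name of the beta plugin file may differ from what is written here):

sed -i 's/lcg-info-dynamic-dpm-beta/lcg-info-dynamic-dpm/' /opt/lcg/var/gip/plugin/lcg-info-dynamic-se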

I have then set up publishing of the space tokens as shown here, and all is done.

A good day's work, until a major network outage at the JANET/NorMAN level knocked us out most of the night... typical. We seem to have recovered now though, so we should be back on track.

Thursday, February 21, 2008

Thursday Night Status Update

Quick summary of where we are right now:

* YAIM configuration updated and rationalised.

* Batch system has been upgraded to Torque 2.1.9/Maui 3.2.6, running on SL4 x86_64.

* Queues have been reduced to 4, open to most VOs, with queue lengths of 30m, 6h, 3d and 7d.

* CE has been moved to svr021, again running SL4 x86_64.

* Information system has been reconfigured for the new gLite versions. After a minor wobble on the CE, it seems to be working just fine.

* DPM headnode has been upgraded to SL4 x86_64.

And job submission works:


*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://svr022.gla.scotgrid.ac.uk:9000/sx54e7252PGtxJtB4Y2mIg
Current Status: Done (Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: svr021.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q30m
Submitted: Thu Feb 21 23:22:30 2008 GMT
*************************************************************

...

================================================================================

JOB GET OUTPUT OUTCOME

Output sandbox files for the job:
https://svr022.gla.scotgrid.ac.uk:9000/sx54e7252PGtxJtB4Y2mIg
have been successfully retrieved and stored in the directory:
/tmp/jobOutput/gla012_sx54e7252PGtxJtB4Y2mIg

================================================================================


And there was much rejoicing.

Significant work remains for tomorrow, but very good progress being made.

Glasgow Downtime

As announced on GOCDB, we're taking UKI-SCOTGRID-GLASGOW down until 17:00 local time tomorrow (Friday) to bring forward the maintenance we'd planned for March. This was due to an unexpected CE failure last night that meant the queues were empty.

Tuesday, February 19, 2008

DPM and ATLAS Space Tokens

So, here's the definitive guide to enabling space tokens for ATLAS:
  1. Assign the token to atlas/Role=production using the dpm-reservespace command. (Example here; there is also a sketch after this list.)
  2. Create the directory where DDM will put the files, which is the normal path for ATLAS, plus the name of the spacetoken in lower case. e.g.

    • dpns-mkdir /dpm/gla.scotgrid.ac.uk/home/atlas/atlasdatadisk

  3. Now change group ownership of this directory to atlas/Role=production, chmod it to 775 and finally add two ACLs which will mean the entire tree will be writable by production roles:


    • dpns-chgrp atlas/Role=production /dpm/gla.scotgrid.ac.uk/home/atlas/atlasdatadisk

    • dpns-chmod 775 /dpm/gla.scotgrid.ac.uk/home/atlas/atlasdatadisk

    • dpns-setacl -m d:g:atlas/Role=production:7,m:7 /dpm/gla.scotgrid.ac.uk/home/atlas/atlasdatadisk

    • dpns-setacl -m g:atlas/Role=production:7,m:7 /dpm/gla.scotgrid.ac.uk/home/atlas/atlasdatadisk


  4. Enable the space token publisher (this only needs to be done once - subsequent tokens are picked up automatically). Instructions here.
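Putting step 1 together, the reservation looks something like this (size and lifetime are illustrative - the same command appears in the space token roles post further down the page):

dpm-reservespace --gspace 5T --lifetime Inf --group atlas/Role=production --token_desc ATLASDATADISK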

Friday, February 15, 2008

New gLite 3.0 WMS at Glasgow

The Glaswegians now have a gLite 3.0 workload management server running on an SL3 machine.

Installation was reasonably easy; points of note being that a gLite user must exist, and that the gLite-WMSLB meta-package does not install the gLite-yaim-lb package. 
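In practice that boiled down to two manual steps on top of the metapackage install (a sketch - the account options and installer command are assumptions, not a verbatim record):

useradd -m glite                 # the glite service account the WMS expects
yum install glite-yaim-lb        # or apt-get on a gLite 3.0-era SL3 node; the WMSLB metapackage doesn't pull this in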

Jobs submitted on the Glasgow cluster with glite-wms-job-submit will now automatically use the local WMS server. Instructions on how to submit jobs via the WMS are available here and here.

Thursday, February 14, 2008

DPM SRMv2 Tweaks

Got back from holiday to find that FDR transfers to Glasgow were failing with a "permission denied" error. Looking through the logs, it seemed that the srmMkDir call was failing (unlike SRMv1, you need to make the path before transferring data into the SE).

However, it seems the default umask for DPM srmMkDir is 022, which leaves the directory unwritable. As I had tested the /dpm/gla.scotgrid.ac.uk/home/atlas/atlasdatadisk path myself, I had left it unwritable by Mario's certificate.

I fixed this with a dpns-chmod, but then I wrote a script to patch up the space token directory tree area using DPM ACLs to ensure the directories are group writable by the production role.
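For a single directory the manual fix is just the same chmod/ACL incantation as in the space token guide above:

dpns-chmod 775 /dpm/gla.scotgrid.ac.uk/home/atlas/atlasdatadisk
dpns-setacl -m d:g:atlas/Role=production:7,m:7 /dpm/gla.scotgrid.ac.uk/home/atlas/atlasdatadisk
dpns-setacl -m g:atlas/Role=production:7,m:7 /dpm/gla.scotgrid.ac.uk/home/atlas/atlasdatadisk

The script linked below does essentially this across the whole token directory tree.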

You can get the script here: http://www.physics.gla.ac.uk/~graeme/atlas/scripts/atlas-dpm-token-fix.sh.

/dev/scotgrid - Buffer Overflow

Want more Glasgow news? We've also started a local logbook so we should have a changelog for the system. Not that we didn't keep every small detail documented before anyway. Ahem.

Wednesday, February 13, 2008

ganglia gmond

During some testing for next month's outage, we'd rebuilt several nodes. One thing I noticed was that the NAT boxes had stopped reporting into ganglia. We'd had something similar before with an older version of gmond ignoring the 'mcast_if' parameter (hey, the alternative is to set up the routing tables) - the clunky 'copy over a known newer binary' approach wasn't going to be sustainable and the sf.net download only had i386 packages.

However, kudos to the ganglia developers - one stupidly simple 'rpmbuild' and lo, a pile of x86_64 rpms ready to be copied into the cluster repo directory. Some cfengine voodoo and zip - all disk servers (including the new shiny 48T box) and NAT boxes are reporting in. Some of the graphs took a wobble but we're all present and correct with 168 machines in the pool.
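For anyone else doing it, the rebuild really is about as simple as this (version number and repo path are illustrative):

rpmbuild --rebuild ganglia-3.0.x.src.rpm
cp /usr/src/redhat/RPMS/x86_64/ganglia-*.rpm /path/to/cluster/repo/x86_64/
createrepo /path/to/cluster/repo/x86_64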

Friday, February 08, 2008

YUM Updated Mystery Solved

Upgrading the cluster recently has brought nothing but trouble, with seeming conflicts between python-devel and python. Hacking and slashing through this on the UI, I have now realised why. The conflict comes not from the x86_64 python RPM, but from the extra i386 python RPM we install to provide a 32-bit python (needed by LCG modules like the python LFC plugin).

To avoid having to present the whole i386 repo to YUM, we'd picked out a few choice i386 packages and dropped them into the local cluster repo. However, in the meantime the i386 python had been updated and our version remained old and stale - causing the unsatisfiable dependency.
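A quick way to spot this sort of thing in future (a sketch):

# show every installed python package with its architecture - a stale i386 copy stands out
rpm -q python --queryformat '%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n'
# then drop the updated i386 rpm into the local repo and rebuild the metadata
createrepo /path/to/local/repo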

I have now updated and rebuilt the repo and things seem to upgrade smoothly at last.

However, roll on ye native 64-bit middleware. This is too much of a pain at times...

Tuesday, February 05, 2008

ATLAS Space Token Roles

Following a discussion with ATLAS people I've now clarified that SRMv2 writing will happen using the atlas/Role=production VOMS role.

Therefore sites should restrict access to the ATLASMCDISK and ATLASDATADISK space tokens to this role. To do this, release and then recreate the space reservation:

# dpm-releasespace --token_desc ATLASDATADISK
# dpm-reservespace --gspace 10T --lifetime Inf --group atlas/Role=production --token_desc ATLASDATADISK

Greig's python snippet to list spaces is very helpful.

Monday, February 04, 2008

cluster glue

hmm. Freudian? I originally typed 'cluster clue' as the title.

Regular readers will be aware that we run both ganglia and cfengine. However, even our wonderful rebuild system (YPF) doesn't quite close off all the holes in the fabric monitoring. Case in point - I reimaged a few machines and noticed that ganglia wasn't quite right. It had copied in the right gmond.conf for that group of machines but hadn't checked that it was listed in the main gmetad.conf as a data_source.

Cue a short Perl script (soon to be available on the ScotGrid wiki) to do a sanity check, but it's this sort of non-joined-upness of all the bits that really annoys me about clusters and distributed systems.
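The check itself is trivial - something along these lines would do (sketched in shell rather than the Perl we actually use; host list and config path are illustrative):

for h in node001 node002 nat001; do
    grep -q "data_source.*$h" /etc/gmetad.conf || echo "$h missing from gmetad.conf"
done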

Are there any better tools? (Is Quattor the saviour for this type of problem?)

/rant

Saturday, February 02, 2008

nagios event handlers

I've gone over to the Dark Side (no, not Python) and have just implemented my first nagios event handler - this *should* automatically fix the problem we have with our Dirvish backup scripts, namely that we end up with too many copies of the database dumps held.

So - cue nagios' event handlers. The only wrinkle is that nagios (and hence the event handler) runs as the nagios user, and most sysadmin stuff needs root. If you're willing to trust it with sudo then it should be OK.
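As a sketch of the wiring (every name here is illustrative rather than our actual config):

define service {
    use                  generic-service
    host_name            svr031
    service_description  dirvish-dump-count
    check_command        check_dump_count
    event_handler        fix-dump-count
    }

define command {
    command_name  fix-dump-count
    command_line  sudo /usr/local/bin/fix-dump-count.sh $SERVICESTATE$ $SERVICESTATETYPE$
    }

And the corresponding /etc/sudoers line, so the handler can do its root-only cleanup:

nagios  ALL = NOPASSWD: /usr/local/bin/fix-dump-count.sh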

Friday, February 01, 2008

DPM Storage Token Information Published

It was a bit fiddly, but with 4 fixes to Michel's script, Glasgow are publishing the space token information for ATLASDATADISK.

I documented the workarounds in the LCG Twiki.