Friday, August 29, 2008

nanocmos + lcas = FAIL

While working on an unrelated issue on svr021, I noticed an edg-mkgridmap error in the logfile:

Aug 29 05:28:14 svr021 edg-mkgridmap[6693]: voms search(https://svr029.gla.scotgrid.ac.uk:8443/voms/vo.scotgrid.ac.uk/services/VOMSCompatibility?method=getGridmapUsers): Internal Server Error
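
For reference, the quickest way to reproduce this by hand is to re-run edg-mkgridmap in the foreground on the CE rather than waiting for the next cron run. A rough sketch - the paths are the usual defaults on our gLite install, and if I remember right --safe stops it clobbering the existing grid-mapfile when the VOMS query fails, so adjust to taste:

# Re-run the gridmap generation by hand to watch the VOMS query fail in real time
# (paths are the usual gLite defaults - adjust for the local install)
/opt/edg/sbin/edg-mkgridmap --conf=/opt/edg/etc/edg-mkgridmap.conf \
    --output=/etc/grid-security/grid-mapfile --safe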

Mentioned it to Mike, who promptly went and fixed the issue, only to discover 30 minutes later that we were failing SAM tests - the LCAS VOMS plugin had once again gone fubar and caused globus-gatekeeper to segfault:

Aug 29 12:14:41 svr021 GRAM gatekeeper[662]: Authenticated globus user: [DN REMOVED]
Aug 29 12:14:41 svr021 GRAM gatekeeper[663]: Authenticated globus user: [DN REMOVED]
Aug 29 12:14:41 svr021 kernel: globus-gatekeep[662]: segfault at 0000000000000046 rip 0000000000b86259 rsp 00000000ffff9d98 error 4
Aug 29 12:14:41 svr021 kernel: globus-gatekeep[663]: segfault at 0000000000000046 rip 0000000000b86259 rsp 00000000ffff9d98 error 4


The globus-gatekeeper log has a bit more info:
TIME: Fri Aug 29 12:14:41 2008
PID: 663 -- Notice: 5: Authenticated globus user: [DN REMOVED]
lcas client name: [DN REMOVED]
LCAS 0:
LCAS 1: Initialization LCAS version 1.3.7
allowing empty credentials
LCAS 2: LCAS authorization request
LCAS 0: lcas_userban.mod-plugin_confirm_authorization(): checking banned users in /opt/glite/etc/lcas/ban_users.db
LCAS 0: lcas_plugin_voms-plugin_confirm_authorization_from_x509(): Did not find a matching VO entry in the authorization file
LCAS 0: 2008-08-29.12:14:41 : lcas_plugin_voms-plugin_confirm_authorization_from_x509(): voms plugin failed
LCAS 0: lcas.mod-lcas_run_va(): authorization failed for plugin /opt/glite/lib/modules/lcas_voms.mod
LCAS 0: lcas.mod-lcas_run_va(): failed
LCAS 0: lcas_plugin_voms-plugin_confirm_authorization_from_x509(): Did not find a matching VO entry in the authorization file
LCAS 0: 2008-08-29.12:14:41 : lcas_plugin_voms-plugin_confirm_authorization_from_x509(): voms plugin failed
LCAS 0: lcas.mod-lcas_run_va(): authorization failed for plugin /opt/glite/lib/modules/lcas_voms.mod
LCAS 0: lcas.mod-lcas_run_va(): failed
JMA 2008/08/29 12:14:45 GATEKEEPER_JM_ID 2008-08-29.11:14:39.0000014519.0000000000 JM exiting

As before, commenting out the lcas_voms.mod in /opt/glite/etc/lcas/lcas.db allows it to work, at the expense of losing VOMS roles.
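
For anyone following along, the workaround is a one-liner in lcas.db - roughly this (the pluginargs are from memory, so treat the paths as indicative rather than gospel):

# /opt/glite/etc/lcas/lcas.db
pluginname=lcas_userban.mod,pluginargs=ban_users.db
# lcas_voms.mod disabled until we work out why it segfaults the gatekeeper
#pluginname=lcas_voms.mod,pluginargs="-vomsdir /etc/grid-security/vomsdir -certdir /etc/grid-security/certificates -authfile /opt/glite/etc/lcas/grid-mapfile -authformat simple"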

We've got it working with the voms module at the moment by altering the ACLs on the VOMS server (svr029) for nanocmos. Now to try and debug the LCAS plugin failure.
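
First step will probably be turning up the LCAS debug level in the gatekeeper's environment so the plugin tells us what it's actually unhappy about. Something along these lines, assuming the standard LCAS debug variables and whatever environment file the gatekeeper init script sources on your install:

# In the environment the gatekeeper starts with (e.g. its init script config)
export LCAS_DEBUG_LEVEL=5                 # maximum verbosity from the LCAS framework
export LCAS_LOG_FILE=/var/log/lcas.log    # send the extra output somewhere sane
service globus-gatekeeper restart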

Wednesday, August 27, 2008

CE-sft-lcg-rm-rep fail

Ho hum - after sorting out the gatekeeper, we still get a SAM fail. Wait a minute...

Checking replication to Central SE (lxdpm101.cern.ch)

Replicate the file from the default SE to lxdpm101.cern.ch

+ lcg-rep -v --vo ops -d lxdpm101.cern.ch lfn:sft-lcg-rm-cr-node114.beowulf.cluster.080827052321.475531
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Source SE type: SRMv1
Destination SE type: SRMv1
httpg://lxdpm101.cern.ch:8443/srm/managerv1: No space left on device
lcg_rep: No space left on device
+ result=1
+ set +x

No space left on device? Grr - don't these people have monitoring? ;-)
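
To be fair, it's easy enough to check from the outside what the information system thinks an SE has free. A quick sketch with lcg-infosites, which prints the available and used space per SE as published in the BDII:

# Ask the information system how much space lxdpm101.cern.ch claims to have
lcg-infosites --vo ops se | grep lxdpm101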

gatekeeper AWOL

Glasgow suffered a 3-4 hour CE outage this evening as the globus-gatekeeper on svr021 had gone AWOL. We failed a few SAM tests before I twigged that the 'connection refused' was coming from our end - 'service globus-gatekeeper restart' nobbled it, but not until we'd failed 7 SAM tests. Damn.
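
Note to self: a trivial check from a UI would have caught this long before SAM did - something like the below (2119 is the standard gatekeeper port; the second test assumes the globus clients are on the UI):

# Is anything listening on the gatekeeper port at all?
telnet svr021.gla.scotgrid.ac.uk 2119

# Or go one better and run a trivial job through the fork jobmanager
globus-job-run svr021.gla.scotgrid.ac.uk/jobmanager-fork /bin/hostname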

Monday, August 25, 2008

Glasgow on the move

Because of the current problems at RAL, Glasgow was nominated as a test peripatetic Tier-2, to see how agile ATLAS production was at moving Tier-2 resources in case of T1 downtime (note this test only works if the cloud services, FTS and LFC, are still running - if these are gone then it's almost impossible, today, to use any of the cloud's Tier-2s).

First off, Glasgow was sent south east, into the NL cloud. Here we found a problem with the input datasets, because input datasets to T2s (which are subscribed without sources) only look for sources within the cloud (this follows the ATLAS computing model). However, the way around this is to specify the associated T1 (for production) as the source and then DQ2 does the work. The panda developers made the change on Friday, so that NIKHEF was specified as the source for inputs to Glasgow. Likewise, for output back to the NL T1, Glasgow's PRODDISK token was specified explicitly as the source.

That done, Glasgow galloped through a couple of hundred jobs for the NL cloud, before they ran out of jobs.

Flushed with this success, we've just shoved Glasgow into the FR cloud for a while, as they still have jobs left to run. Within an hour we were running a couple of hundred jobs.

SAM Failures across scotgrid: Someone else's problem

All three ScotGrid sites have just failed the ATLAS SAM SE tests (atlas_cr, atlas_cp, atlas_del), as have quite a lot of the rest of the UKI-* sites.

Once again this isn't a Tier-2 issue but an upstream problem with the tests themselves:


ATLAS specific test launched from monb003.cern.ch
Checking if a file can be copied and registered to svr018.gla.scotgrid.ac.uk

------------------------- NEW ----------------
srm://svr018.gla.scotgrid.ac.uk/dpm/gla.scotgrid.ac.uk/home/atlas/
+ lcg-cr -v --vo atlas file:/home/samatlas/.same/SE/testFile.txt -l lfn:SE-lcg-cr-svr018.gla.scotgrid.ac.uk-1219649438 -d srm://svr018.gla.scotgrid.ac.uk/dpm/gla.scotgrid.ac.uk/home/atlas/SAM/SE-lcg-cr-svr018.gla.scotgrid.ac.uk-1219649438
Using grid catalog type: lfc
Using grid catalog : lfc0448.gridpp.rl.ac.uk
Using LFN : /grid/atlas/dq2/SAM/SE-lcg-cr-svr018.gla.scotgrid.ac.uk-1219649438
[BDII] sam-bdii.cern.ch:2170: Can't contact LDAP server
lcg_cr: Host is down
+ out_error=1
+ set +x
-------------------- Other endpoint same host -----------
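
The giveaway in the output above is the "[BDII] sam-bdii.cern.ch:2170: Can't contact LDAP server" line - lcg-cr never even got as far as our SE. You can confirm the top-level BDII is the problem with a plain ldapsearch from anywhere (a rough sketch; o=grid is the standard GLUE suffix):

# If the BDII is healthy this returns GLUE entries; a hang or
# "Can't contact LDAP server" means the problem is at CERN, not at the site
ldapsearch -x -H ldap://sam-bdii.cern.ch:2170 -b o=grid \
    '(GlueSEUniqueID=svr018.gla.scotgrid.ac.uk)' GlueSEUniqueID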