Thursday, January 15, 2009

Development / PreProd : The WMS

I thought I would continue my foray into grid middleware installations with another quick blog on the workload management system, or WMS as it's affectionately known. With the old development cluster very much moved/dead we now have dev008 -> dev013 as a sandbox for installs and upgrades. So after the UI install last year, the next piece of the jigsaw was the slightly more heavyweight WMS.

Step 1 was a quick search of the cfagent.conf. This returned the necessary files, links and packages to install. An additional entry was added in the wms groups stanza and cfengine was run immediately using cfagent -qv

The first yum update in fact updated all the certificates from the lcg-CA and lcg-vomscerts package stanzas, and these were successful. It then looked like it attempted to run glite-yaim-core, which could not have been successful as it then threw this error:

Executing script /opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n glite-WMS -n glite-LB...(timeout=0,uid=-1,gid=-1)
(Setting umask to 22)
cfengine:dev009:m/bin/yaim -c -: sh: /opt/glite/yaim/bin/yaim: No such file or directory

It looks like on the first run after a complete re-install it tries to run yaim without installing yaim first, or in fact actually running the WMS package stanzas. Slightly bizarre.
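One way to avoid this kind of ordering problem would be to guard the yaim call so it only fires once the binary is actually on disk. This is only a rough sketch and not what our cfengine config does; the function name run_yaim_if_present is made up for illustration.

```shell
#!/bin/sh
# Hypothetical guard (not part of cfengine or yaim): run the given
# command only if the binary exists and is executable.
run_yaim_if_present() {
    yaim_bin="$1"
    shift
    if [ -x "$yaim_bin" ]; then
        # Pass the remaining arguments straight through to yaim.
        "$yaim_bin" "$@"
    else
        echo "skipping yaim: $yaim_bin is not installed yet" >&2
        return 1
    fi
}
```

It would be called with the same arguments as in the log above, e.g. run_yaim_if_present /opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n glite-WMS -n glite-LB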

However, another run of cfagent -qv seemed to work, as this correctly ran through the WMS package stanzas for glite-WMS, glite-LB and rbwmsmon. Although, in the spirit of grid weirdness, there were a few warnings when it installed Condor just to make you wonder what was going on!

Installing: condor ##################### [ 58/105]WARNING: Multiple network interfaces detected. Condor might not work
cfengine:dev009: properly until you set NETWORK_INTERFACE =
cfengine:dev009: Unable to find a valid Java installation
cfengine:dev009: Java Universe will not work properly until the JAVA
cfengine:dev009: (and JAVA_MAXHEAP_ARGUMENT) parameters are set in the configuration file!
cfengine:dev009: Condor has been installed into:
cfengine:dev009: /opt/condor-6.8.4
cfengine:dev009: In order for Condor to work properly you must set your
cfengine:dev009: CONDOR_CONFIG environment variable to point to your
cfengine:dev009: Condor configuration file:
cfengine:dev009: /opt/condor-6.8.4/etc/condor_config
cfengine:dev009: before running Condor commands/daemons.
cfengine:dev009:

After some internet searching and logging in/out I could see that CONDOR_CONFIG was actually set correctly.

dev009:~# echo $CONDOR_CONFIG
/opt/condor-c/etc/condor_config
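For a slightly more thorough check than echo, something like the sketch below confirms that the variable is set and that the file it points at is readable. The helper name check_condor_config is hypothetical, and it says nothing about whether the contents of the config file are sensible.

```shell
#!/bin/sh
# Hypothetical helper: confirm CONDOR_CONFIG is set and points at a
# readable file. It does not validate the contents of the file.
check_condor_config() {
    if [ -z "${CONDOR_CONFIG:-}" ]; then
        echo "CONDOR_CONFIG is not set" >&2
        return 1
    fi
    if [ ! -r "$CONDOR_CONFIG" ]; then
        echo "CONDOR_CONFIG=$CONDOR_CONFIG is not readable" >&2
        return 2
    fi
    echo "CONDOR_CONFIG=$CONDOR_CONFIG looks fine"
}
```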

Step 2. Install yaim by hand. Then blow dev009 away and let cfengine do the whole lot.

Before running yaim I made sure that dev009 was included in the site-info.def. I first created a file in the node directory within yaim to override the current production WMS, and this was okay for configuring the WMS the first time around. However, successive runs of cfagent overwrote the site-info.def - doh! Mike therefore suggested having a development site-info.def which is copied over the production one each time cfagent is run. This worked a treat.
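In our setup the copy is done by cfengine itself, but the idea can be sketched in plain shell as below. This is a stand-in rather than the actual cfengine stanza, and both the function name sync_site_info and the location of the development copy are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for the cfengine copy: overwrite the live
# site-info.def with the development copy, but only when they differ.
sync_site_info() {
    dev_copy="$1"
    live_copy="$2"
    if [ ! -r "$dev_copy" ]; then
        echo "no development site-info.def at $dev_copy" >&2
        return 1
    fi
    # cmp -s exits non-zero if the files differ or the live copy is missing.
    if ! cmp -s "$dev_copy" "$live_copy" 2>/dev/null; then
        cp -p "$dev_copy" "$live_copy"
    fi
}
```

The second argument would be /opt/glite/yaim/etc/site-info.def; the first is wherever the development version is kept.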

So on with the show with yaim for the WMS: /opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n glite-WMS -n glite-LB

Although yaim appeared to run successfully, there were a few warnings/errors which had to be explained.

cfengine:dev009:m/bin/yaim -c -: WARNING: Only 1 pool account defined for tag 'sgm' of VO VO.PANDA.GSI.DE
cfengine:dev009:m/bin/yaim -c -: users_getspecialgroup: could not find 'sgm' user for VO BIOMED in /opt/glite/yaim/etc/users.conf
cfengine:dev009:m/bin/yaim -c -: users_getspecialprefix: could not find 'sgm' prefix for BIOMED in /opt/glite/yaim/etc/users.conf
cfengine:dev009:m/bin/yaim -c -: users_getspecialgroup: could not find 'prd' user for VO BIOMED in /opt/glite/yaim/etc/users.conf
cfengine:dev009:m/bin/yaim -c -: users_getspecialprefix: could not find 'prd' prefix for BIOMED in /opt/glite/yaim/etc/users.conf
cfengine:dev009:m/bin/yaim -c -: INFO: users_getspecialusers: could not find sgm user for VO BIOMED in /opt/glite/yaim/etc/users.conf
cfengine:dev009:m/bin/yaim -c -: ERROR: Could not determine mapping for tag 'sgm' of VO BIOMED
cfengine:dev009:m/bin/yaim -c -: INFO: users_getspecialusers: could not find prd user for VO BIOMED in /opt/glite/yaim/etc/users.conf
cfengine:dev009:m/bin/yaim -c -: ERROR: Could not determine mapping for tag 'prd' of VO BIOMED
cfengine:dev009:m/bin/yaim -c -: WARNING: No mapping found for "/biomed/Role=lcgadmin" in /tmp/yaim.vF5821
cfengine:dev009:m/bin/yaim -c -: WARNING: No mapping found for "/biomed/Role=production" in /tmp/yaim.vF5821

These are normal errors/warnings and can be explained: the Panda VO does indeed have only one sgm account since there is only one panda account - Dan! The Biomed VO, conversely, has no sgm, prd or admin accounts and only runs as plain users. Therefore, these messages are expected. The next warning was slightly more worrying since these undefined variables could be required.

cfengine:dev009:m/bin/yaim -c -: [Fri Dec 19 13:35:50 2008] [warn] PassEnv variable GLITE_WMS_WMPROXY_WEIGHTS_UPPER_LIMIT was undefined
cfengine:dev009:m/bin/yaim -c -: [Fri Dec 19 13:35:50 2008] [warn] PassEnv variable GLITE_SD_VO was undefined

After some searching on the web, it turns out this is only a warning and yaim will just use the default values. There also seemed to be an issue with the schema file in the WMS BDII section of yaim:

cfengine:dev009:m/bin/yaim -c -: Starting glite-lb-interlogd ...chown: cannot access `/opt/bdii/var': No such file or directory
cfengine:dev009:m/bin/yaim -c -: sed: can't read /opt/bdii/etc/schemas: No such file or directory

The file /opt/bdii/etc/schemas was missing. The fix is to copy the /opt/bdii/doc/schemas.example file to /opt/bdii/etc/schemas and re-run yaim.
The re-run of yaim also fixes the chown of /opt/bdii/var.
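The copy itself is a one-liner, but a slightly more careful version that never clobbers an existing schemas file might look like the sketch below. The function name ensure_bdii_schemas is made up; the paths are the ones from the error messages above, with the BDII root passed as an optional argument.

```shell
#!/bin/sh
# Hypothetical helper: create etc/schemas from doc/schemas.example
# under the BDII root if it is missing. An existing file is left alone.
ensure_bdii_schemas() {
    bdii_root="${1:-/opt/bdii}"
    target="$bdii_root/etc/schemas"
    example="$bdii_root/doc/schemas.example"
    if [ -e "$target" ]; then
        return 0
    fi
    if [ ! -r "$example" ]; then
        echo "cannot find $example" >&2
        return 1
    fi
    mkdir -p "$bdii_root/etc"
    cp "$example" "$target"
}
```

After running ensure_bdii_schemas, yaim still needs to be re-run as described above.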

Step 3. Testing from dev008, the development UI, yielded:

*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://dev009.gla.scotgrid.ac.uk:9000/d0nTvt6udraqpjs0Mx-eOw
Current Status: Done (Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: svr021.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q30m
Submitted: Thu Jan 15 17:19:59 2009 GMT
*************************************************************


Therefore, I conclude: we have a working development WMS on dev009
