Tuesday, January 20, 2009

Development / PreProd : The CE

So, on with the show and the creation of a development/preprod cluster. Next up: the CE on dev machine 10.

Step 1. Again, this involved a quick search of cfagent.conf. An additional entry was added to the CE groups stanza for the dev CE, and cfagent -qv was used to pull down and install the node.

The first yum update, which updated all the certificates, was successful. However, it again looks like glite-yaim-core was not installed successfully the first time around:

/opt/glite/yaim/bin/yaim: No such file or directory

A second run of cfengine caused the error below.

Transaction Check Error: file /usr/share/java/jaf.jar conflicts between attempted installs of geronimo-jaf-1.0.2-api-1.2-11.jpp5 and sun-jaf-1.1-3jpp
file /usr/share/java/jaf_api.jar conflicts between attempted installs of geronimo-jaf-1.0.2-api-1.2-11.jpp5 and sun-jaf-1.1-3jpp

This is a known issue and the fix is to disable the jpackage17 repo like so:

/usr/bin/yum -y install lcg-CE glite-TORQUE_utils --disablerepo=jpackage17-generic

This allowed a full installation of the relevant packages. This was run by hand in the first instance and then was added to the development configuration file.

Step 2. This involved running yaim by hand for the CE and then adding additional stanzas to the development configuration file. So, installing the CE:

/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n lcg-CE -n TORQUE_utils

There appeared to be various quirks/warnings/errors with this. Here is a short summary.

There appeared to be an issue with line 36 of config_apel_pbs.
After inspecting the file, it appears that the line separator in the file was not working as expected.

/opt/glite/yaim/functions/config_apel_pbs: line 36: APEL_DB_PASSWORD: command not found

This is a known bug (https://savannah.cern.ch/bugs/index.php?39014) and was temporarily fixed by removing the additional whitespace before the /.
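For anyone hitting something similar, here is a rough sketch of how such a line could be tracked down. It assumes the problem is a line-continuation backslash followed by stray spaces or tabs (in shell, the backslash must be the very last character on the line to continue it); that is my reading of the symptom, not something the bug report confirms. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical helpers, not part of yaim.

# Print (with line numbers) every line where a backslash is followed
# by trailing spaces or tabs, which breaks shell line continuation.
find_broken_continuations() {
    grep -n '\\[[:space:]][[:space:]]*$' "$1"
}

# Write a cleaned copy to stdout: whitespace after a trailing
# backslash is removed, everything else is left untouched.
fix_broken_continuations() {
    sed 's/\\[[:space:]][[:space:]]*$/\\/' "$1"
}
```

Running find_broken_continuations against /opt/glite/yaim/functions/config_apel_pbs would show whether line 36 is affected; the cleaned output can be compared with the original before replacing anything.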

These errors appeared next but did not seem to cause any problems.

/sbin/ldconfig: /opt/glite/lib/libvomsc_gcc32dbg.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsapi_gcc32.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsc_gcc32pthr.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsapi_gcc32dbg.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsc.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsapi_gcc32pthr.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsapi_gcc32dbgpthr.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsc_gcc32dbgpthr.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsc_gcc32.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsapi.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsapi_nog.so.0 is not a symbolic link


This is due to a known bug: https://savannah.cern.ch/bugs/?42481

INFO: Now creating the grid-mapfile - this may take a few minutes...
voms search(https://voms.gridpp.ac.uk:8443/voms/supernemo.vo.eu-egee.org/Role=lcgadmin/services/VOMSCompatibility?method=getGridmapUsers): /voms/supernemo.vo.eu-egee.org/Role=lcgadmin/services/VOMSCompatibility

voms search(https://voms.gridpp.ac.uk:8443/voms/ukqcd.vo.gridpp.ac.uk/Role=lcgadmin/services/VOMSCompatibility?method=getGridmapUsers): /voms/ukqcd.vo.gridpp.ac.uk/Role=lcgadmin/services/VOMSCompatibility

Exit with error(s) (code=2)


WARNING: It looks like /opt/globus/tmp/gram_job_state may not be on a local filesystem. WARNING: The test for local file systems is not 100% reliable. Ignore the below if this is a false positive.
WARNING: The jobmanager requires state dir to be on a local filesystem
WARNING: Rerun the jobmanager setup script with the -state-dir= option.Creating state file directory.
Done.

find-fork-tools: WARNING: "Cannot locate mpiexec"
find-fork-tools: WARNING: "Cannot locate mpirun"

find-lcgpbs-tools: WARNING: "Cannot locate mpirun"
checking for mpirun... no

Any clues to these errors would be greatly appreciated too.

Currently, when a new CE is configured, yaim attempts to run the function config_gip_vo_tag.
This function attempts to create the VO tags directory and change the permissions on it.
However, this directory is NFS mounted on the Scotgrid cluster and comes pre-configured, so to speak.

INFO: Executing function: config_gip_vo_tag
chmod: changing permissions of `/opt/edg/var/info/atlas': Operation not permitted
chmod: changing permissions of `/opt/edg/var/info/atlas/atlas.list': Operation not permitted
chmod: changing permissions of `/opt/edg/var/info/cms': Operation not permitted
ERROR: Error during the execution of function: config_gip_vo_tag
ERROR: Error during the configuration.Exiting. [FAILED]
ERROR: One of the functions returned with error without specifying it's nature !
INFO: Using locally defined function /opt/glite/yaim/functions/local/config_gip_vo_tag
cfengine controls config_gip_vo_tag. yaim function disabled.

The fix for this issue was to override the function config_gip_vo_tag in the /opt/glite/yaim/functions/local directory to make sure it did not try to change any of the NFS mounted directories.
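A minimal sketch of what such a local override might look like is below. The file path and the message are taken from the yaim output above; the body is an assumption about the simplest thing that would work, not a copy of the actual file.

```shell
# Sketch of /opt/glite/yaim/functions/local/config_gip_vo_tag
# (assumed contents; the real override may differ).
# It replaces the stock function with a no-op so that yaim never
# touches the NFS mounted VO tag directories.
config_gip_vo_tag () {
    echo "cfengine controls config_gip_vo_tag. yaim function disabled."
    return 0
}
```

Because the function returns 0, yaim treats the step as successful and carries on with the rest of the configuration.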

Again the BDII issue reared its ugly head: https://savannah.cern.ch/bugs/index.php?40675

Starting glite-lb-interlogd ...chown: cannot access `/opt/bdii/var': No such file or directory
sed: can't read /opt/bdii/etc/schemas: No such file or directory

The file /opt/bdii/etc/schemas was missing. The fix is to copy the /opt/bdii/doc/schemas.example file to /opt/bdii/etc/schemas and re-run yaim.
The re-run of yaim also fixes the chown of /opt/bdii/var.
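The copy step can be sketched as a small idempotent function, which is roughly the shape it would need to be driven from a configuration tool. The function name and the optional root argument are my own additions for illustration; the default paths are the ones quoted above.

```shell
#!/bin/sh
# Sketch only: ensure_bdii_schemas is a hypothetical helper name.
# Copies schemas.example into place only when etc/schemas is missing,
# so an existing (possibly edited) file is never overwritten.
ensure_bdii_schemas() {
    bdii_root="${1:-/opt/bdii}"
    mkdir -p "$bdii_root/etc"
    if [ ! -f "$bdii_root/etc/schemas" ]; then
        cp "$bdii_root/doc/schemas.example" "$bdii_root/etc/schemas"
    fi
}
```

yaim still needs to be re-run afterwards, as described above.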

All of the above workarounds have now been added to the cf.dev script that is included in the main cfagent.conf script. Thus all the temporary workarounds and development stanzas are kept out of the main script.

Step 3 - Testing:

First off I thought I could use the glite set of commands for the development cluster.

jdl extract:

Requirements = other.GlueCEUniqueID == "devmachine10.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q30m";

submission extract:

-bash-3.00$ glite-wms-job-list-match -a hello.jdl
Connecting to the service https://devmachine9.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server

==================== glite-wms-job-list-match failure ====================
No Computing Element matching your job requirements has been found!
==========================================================================

However, after some initial tests and some thought on the subject, it became apparent that the new CE would have to be entered into the site BDII for this to work.
This made sense, since the WMS queries the site BDII to get information relating to the published queues from the CE.

Therefore, without setting up another site BDII/BDII for the mini cluster, direct job submission via Globus seemed like the way to go for initial testing.

-bash-3.00$ globus-job-submit devmachine10.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs /bin/hostname -f
https://devmachine10.gla.scotgrid.ac.uk:35001/25359/1232379693/
-bash-3.00$ globus-job-status https://devmachine10.gla.scotgrid.ac.uk:35001/25359/1232379693/
PENDING
-bash-3.00$ globus-job-status https://devmachine10.gla.scotgrid.ac.uk:35001/25359/1232379693/
PENDING
-bash-3.00$ globus-job-status https://devmachine10.gla.scotgrid.ac.uk:35001/25359/1232379693/
DONE
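Rather than re-running globus-job-status by hand, the polling could be wrapped in a small loop. This is only a sketch: the function name, the STATUS_CMD override, and the default interval and retry count are all my own assumptions, and it only treats DONE and FAILED as final states.

```shell
#!/bin/sh
# Sketch only: wait_for_globus_job is a hypothetical helper.
# Usage: wait_for_globus_job <job-contact-url> [interval-secs] [max-tries]
# STATUS_CMD can be set to substitute another command for globus-job-status.
wait_for_globus_job() {
    contact="$1"
    interval="${2:-10}"
    max_tries="${3:-30}"
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        status=$("${STATUS_CMD:-globus-job-status}" "$contact")
        case "$status" in
            DONE)   echo "$status"; return 0 ;;
            FAILED) echo "$status"; return 1 ;;
        esac
        tries=$((tries + 1))
        sleep "$interval"
    done
    # Gave up before the job reached a final state.
    echo "TIMEOUT"
    return 2
}
```

Called with the job contact URL returned by globus-job-submit, it prints the final state and returns non-zero on failure or timeout.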

However, when I tried to obtain the job output:

-bash-3.00$ globus-job-get-output https://devmachine10.gla.scotgrid.ac.uk:35001/25359/1232379693/

--- Nothing, Nada! Doh!

After some investigation involving the logs on the CE and the Torque logs, it became apparent that Torque was not allowing job submission from the new CE. After some more investigation, this seemed to be down to a file called hosts.equiv. This is the file that holds the whitelist of hosts that Torque will talk to.

Therefore, after adding the new CE to hosts.equiv and restarting Torque:

-bash-3.00$ globus-job-run devmachine10.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs "/bin/hostname -f"
node192.beowulf.cluster
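For the record, the hosts.equiv change could be scripted along these lines so that it is safe to run repeatedly. The helper name is made up, and /etc/hosts.equiv is an assumed default location; check where the file lives on your Torque server.

```shell
#!/bin/sh
# Sketch only: add_trusted_host is a hypothetical helper.
# Appends the host to the whitelist file unless an identical line
# is already present (grep -x matches whole lines, -F is literal).
add_trusted_host() {
    host="$1"
    file="${2:-/etc/hosts.equiv}"   # assumed default path
    if ! grep -qxF "$host" "$file" 2>/dev/null; then
        echo "$host" >> "$file"
    fi
}

# Example: add_trusted_host devmachine10.gla.scotgrid.ac.uk
# Torque still needs restarting afterwards for the change to take effect.
```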

Woohoo, we have a working CE that allows submission through Globus. Now, to test the CE with WMS submission, we need a dev site BDII. So I think I will install one on the dev UI and make sure we can submit through glite-wms-job-submit.
