Wednesday, January 28, 2009

WMS purging fixed...

Ever since we've had our WMSs installed at Glasgow, we've observed that job purging appears broken. What's supposed to happen is that, when a user retrieves their job's output, the associated sandbox on the WMS is cleaned out. However, users of the ScotGrid WMSs were seeing:

bash-3.00$ glite-wms-job-output https://svr023.gla.scotgrid.ac.uk:9000/IfNak9XhD80im39v5JVGNw

Connecting to the service https://svr023.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server

Warning - JobPurging not allowed
(The Operation is not allowed: Unable to complete job purge)

A ticket was raised and, eventually, we figured out that each WMS needs DN entries in /opt/glite/etc/LB-super-users for both WMSs. On top of that, there's a bug which requires the DNs to be present in two slightly differing formats:

/C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=svr022.gla.scotgrid.ac.uk/emailAddress=grid-certificate@physics.gla.ac.uk
/C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=svr022.gla.scotgrid.ac.uk/Email=grid-certificate@physics.gla.ac.uk
/C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=svr023.gla.scotgrid.ac.uk/emailAddress=grid-certificate@physics.gla.ac.uk
/C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=svr023.gla.scotgrid.ac.uk/Email=grid-certificate@physics.gla.ac.uk

(compare emailAddress with Email)
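
A quick way to check the change has landed on each WMS, and to restart the services afterwards (a minimal sketch; the grep just counts the four DN lines above):

grep -c 'CN=svr02[23].gla.scotgrid.ac.uk' /opt/glite/etc/LB-super-users   # expect 4
service gLite restart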

Anyway, with these changes made (and a service gLite restart), the WMSs will now purge job output:

-bash-3.00$ glite-wms-job-output https://svr022.gla.scotgrid.ac.uk:9000/adlQbeXjpyURB3qpt-NQAA

Connecting to the service https://svr023.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server

================================================================================
JOB GET OUTPUT OUTCOME

Output sandbox files for the job:
https://svr022.gla.scotgrid.ac.uk:9000/adlQbeXjpyURB3qpt-NQAA
have been successfully retrieved and stored in the directory:
/tmp/jobOutput/mkenyon_adlQbeXjpyURB3qpt-NQAA
================================================================================

Wednesday, January 21, 2009

New Durham Cluster Provides 1M SI2k


After a few teething problems with power and cooling, the new Durham cluster has finally passed the acceptance testing phase of the tender and is in full operation.

The new cluster now provides 1 million SI2k - more than a factor of 10 increase over the old cluster! CPU usage by the pheno VO has increased and we have seen jobs from a number of VOs (atlas, lhcb, cms, biomed, ngs, snemo, etc.), though work still has to be done to ensure atlas and lhcb production jobs run successfully.

The new cluster consists of 3 new front-end machines and 84 new worker nodes. Using twin servers, two machines can be packed into a single 1U chassis, providing a great deal of CPU power in a small area. A total of 672 job slots are available to provide the 1 MSI2k, with each worker node consisting of:

* Dual processor, quad core providing 8 cores per machine.
* Low-power Xeon L5430 for greater power efficiency and lower running costs.
* 16GB RAM per machine, providing 2GB per core.
* Dual bonded gigabit ethernet
* 0.5TB Hard Disk
* Installed with the Scientific Linux 4.7 OS

The cluster also includes 3 disk servers, providing a total of approximately 30TB of usable grid storage.

The management and functionality of the cluster have also improved dramatically, with many front-end machines moved to virtual machines. More information on this will follow in a separate blog post.

Tuesday, January 20, 2009

Development / PreProd : The CE

So on with the show and the creation of a development/preprod cluster. Next up: the CE on dev machine 10.

Step 1. As before, this involved a quick search through cfagent.conf. An additional entry was added to the ce groups stanza to bring in the dev CE, and cfagent -qv was used to pull down and install the node.
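
For the record, the sort of change is sketched below; this is a hypothetical cfengine 2 snippet, and the class and host names are illustrative rather than a copy of our actual cfagent.conf.

groups:
   # add the dev CE to the hosts that get the CE configuration
   ce = ( svr021 devmachine10 )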

The first yum update, which updated all the certificates, was successful. However, as with the WMS install, it looks like glite-yaim-core was again not successful first time around:

/opt/glite/yaim/bin/yaim: No such file or directory

A second run of cfengine caused the error below.

Transaction Check Error: file /usr/share/java/jaf.jar conflicts between attempted installs of geronimo-jaf-1.0.2-api-1.2-11.jpp5 and sun-jaf-1.1-3jpp
file /usr/share/java/jaf_api.jar conflicts between attempted installs of geronimo-jaf-1.0.2-api-1.2-11.jpp5 and sun-jaf-1.1-3jpp

This is a known issue and the fix is to disable the jpackage17 repo like so:

/usr/bin/yum -y install lcg-CE glite-TORQUE_utils --disablerepo=jpackage17-generic

This allowed a full installation of the relevant packages. The command was run by hand in the first instance and then added to the development configuration file.

Step 2. Running yaim by hand for the CE and then adding additional stanzas into the development configuration file. So installing the CE: /opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n lcg-CE -n TORQUE_utils

There appeared to be various quirks/warnings/errors with this. Here is a short summary.

There appeared to be an issue with line 36 of the config_apel_pbs function. After inspecting the file, it appears that a line continuation was not behaving as expected:

/opt/glite/yaim/functions/config_apel_pbs: line 36: APEL_DB_PASSWORD: command not found

This is a known bug (https://savannah.cern.ch/bugs/index.php?39014) and was temporarily fixed by removing the additional whitespace around the line continuation.
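
To illustrate the failure mode - this is not the real config_apel_pbs code, just a minimal bash sketch of the same effect - a stray space after the backslash means it escapes the space rather than the newline, so the shell runs the next line as a command of its own:

# Illustration only: build and run a two-line snippet where the backslash is
# followed by a space, so the continuation is broken and the second line is
# executed separately, giving "APEL_DB_PASSWORD: command not found".
printf 'echo building APEL config \\ \nAPEL_DB_PASSWORD "$APEL_DB_PASSWORD"\n' > /tmp/continuation-demo.sh
bash /tmp/continuation-demo.sh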

These errors appeared next but did not seem to cause any problems.

/sbin/ldconfig: /opt/glite/lib/libvomsc_gcc32dbg.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsapi_gcc32.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsc_gcc32pthr.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsapi_gcc32dbg.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsc.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsapi_gcc32pthr.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsapi_gcc32dbgpthr.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsc_gcc32dbgpthr.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsc_gcc32.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsapi.so.0 is not a symbolic link
/sbin/ldconfig: /opt/glite/lib/libvomsapi_nog.so.0 is not a symbolic link


This is due to a known bug: https://savannah.cern.ch/bugs/?42481

INFO: Now creating the grid-mapfile - this may take a few minutes...
voms search(https://voms.gridpp.ac.uk:8443/voms/supernemo.vo.eu-egee.org/Role=lcgadmin/services/VOMSCompatibility?method=getGridmapUsers): /voms/supernemo.vo.eu-egee.org/Role=lcgadmin/services/VOMSCompatibility

voms search(https://voms.gridpp.ac.uk:8443/voms/ukqcd.vo.gridpp.ac.uk/Role=lcgadmin/services/VOMSCompatibility?method=getGridmapUsers): /voms/ukqcd.vo.gridpp.ac.uk/Role=lcgadmin/services/VOMSCompatibility

Exit with error(s) (code=2)


WARNING: It looks like /opt/globus/tmp/gram_job_state may not be on a local filesystem. WARNING: The test for local file systems is not 100% reliable. Ignore the below if this is a false positive.
WARNING: The jobmanager requires state dir to be on a local filesystem
WARNING: Rerun the jobmanager setup script with the -state-dir= option.Creating state file directory.
Done.

find-fork-tools: WARNING: "Cannot locate mpiexec"
find-fork-tools: WARNING: "Cannot locate mpirun"

find-lcgpbs-tools: WARNING: "Cannot locate mpirun"
checking for mpirun... no

Any clues to these errors would be greatly appreciated too.

Currently, when a new CE is configured, yaim attempts to run the function config_gip_vo_tag. This function attempts to create and change permissions on the VO tags directory. However, on the ScotGrid cluster this directory is NFS-mounted and comes pre-configured, so to speak.

INFO: Executing function: config_gip_vo_tag
chmod: changing permissions of `/opt/edg/var/info/atlas': Operation not permitted
chmod: changing permissions of `/opt/edg/var/info/atlas/atlas.list': Operation not permitted
chmod: changing permissions of `/opt/edg/var/info/cms': Operation not permitted
ERROR: Error during the execution of function: config_gip_vo_tag
ERROR: Error during the configuration.Exiting. [FAILED]
ERROR: One of the functions returned with error without specifying it's nature !
INFO: Using locally defined function /opt/glite/yaim/functions/local/config_gip_vo_tag
cfengine controls config_gip_vo_tag. yaim function disabled.

The fix for this issue was to override config_gip_vo_tag in the yaim functions/local directory to make sure it did not try to change any of the NFS-mounted directories.
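
The override only needs to define the function as a no-op; a minimal sketch of the sort of thing dropped into /opt/glite/yaim/functions/local/config_gip_vo_tag (the message matches the yaim output shown above) would be:

config_gip_vo_tag () {
   # The VO tag directories are NFS-mounted and managed elsewhere,
   # so do nothing here rather than attempting the chmod.
   echo "cfengine controls config_gip_vo_tag. yaim function disabled."
   return 0
}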

Again the bdii issue reared its ugly head: https://savannah.cern.ch/bugs/index.php?40675

Starting glite-lb-interlogd ...chown: cannot access `/opt/bdii/var': No such file or directory
sed: can't read /opt/bdii/etc/schemas: No such file or directory

The file /opt/bdii/etc/schemas was missing. The fix is to copy the /opt/bdii/doc/schemas.example file to /opt/bdii/etc/schemas and re-run yaim.
The re-run of yaim also fixes the chown of /opt/bdii/var.
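
Applied by hand, the workaround is simply:

cp /opt/bdii/doc/schemas.example /opt/bdii/etc/schemas
# then re-run yaim, which also sorts out the chown of /opt/bdii/var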

All of the above workarounds have now been added to the cf.dev script that is included in the main cfagent.conf script. Thus all the temporary workarounds and development stanzas are kept out of the main script.

Step 3 - Testing:

First off, I thought I could use the gLite set of commands against the development cluster.

jdl extract:

Requirements = other.GlueCEUniqueID == "devmachine10.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q30m";

list-match extract:

-bash-3.00$ glite-wms-job-list-match -a hello.jdl
Connecting to the service https://devmachine9.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server

==================== glite-wms-job-list-match failure ====================
No Computing Element matching your job requirements has been found!
==========================================================================

However, after some initial tests and some thought on the subject, it became apparent that the new CE would have to be entered into the site BDII for this to work. This makes sense, since the WMS queries the site BDII for information about the queues published by each CE.
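
An easy way to confirm this (a hedged sketch - replace <site-bdii> with the site BDII host; 2170 is the usual port) is to query the site BDII directly for the new CE, which of course returns nothing until the CE has been added:

ldapsearch -x -H ldap://<site-bdii>:2170 -b o=grid \
   '(GlueCEUniqueID=devmachine10.gla.scotgrid.ac.uk*)' GlueCEUniqueID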

Therefore, without setting up another site BDII/BDII for the mini cluster, direct job submission via Globus seemed like the way to go for initial testing.

-bash-3.00$ globus-job-submit devmachine10.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs /bin/hostname -f
https://devmachine10.gla.scotgrid.ac.uk:35001/25359/1232379693/
-bash-3.00$ globus-job-status https://devmachine10.gla.scotgrid.ac.uk:35001/25359/1232379693/
PENDING
-bash-3.00$ globus-job-status https://devmachine10.gla.scotgrid.ac.uk:35001/25359/1232379693/
PENDING
-bash-3.00$ globus-job-status https://devmachine10.gla.scotgrid.ac.uk:35001/25359/1232379693/
DONE

However, when I tried to obtain the job output:

-bash-3.00$ globus-job-get-output https://devmachine10.gla.scotgrid.ac.uk:35001/25359/1232379693/

--- Nothing, Nada! Doh!

After some investigation involving the logs on the CE and on the Torque server, it became apparent that Torque was not allowing job submission from the new CE. Some more digging pointed to a file called hosts.equiv - the file that holds the whitelist of hosts Torque will accept submissions from.
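
The change itself is tiny; a sketch of what was done on the Torque server (the file location and restart command may differ between setups):

echo "devmachine10.gla.scotgrid.ac.uk" >> /etc/hosts.equiv
service pbs_server restart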

Therefore, after adding the new CE and restarting Torque:

-bash-3.00$ globus-job-run devmachine10.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs "/bin/hostname -f"
node192.beowulf.cluster

Woohoo, we have a working CE that allows submission through Globus. Now, to test the CE with WMS submission we need a dev site BDII, so I think I will install one on the dev UI and make sure we can submit through glite-wms-job-submit.

Thursday, January 15, 2009

Development / PreProd : The WMS

I thought I would continue my foray into grid middleware installations with another quick blog on the workload management system, or WMS as it's affectionately known. With the old development cluster very much moved/dead, we now have dev008 -> dev013 as a sandbox for installs and upgrades. So, after the UI install last year, the next piece of the jigsaw was the slightly more heavyweight WMS.

Step 1. A quick search of cfagent.conf returned the necessary files, links and packages to install. An additional entry was added to the wms groups stanza and cfengine was run immediately using cfagent -qv.

The first yum update did in fact update all the certificates from the lcg-CA and lcg-vomscerts package stanzas, and these were successful. It then looked like it attempted to run glite-yaim-core, which could not have been successful, as it then threw this error:

Executing script /opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n glite-WMS -n glite-LB...(timeout=0,uid=-1,gid=-1)
(Setting umask to 22)
cfengine:dev009:m/bin/yaim -c -: sh: /opt/glite/yaim/bin/yaim: No such file or directory

It looks like, on the first run after a complete re-install, it tries to run yaim without installing yaim first, or in fact running the WMS package stanzas at all. Slightly bizarre.

However, another run of cfagent -qv seemed to work, as this correctly ran through the WMS package stanzas for glite-WMS, glite-LB and rbwmsmon. Although, in the spirit of grid weirdness, there were a few warnings when it installed Condor, just to make you wonder what was going on!

Installing: condor ##################### [ 58/105]WARNING: Multiple network interfaces detected. Condor might not work
cfengine:dev009: properly until you set NETWORK_INTERFACE =
cfengine:dev009: Unable to find a valid Java installation
cfengine:dev009: Java Universe will not work properly until the JAVA
cfengine:dev009: (and JAVA_MAXHEAP_ARGUMENT) parameters are set in the configuration file!
cfengine:dev009: Condor has been installed into:
cfengine:dev009: /opt/condor-6.8.4
cfengine:dev009: In order for Condor to work properly you must set your
cfengine:dev009: CONDOR_CONFIG environment variable to point to your
cfengine:dev009: Condor configuration file:
cfengine:dev009: /opt/condor-6.8.4/etc/condor_config
cfengine:dev009: before running Condor commands/daemons.
cfengine:dev009:

After some internet searching and logging in/out I could see that CONDOR_CONFIG was actually set correctly.

dev009:~# echo $CONDOR_CONFIG
/opt/condor-c/etc/condor_config

Step 2. Install yaim by hand. Then blow dev009 away and let cfengine do the whole lot.

Before running yaim, I made sure that dev009 was included in site-info.def. I first created a file in the node directory within yaim to override the current production WMS settings, and this was fine for configuring the WMS first time around. However, on successive runs of cfagent it overwrote the site-info.def - doh! Therefore, Mike suggested having a development site-info.def which is copied over the production one each time cfagent is run. This worked a treat.
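
Something along these lines does the trick; this is a hypothetical cfengine 2 sketch, and the class name, master file location and server variable are illustrative rather than our actual configuration:

copy:

   dev_wms::

      # copy the development site-info.def over the production one on dev hosts
      /var/cfengine/masterfiles/dev/site-info.def
          dest=/opt/glite/yaim/etc/site-info.def
          mode=600
          type=checksum
          server=$(policyhost)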

So on with the show with yaim for the WMS: /opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n glite-WMS -n glite-LB

Although yaim appeared to run successfully, there were a few warnings/errors which had to be explained.

cfengine:dev009:m/bin/yaim -c -: WARNING: Only 1 pool account defined for tag 'sgm' of VO VO.PANDA.GSI.DE
cfengine:dev009:m/bin/yaim -c -: users_getspecialgroup: could not find 'sgm' user for VO BIOMED in /opt/glite/yaim/etc/users.conf
cfengine:dev009:m/bin/yaim -c -: users_getspecialprefix: could not find 'sgm' prefix for BIOMED in /opt/glite/yaim/etc/users.conf
cfengine:dev009:m/bin/yaim -c -: users_getspecialgroup: could not find 'prd' user for VO BIOMED in /opt/glite/yaim/etc/users.conf
cfengine:dev009:m/bin/yaim -c -: users_getspecialprefix: could not find 'prd' prefix for BIOMED in /opt/glite/yaim/etc/users.conf
cfengine:dev009:m/bin/yaim -c -: INFO: users_getspecialusers: could not find sgm user for VO BIOMED in /opt/glite/yaim/etc/users.conf
cfengine:dev009:m/bin/yaim -c -: ERROR: Could not determine mapping for tag 'sgm' of VO BIOMED
cfengine:dev009:m/bin/yaim -c -: INFO: users_getspecialusers: could not find prd user for VO BIOMED in /opt/glite/yaim/etc/users.conf
cfengine:dev009:m/bin/yaim -c -: ERROR: Could not determine mapping for tag 'prd' of VO BIOMED
cfengine:dev009:m/bin/yaim -c -: WARNING: No mapping found for "/biomed/Role=lcgadmin" in /tmp/yaim.vF5821
cfengine:dev009:m/bin/yaim -c -: WARNING: No mapping found for "/biomed/Role=production" in /tmp/yaim.vF5821

These are normal errors/warnings and can be explained: the Panda VO does indeed have only one sgm account, since there is only one panda account - Dan! The Biomed VO, conversely, has no sgm, prd or admin accounts and its members only run as plain users. Therefore, these messages are expected. The next warnings were slightly more worrying, since these undefined variables could be required.

cfengine:dev009:m/bin/yaim -c -: [Fri Dec 19 13:35:50 2008] [warn] PassEnv variable GLITE_WMS_WMPROXY_WEIGHTS_UPPER_LIMIT was undefined
cfengine:dev009:m/bin/yaim -c -: [Fri Dec 19 13:35:50 2008] [warn] PassEnv variable GLITE_SD_VO was undefined

After some searching on the web, it seems this is just a warning and yaim will use the default values. There also seemed to be some issue with the schema file in the WMS BDII section of yaim:

cfengine:dev009:m/bin/yaim -c -: Starting glite-lb-interlogd ...chown: cannot access `/opt/bdii/var': No such file or directory
cfengine:dev009:m/bin/yaim -c -: sed: can't read /opt/bdii/etc/schemas: No such file or directory

The file /opt/bdii/etc/schemas was missing. The fix is to copy the /opt/bdii/doc/schemas.example file to /opt/bdii/etc/schemas and re-run yaim.
The re-run of yaim also fixes the chown of /opt/bdii/var.

Step 3. Testing from dev008, the development UI, yielded:

*************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://dev009.gla.scotgrid.ac.uk:9000/d0nTvt6udraqpjs0Mx-eOw
Current Status: Done (Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: svr021.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q30m
Submitted: Thu Jan 15 17:19:59 2009 GMT
*************************************************************
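
For reference, the submission and status check from the dev UI were along these lines (the JDL name is illustrative; the WMProxy endpoint is assumed from the LB URL in the output above):

glite-wms-job-submit -a -e https://dev009.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server hello.jdl
glite-wms-job-status https://dev009.gla.scotgrid.ac.uk:9000/d0nTvt6udraqpjs0Mx-eOw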


Therefore, I conclude: we have a working development WMS on dev009

Trac now secured

I have been playing with Trac in my spare time, off and on, and Mike had noticed quite a few sensitive files lying around in the repo. So I have secured the ScotGrid Trac instance and you are now required to sign in. Once signed in, you can browse the repo as you normally would. The user/pass can be obtained from anyone here at ScotGrid.
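
For anyone curious, one common way to do this with the standalone Trac webserver (tracd) is basic auth; the paths, port and account name below are illustrative, not necessarily our actual setup:

# create a password file, then point tracd at it
htpasswd -c /SVN/trac.htpasswd scotgrid
tracd --port 8000 --basic-auth="scotgrid,/SVN/trac.htpasswd,ScotGrid Trac" /var/trac/scotgrid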

now to get it externally accessible!

Wednesday, January 14, 2009

DNS goes wibble wobble...

Various funny things were happening today:
  • General sickness in the atlas pilot factory.
  • Quite a few BDII dropouts.
  • SAM test failures from the above.
  • Sluggish clients on our UIs.
  • Very slow logins from CERN.
All things that pointed towards slow or failing DNS. A little test script doing 30 forward/reverse DNS queries took 20-50s on some servers and 0.5s on others.
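
The script was nothing fancy; a minimal sketch of the kind of test (the hostname is one of our internal worker node names, and the real script may have differed slightly):

#!/bin/bash
# Time 30 forward lookups plus the matching reverse lookups against
# whatever resolver this host is configured to use.
time for i in $(seq 1 30); do
   ip=$(host node192.beowulf.cluster | awk '/has address/ {print $4}')
   host "$ip" > /dev/null
done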

The slow ones had been configured to look at a dnsmasq cache on our headnode, which for unknown reasons was going very slowly (even a restart did not help).

I reconfigured to take out the cache and suddenly all was rosy again across the cluster.

Curiously we had added the cache to overcome problems with campus DNS in the first place.

At least with things configured via cfengine this is a very easy change to make right across the cluster.

Tuesday, January 13, 2009

Source Control with Subversion Resurrection

With forthcoming SA4 tasks to update the ScotGrid website and create a local registry of user information, we have decided to resurrect the use of Subversion as a source control system.

As we all know, source control is basic software development practice and a jolly good idea for anything a team of developers wants to amend on a regular basis. So here it is, being used once more on grid01. The current repo is located at /SVN.

The new project for the ScotGrid website is scotgrid_www and can be checked out with the usual svn co svn+ssh command or any other svn GUI tool. We plan to use this repo for all fabric management scripts. There is already a scotgrid repo which contains similar scripts, but it is very out of date and will probably be blown away at some point.
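
For example, a checkout over svn+ssh looks something like this (adjust the username and host as appropriate):

svn co svn+ssh://grid01/SVN/scotgrid_www scotgrid_www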

Trac has been installed on grid01 (only accessible from the internal network at present - more to follow on this one) and is running using the simple Trac standalone webserver. I have written a simple daemon to start/stop the service.

This gives a nice HTTP interface to the repo for viewing files and tracking changes. There are also wiki, project management and ticket management facilities.

I will look at getting backups of the repo automated in some way.