Friday, August 28, 2009

multiple WMS yaim problems

I was alerted today by our all-new shiny jabber chatroom that we were publishing the same WMS twice via lcg-infosites. A quick check and there it was....

-bash-3.00$ lcg-infosites --vo camont wms
https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server
https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server
https://lcgwms03.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
https://wms00.hep.ph.ic.ac.uk:7443/glite_wms_wmproxy_server

We moved to the latest WMS 3.1 release last week and I thought the problem may have been down to that. Upon further inspection I found the following GIP plugin:

svr023:/opt/glite/etc/gip/provider# ./glite-info-provider-service-wmproxy-wrapper

which was publishing the wrong WMS endpoint:

GlueServiceEndpoint: https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server

In the file the problem was an obvious one:

export WMPROXY_HOST=svr022.gla.scotgrid.ac.uk svr023.gla.scotgrid.ac.uk

This begs the question: can YAIM deal with more than one WMS and, if so, how do you specify them? We had always gone for a quoted, space-separated list in site-info.def,
i.e.

WMS_HOST="svr022.$MY_DOMAIN svr023.$MY_DOMAIN"

but perhaps you can't do that any more and you need to override WMS_HOST in a node-specific way. Oh well.
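In the meantime, a workaround (just a sketch, assuming the wrapper only needs the local node's hostname) would be to make sure each WMS publishes itself, either by re-running YAIM on each box with only its own hostname in WMS_HOST, or by hand-correcting the exported variable in the wrapper:

# on svr023, in site-info.def before re-running YAIM...
WMS_HOST="svr023.$MY_DOMAIN"

# ...or directly in /opt/glite/etc/gip/provider/glite-info-provider-service-wmproxy-wrapper
export WMPROXY_HOST=svr023.gla.scotgrid.ac.uk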

Thursday, August 13, 2009

Database backups, and lock time

Running a service creates data. Running a service for a long time creates lots of data.

In this case, the WMS and LB servers - we're sitting with about 18GB on each LB. This is not a problem - they're well indexed against the usual queries (out of the box, no fiddling required), so the old data isn't really noticed.

Until you take a backup.

Then, in order to get a consistent backup, the database is locked for however long it takes to dump all that data. Which is about 45 minutes.

That's too long - it means there's a window when the service isn't available, and it's getting noticed. So, how can we take a backup without locking the database for so long?

There are various options for that, but the best-looking (read: simplest) one is to enable binary logging in MySQL. Because the tables used are all InnoDB, which is transactional, the backup can mark a position in the log and then use that to _not_ back up operations that came after it - which results in a consistent backup. (If you're using any MyISAM tables, which are not transactional, you can't do this - hence the use of LVM snapshotting or other exotic techniques.)

This is really simple: in the my.cnf for each service, put 'log-bin' (without the quotes) in the [mysqld] section, and restart.
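For example, the relevant section ends up looking something like this (the rest of my.cnf stays as it was):

[mysqld]
log-bin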

Binary logging is now enabled.

Next, to take a lock free [0] dump, add the --single-transaction flag to mysqldump.
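Something along these lines (the output path is just for illustration):

mysqldump --single-transaction --all-databases > /backup/lb-dump.sql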

The time taken to actually dump the data to disk won't change, but the database won't be locked for that time.

I did this for one of our LB servers and then, while the dump was running, submitted a job through the WMS. The job was assigned to the LB I was dumping, proving it can still be written to, and it has now completed while the dump still hasn't finished.

I've modified our usual backup script so that if it detects the presence of /var/lib/mysql/${hostname}-bin.index, which is the index for the binary log, it automatically uses --single-transaction. That way we still have a single backup script, but it does things the best way possible.
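The logic is roughly this (a sketch rather than our exact script; paths and filenames are illustrative):

DUMP_OPTS=""
# if the binary log index is present, we can take a lock-free dump
if [ -f "/var/lib/mysql/$(hostname -s)-bin.index" ]; then
    DUMP_OPTS="--single-transaction"
fi
mysqldump $DUMP_OPTS --all-databases > "/backup/$(hostname -s)-$(date +%F).sql"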

There are a couple of downsides to binary logging. It means the DB has to write more data to disk, so it's about 1% slower; as the services are not running at 99% of the CPU, that's OK for us. It also means that each new piece of data is stored twice - once in the DB and once in the log - so the data storage need grows twice as fast (faster, if there are deletes to the database). I'm looking at an 18GB database, so this won't be a problem. You can also purge old logs, so I don't feel this is any more of a problem than the risk of the database expanding over the partition size already is.
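Purging can be done by hand or automatically; for example (the seven-day window is just an illustration):

mysql> PURGE BINARY LOGS BEFORE NOW() - INTERVAL 7 DAY;

or, in the [mysqld] section of my.cnf:

expire_logs_days = 7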

One thing I'll be looking at is using the binary logs to take incremental backups. That still won't lock the database, but it will also be much smaller and faster to take. That's a bit more complicated to arrange, so it goes into the pile of 'ideas that look nice, but we don't think we need them yet'.
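For the record, the rough idea (a sketch only - not something we're running) is to rotate the logs and keep copies of the closed ones, which mysqlbinlog can later replay on top of the last full dump:

mysqladmin flush-logs
# keep copies of the log files (mysqlbinlog can turn them back into SQL)
rsync -a /var/lib/mysql/$(hostname -s)-bin.[0-9]* /backup/binlogs/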

As an aside, I think this has to go down as one of the more anticlimactic updates - it was simple, quick and just worked. Unless disk space is very tight, I can't see why one wouldn't enable it.

[0] Technically, it takes a lock, waits for all pending transactions to complete, marks the log position, then releases it. If you have slow operations in flight, it stays locked for the duration of those operations.

Wednesday, August 12, 2009

getting ngs.ac.uk voms to work

I have been looking into an issue with the NGS as they are testing submission through the WMS. A ticket was raised because authentication failed on both our production CEs.

This was recreated by creating an ngs.ac.uk voms proxy.

-bash-3.00$ voms-proxy-init -voms ngs.ac.uk --valid 240:00
Cannot find file or dir: /clusterhome/home/gla057/.glite/vomses
Enter GRID pass phrase:
Your identity: /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=douglas mcnab
Creating temporary proxy ................................................................ Done
Contacting voms.ngs.ac.uk:15010 [/C=UK/O=eScience/OU=Manchester/L=MC/CN=voms.ngs.ac.uk/Email=support@grid-support.ac.uk] "ngs.ac.uk" Done

Warning: voms.ngs.ac.uk:15010: The validity of this VOMS AC in your proxy is shortened to 86400 seconds!

Creating proxy ............................................................................ Done
Your proxy is valid until Thu Aug 20 15:34:41 2009


Then with a direct globus-job-run:

-bash-3.00$ globus-job-run svr021.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs "/bin/hostname -f"
GRAM Job submission failed because authentication with the remote server failed (error code 7)
-bash-3.00$ globus-job-run svr026.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs "/bin/hostname -f"
GRAM Job submission failed because data transfer to the server failed (error code 10)


After much investigation, the long and short of it is that even with the correct entries in the groupmapfile and grid-mapfile the issue still occurred. So I checked the VO certificate in /etc/grid-security/vomsdir, which was fine, although there was also an lsc file, /etc/grid-security/vomsdir/ngs.ac.uk/voms.ngs.ac.uk.lsc, which may have been getting used in preference to the VO certificate. So, to check, I removed /etc/grid-security/vomsdir/ngs.ac.uk/voms.ngs.ac.uk.lsc.

Hey presto, submission worked:

-bash-3.00$ globus-job-run svr026.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs "/bin/hostname -f"
node295.beowulf.cluster
-bash-3.00$ globus-job-run svr021.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs "/bin/hostname -f"
node295.beowulf.cluster


So I think there may be an issue with the ngs.ac.uk VO and the lsc file, even though it looked correct. An lsc file should simply contain the subject DN of the VOMS server's certificate followed by the DN of its issuing CA, so presumably one of the two lines doesn't exactly match what the server actually presents:


svr026:/etc/grid-security/vomsdir/ngs.ac.uk# cat voms.ngs.ac.uk.lsc
/C=UK/O=eScience/OU=Manchester/L=MC/CN=voms.ngs.ac.uk/Email=support@grid-support.ac.uk
/C=UK/O=eScienceCA/OU=Authority/CN=CA
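One way to check (a sketch, assuming the VOMS server certificate is also installed locally under /etc/grid-security/vomsdir, e.g. as voms.ngs.ac.uk.pem) would be to compare the lsc contents against what openssl reports for that certificate:

svr026:/etc/grid-security/vomsdir# openssl x509 -in voms.ngs.ac.uk.pem -noout -subject -issuer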


This will be an issue in the future on SL5, when VO certificates are deprecated in favour of the lsc file.

the sl5 cluster grows

With a view to a full-scale migration of Glasgow's worker nodes from SL4 to SL5 in September, we have grown our SL5 test cluster from 8 job slots to 112 job slots.

This is accessible for submission to the following queues:

dev010.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q30m
dev010.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q6h
dev010.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q1d
dev010.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q2d
dev010.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q3d
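Once you've been given access (see below), a quick sanity check along the same lines as the globus-job-run tests in the previous post should do the trick (the choice of queue here is just an example):

-bash-3.00$ globus-job-run dev010.gla.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q30m "/bin/hostname -f"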

Currently the CE is only advertising and accepting submissions from Atlas, and the queues are open to the sgm/prd/pil accounts, but I am more than happy to open them up to anyone who wishes to test. Just drop me a line and I will create a test software area so that any sgm account can install the application software on SL5, and I will allow access on the CE and batch system for running the jobs.

So far things have been positive for Atlas, with software kits now installing on SL5 and kit validation being attempted. Currently we are failing the KV tests; more precisely, they fail in the digitization phase, which then causes the reconstruction to fail.

Nightly builds continue to be run, so slowly but surely I'm sure these issues will be ironed out.