Friday, December 19, 2008

Confessions of a Data Management Systems Manager

After my cunningly timed arrival at Glasgow, barely two weeks before the start of Christmas Break (actually, I suspect they call it "Winter Break" now, although "Io Saturnalia!" would be both more fitting and more amusing), I've tried to hit the ground moving at a vaguely speedy pace on Storage / Data managementy things.

So, as the new Andrew Elwell, here's what I've managed to do so far:

dpm-sql-usage-by-vo-user
Partly as a means of getting myself better acquainted with the arcane mysteries of the DPM, I wrote this useful little tool, which produces pretty-printed output of all the storage used on a DPM, broken down by VO and by the users within each VO.
Greig and I are planning to stick it in the next release of his DPM Admin Tools package, but anyone who wants a beta release can have it if they ask.
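
Under the hood it is essentially a join across the DPNS database tables. The following is not the tool itself, just a minimal sketch of the flavour of query involved, assuming the standard DPNS schema (the cns_db database with its Cns_file_metadata, Cns_userinfo and Cns_groupinfo tables):

-- sketch only: total file size per group (VO) and per user, in GB
-- table/column names assume the standard DPNS schema in cns_db
SELECT g.groupname AS vo,
       u.username  AS user_dn,
       SUM(m.filesize)/POW(1024,3) AS used_GB
FROM Cns_file_metadata m
JOIN Cns_groupinfo g ON g.gid = m.gid
JOIN Cns_userinfo  u ON u.userid = m.owner_uid
GROUP BY g.groupname, u.username
ORDER BY vo, used_GB DESC;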


DPM performance & xrootd
After the series of ATLAS Analysis Challenges made it increasingly clear that DPM can't sustain an effective event rate greater than about 12 Hz at any of the sites in the challenge, we decided this was worth some investigation. (Interestingly, Tokyo's cluster seems to be capable of getting up to 24 Hz with DPM.)
At this rate, the DPM head node maxes out CPU, but the network rates from the head node and the pools are very low.

The DPM logs at Glasgow suggest that the majority of the DPM's time is spent doing X509 authentication on each get request: each authentication takes around 1.5 seconds, and we need two per request (one on the DPM head node and one on the disk pool), so authentication alone accounts for most of the time spent transferring small files like the AODs (about 30 MB each).

We thought, therefore, that we'd try disabling X509 auth on the Glasgow DPM and getting another Challenge sent to us. This involves setting some fairly dangerous trust options in shift.conf on all the DPM nodes, which we did, and it seemed to work, with a noticeable speed increase for rfcp run directly on a node.
For some reason, though, the ganga jobs in the Analysis Challenge did this:

which is clearly not expected.
We're still not sure why running the DPM in "no X509", trusted mode breaks ganga-submitted jobs in this way - it didn't break any ATLAS Production jobs, and rfcp and lcg-cp both worked when we tested them. In any case, we undid these changes sharpish...
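
For the record, the shift.conf change was along these lines; this is a sketch from memory only, the host names are illustrative, and it is emphatically not a recommended configuration:

# /etc/shift.conf on the head node and disk servers: hosts listed after TRUST
# bypass X509 authentication for the DPM, DPNS and rfio daemons (host names illustrative)
DPM TRUST svr018.gla.scotgrid.ac.uk disk037.gla.scotgrid.ac.uk
DPNS TRUST svr018.gla.scotgrid.ac.uk disk037.gla.scotgrid.ac.uk
RFIOD TRUST svr018.gla.scotgrid.ac.uk disk037.gla.scotgrid.ac.uk
RFIOD RTRUST svr018.gla.scotgrid.ac.uk disk037.gla.scotgrid.ac.uk
RFIOD WTRUST svr018.gla.scotgrid.ac.uk disk037.gla.scotgrid.ac.uk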

The next avenue for testing is transfer protocols other than rfio. Luckily, we have a "spare" DPM, svr025, to which I've added xroot support (thanks to some help from Greig), and I'll be using it to test the benefits and efficiencies of the various DPM plugins versus rfio. Next year, we'll see how I've gotten on...
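
The comparison itself will be simple enough; something like the following (file path hypothetical, svr025 being the xrootd-enabled head node) gives a crude per-protocol timing:

# crude rfio vs xroot timing for the same file
# (file path is hypothetical; DPM_HOST/DPNS_HOST assumed to point at svr025)
time rfcp /dpm/gla.scotgrid.ac.uk/home/atlas/test/AOD.pool.root /tmp/aod.rfio.root
time xrdcp root://svr025.gla.scotgrid.ac.uk//dpm/gla.scotgrid.ac.uk/home/atlas/test/AOD.pool.root /tmp/aod.xrd.root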

Thursday, December 18, 2008

Am I seeing double site bdii?

With the imminent move of the development rack we need to get some of the important grid infrastructure out of the current dev rack and into a permanent production home in the ClusterVision racks. To minimise site downtime we would like to create a temporary ScotGrid site BDII on svr027 (currently unused). So here goes.....

Running cfagent -qv on svr027 worked successfully through the files, links, editfiles, packages (including the correct glite-BDII packages) and copy sections.

All was going well until YAIM.

Following the pattern of my notes from configuring the UI, running YAIM for a site BDII node will configure it:

/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n BDII_site

this caused the following errors:

cfengine:svr027:m/bin/yaim -c -: INFO: Executing function: config_edgusers
cfengine:svr027:m/bin/yaim -c -: chown: cannot access `/opt/bdii/var': No such file or directory
cfengine:svr027:m/bin/yaim -c -: sed: can't read /opt/bdii/etc/schemas: No such file or directory
cfengine:svr027:m/bin/yaim -c -: INFO: Executing function: config_bdii_only
cfengine:svr027:m/bin/yaim -c -: Stopping BDII [FAILED]
cfengine:svr027:m/bin/yaim -c -: Starting BDII [ OK ]

These errors were slightly puzzling, but then I realised that I had not changed anything in the site-info.def.
So I changed the SITE_BDII_HOST parameter from this:

SITE_BDII_HOST=svr030.$MY_DOMAIN

to this:

SITE_BDII_HOST="svr030.$MY_DOMAIN svr027.$MY_DOMAIN"

and re-ran /opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n BDII_site

This time the only error was:

sed: can't read /opt/bdii/etc/schemas: No such file or directory

but the configurator still produced:

INFO: Configuration Complete. [ OK ]
INFO: YAIM terminated successfully.

checking the /opt/bdii/etc on svr027 I had this:

svr027:/opt/bdii/etc# ls -la
total 64
drwxr-xr-x 2 edguser edguser 4096 Dec 17 16:17 .
drwxr-xr-x 6 root root 4096 Dec 17 15:55 ..
-rw-r----- 1 edguser edguser 503 Dec 17 16:17 bdii.conf
-rw-r--r-- 1 edguser edguser 2535 Oct 13 13:54 BDII.schema
-rw-r--r-- 1 edguser edguser 50 Oct 13 13:54 bdii-update.conf
-rw-r--r-- 1 edguser edguser 634 Oct 13 13:54 DB_CONFIG
-rw-r--r-- 1 edguser edguser 246 Oct 13 13:54 default.ldif
-rw-r--r-- 1 edguser edguser 1783 Oct 13 13:54 glue-slapd.conf

checking this against svr030 I had this:

svr030:/opt/bdii/etc# ls -la
total 48
drwxr-xr-x 2 edguser edguser 4096 Oct 8 10:35 .
drwxr-xr-x 6 root root 4096 Feb 10 2008 ..
-rw-r--r-- 1 edguser edguser 364 Oct 8 10:35 bdii.conf
-rw-r--r-- 1 edguser edguser 50 Feb 10 2008 bdii-update.conf
-rw-r--r-- 1 edguser edguser 377 Feb 10 2008 indexes
-rw-r--r-- 1 edguser edguser 268 Oct 8 10:35 schemas

very different!

I then decided to reboot and try again from scratch, just to make sure there was nothing hanging around from the previous failure.
When I reinstalled everything in the same way, the file structure still appeared different, so I decided to test the site-level BDII to see if it actually worked.

svr027:/opt/glite/yaim/etc# ldapsearch -xLLL -b mds-vo-name=UKI-SCOTGRID-GLASGOW,o=grid -p 2170 -h svr027.gla.scotgrid.ac.uk > svr027.txt
svr027:/opt/glite/yaim/etc# ldapsearch -xLLL -b mds-vo-name=UKI-SCOTGRID-GLASGOW,o=grid -p 2170 -h svr030.gla.scotgrid.ac.uk > svr030.txt


The outputs were then compared: cat svr027.txt | sort > ldapsvr027.txt;cat svr030.txt | sort > ldapsvr030.txt;diff -y ldapsvr027.txt ldapsvr030.txt | grep '>' | grep '.gla.scotgrid'

Comparing the two ldapsearch outputs made it apparent that something was missing: svr027 was not publishing some of the servers. After a quick discussion with Sam we found the file /opt/glite/etc/gip/site-urls.conf and noticed the differences: the DPM2 and BDII_TOP entries, i.e. svr025 and svr019.

svr027:/opt/glite/etc/gip# cat site-urls.conf
CE ldap://svr021.gla.scotgrid.ac.uk:2170/mds-vo-name=resource,o=grid
CE2 ldap://svr026.gla.scotgrid.ac.uk:2170/mds-vo-name=resource,o=grid
DPM ldap://svr018.gla.scotgrid.ac.uk:2170/mds-vo-name=resource,o=grid
WMS ldap://svr022.gla.scotgrid.ac.uk:2170/mds-vo-name=resource,o=grid
WMS2 ldap://svr023.gla.scotgrid.ac.uk:2170/mds-vo-name=resource,o=grid
BDII ldap://svr027.gla.scotgrid.ac.uk:2170/mds-vo-name=resource,o=grid
VOBOX ldap://svr024.gla.scotgrid.ac.uk:2170/mds-vo-name=resource,o=grid

svr030:/opt/glite/etc/gip# cat site-urls.conf
CE ldap://svr021.gla.scotgrid.ac.uk:2170/mds-vo-name=resource,o=grid
CE2 ldap://svr026.gla.scotgrid.ac.uk:2170/mds-vo-name=resource,o=grid
DPM ldap://svr018.gla.scotgrid.ac.uk:2170/mds-vo-name=resource,o=grid
DPM2 ldap://svr025.gla.scotgrid.ac.uk:2170/mds-vo-name=resource,o=grid
WMS ldap://svr022.gla.scotgrid.ac.uk:2170/mds-vo-name=resource,o=grid
WMS2 ldap://svr023.gla.scotgrid.ac.uk:2170/mds-vo-name=resource,o=grid
BDII ldap://svr030.gla.scotgrid.ac.uk:2170/mds-vo-name=resource,o=grid
BDII_TOP ldap://svr019.gla.scotgrid.ac.uk:2170/mds-vo-name=resource,o=grid
VOBOX ldap://svr024.gla.scotgrid.ac.uk:2170/mds-vo-name=resource,o=grid

After updating svr027's site-urls.conf and restarting the BDII (/etc/init.d/bdii restart), we now have an operational site BDII on svr027.

The question is, should these additional entries be in /var/cfengine/inputs/skel/yaim/services/glite-bdii?
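
If they should, then (going by the usual YAIM convention for the site BDII services file, so treat this as a sketch rather than gospel) the extra entries would look something like:

# sketch: YAIM builds site-urls.conf from BDII_REGIONS plus one BDII_<region>_URL per region
BDII_REGIONS="CE CE2 DPM DPM2 WMS WMS2 BDII BDII_TOP VOBOX"
BDII_DPM2_URL="ldap://svr025.gla.scotgrid.ac.uk:2170/mds-vo-name=resource,o=grid"
BDII_BDII_TOP_URL="ldap://svr019.gla.scotgrid.ac.uk:2170/mds-vo-name=resource,o=grid"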

On with the move!

Update: svr027 is currently the only SITE_BDII in the GOC DB

Wednesday, December 17, 2008

Analysis Challenge: Round 4

We re-ran the analysis challenge yesterday with a better MySQL setup, so that the higher number of dpns threads could get DB connections. However, the results were much the same as before, and the conclusion seems to be that X509 sucks - it's killing the headnode with all of the simultaneous authentications.
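
For the record, the "better MySQL setup" amounts to little more than raising the connection limit on the head node, roughly as below (the value shown is illustrative, not necessarily the one we used):

# /etc/my.cnf on the DPM head node: allow enough connections for all the dpm/dpns threads
# (200 is an illustrative value)
[mysqld]
max_connections = 200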

We hope to prove later on that the X509 overhead really is the problem, and then think about what we can do about it...

Tuesday, December 16, 2008

Development / PreProd : The UI

I thought my first foray into grid middleware installations deserved a quick blog so here goes. Apologies in advance if I am covering old ground.

With grid02 now defunct and dev008 very much alive and kicking, it was time to install the required packages/middleware and configure it to run as a UI.

The first thing for me was to understand/create a cfagent script for the new host. After much deliberation about keeping it all separate and out of the way of the main production script, I decided to add it into the main script to save duplication. Perhaps something to think about for the future would be to split this up into much smaller modules per host and import a few common modules, although at this stage I am inclined to go with the old adage, "if it ain't broke, don't fix it". I have also heard/read much about Puppet, which covers similar ground to cfengine with extra bells and whistles. Perhaps something to look at? Anyway, on with the install.

Once the script was created I ran cfagent -qv. However, beginner's luck was thin on the ground and it failed to install the packages properly first time around.

First off there were missing dependencies:

cfengine:dev008: --> Processing Dependency: perl(URI::URL) for package: perl-libwww-perl
cfengine:dev008: Error: Missing Dependency: log4cpp >= 1.0 is needed by package glite-ce-cream-client-api-c
cfengine:dev008: Error: Missing Dependency: liblog4cpp.so.4 is needed by package glite-ce-cream-cli

The fix was to include the DAG repo on dev008 to pull in a later version of log4cpp.
However, there were some issues surrounding this, as the UI is i386 while the grid machines we have are generally 64-bit.
So the DAG repo URL in /etc/yum.repos.d/dag.repo had to be fudged to change the $basearch variable to i386.
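
The fudged repo file ends up looking roughly like this; the baseurl follows the usual DAG mirror layout and is an assumption here, the point being simply that the architecture is pinned to i386 rather than $basearch:

# /etc/yum.repos.d/dag.repo with the architecture pinned to i386 (baseurl assumed)
[dag]
name=DAG RPM Repository for EL4 i386
baseurl=http://apt.sw.be/redhat/el4/en/i386/dag
enabled=1
gpgcheck=0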

After a yum clean all I ran cfagent -qv again. This resulted in a second error:


cfengine:dev008: Transaction Check Error: file /usr/share/java/jaf.jar conflicts between attempted installs of geronimo-jaf-1.0.2-api-1.2-11.jpp5 and sun-jaf-1.1-3jpp
cfengine:dev008: file /usr/share/java/jaf_api.jar conflicts between attempted installs of geronimo-jaf-1.0.2-api-1.2-11.jpp5 and sun-jaf-1.1-3jpp


This was a known error with the middleware install and the fix was to run yum install glite-UI --disablerepo=jpackage17-generic

After a third run of cfagent -qv it was good to go, or so I thought. What I did see was that it was running YAIM and failing, so I opted to run YAIM manually. Using the normal UI command, /opt/glite/yaim/bin/yaim -c -s ../etc/site-info.def -n UI, I got the following error:


INFO: Executing function: config_workload_manager_client_setenv
INFO: Executing function: config_workload_manager_client
ERROR: RB_HOST is not set
ERROR: One of the functions returned with error without specifying it's nature !


A quick cat of the site-info.def confirmed that RB_HOST is indeed commented out, presumably because the WMS entries are there instead.


WMS_HOST="svr022.$MY_DOMAIN svr023.$MY_DOMAIN"
LB_HOST="svr022.$MY_DOMAIN svr023.$MY_DOMAIN"
#RB_HOST=svr023.$MY_DOMAIN


I managed to amend the local site-info.def (before cfagent set it back to the original value) and this allowed YAIM to get further. After reading around, I opted for the config below, as it appeared that you can have both WMS_HOST and RB_HOST defined in the one file. Perhaps a WMS install will not like this setting? We will have to see.


WMS_HOST="svr022.$MY_DOMAIN svr023.$MY_DOMAIN"
LB_HOST="svr022.$MY_DOMAIN svr023.$MY_DOMAIN"
RB_HOST=$WMS_HOST


Running YAIM again (/opt/glite/yaim/bin/yaim -c -s ../etc/site-info.def -n UI) now returned some errors when building the Globus core:


gpt-build ====> Changing to /etc/grid-security/vomsdir/BUILD/globus_core-4.30/
gpt-build ====> BUILDING FLAVOR gcc32
GLOBUS_LOCATION=/opt/globus; export GLOBUS_LOCATION; GLOBUS_CC=gcc; export GLOBUS_CC; /etc/grid-security/vomsdir/BUILD/globus_core-4.30//configure --with-flavor=gcc32
Dependencies Complete
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether to enable maintainer-specific portions of Makefiles... no
checking for style of include used by make... GNU
checking for gcc... no
checking for cc... no
checking for cc... no
checking for cl... no
configure: error: no acceptable C compiler found in $PATH
See `config.log' for more details.


Bizarrely, it looked like gcc is not installed by cfengine on an sl4.i386 node by default, so the fix was yum install gcc. Checking cfagent.conf, this does appear to be the case: there are lots of additional packages for sl4.x86_64 but not for i386. Should this be the case?
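
If we decide gcc ought to be managed by cfengine for the i386 nodes too, the fix would presumably be a one-liner in the packages section of cfagent.conf, along these lines (cfengine 2 syntax from memory, and the sl4_i386 class name is invented for illustration - use whatever class the config actually defines for these nodes):

packages:

  sl4_i386::

    # make sure a C compiler is present for YAIM's gpt-build step
    gcc action=install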

After another re-run of YAIM (/opt/glite/yaim/bin/yaim -c -s ../etc/site-info.def -n UI):


INFO: Configuration Complete. [ OK ]
INFO: YAIM terminated successfully.


This looked better, and after sourcing the grid environment that had just been installed (source /etc/profile.d/grid-env.sh), commands like voms-proxy-init -voms vo.scotgrid.ac.uk were successful. In fact I was able to submit a job and retrieve its output from dev008, so: installation successful. Or so I thought. I updated cfagent.conf and ran it all from cfengine.
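
The test job was nothing fancy; a generic sketch of the sort of thing submitted (JDL contents and filenames invented for illustration):

dev008:~$ cat hostname.jdl
Executable = "/bin/hostname";
StdOutput = "std.out";
StdError = "std.err";
OutputSandbox = {"std.out", "std.err"};

dev008:~$ glite-wms-job-submit -a -o jobid.txt hostname.jdl
dev008:~$ glite-wms-job-status -i jobid.txt
dev008:~$ glite-wms-job-output -i jobid.txt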

Cfengine appears to make two passes. The first-pass install works correctly: it installs the UI and configures it through YAIM. However, since some of the editfiles and links rely on the existence of a configured gLite install, they actually fail on the first pass, i.e.:


cfengine:dev008: Error while trying to link /opt/glite/bin/python2 -> /usr/bin/python32
cfengine:dev008: Error while trying to link /opt/glite/bin/grid-proxy-init -> voms-proxy-init
cfengine:dev008: Error while trying to link /opt/glite/bin/grid-proxy-info -> voms-proxy-info
cfengine:dev008: Couldn't stat /opt/glite/etc/glite_wmsui_cmd_var.conf - no file to edit
cfengine:dev008: Couldn't stat /opt/edg/etc/edg_wl_ui_cmd_var.conf - no file to edit
cfengine:dev008: Couldn't stat /opt/glite/etc/gaussian/glite_wms.conf - no file to edit
cfengine:dev008: Couldn't stat /opt/glite/etc/gaussian/glite_wmsui.conf - no file to edit


I had expected these to be caught on the second pass, since gLite was by then installed and configured, but that same run of cfagent -qv does not pick them up. When cfagent -qv is run again it does update the files appropriately. I'm not sure this is the behaviour we want. Does anyone remember if this happened with the original UI? dev008 is currently using all the original classes for ui and clusterui and should be running in the same way as the original UI install.

So to summarise the questions:

  1. Can you set RB_HOST and WMS_HOST in the same site-info.def?
  2. Are there lots of packages missing for an sl4.i386 install?
  3. Does anyone remember from the original UI install what happens when it updates files on the second pass?

So, a partial success; now on to a WMS.

Wednesday, December 10, 2008

Plots from the last analysis challenge




These mostly confirmed the results we saw at the end of last week's test. Load on the DPM headnode is our pressing concern - it's maxing its CPU out even at open rates of a little over 1 Hz.

Monday, December 08, 2008

Analysis Challenge: Round 3



Last week's analysis challenge at Glasgow showed extreme load and sluggishness in the DPM (see the attached plots of awfulness). Although we managed a much better event rate, we also suffered from incomplete processing, and the DPM was a clear bottleneck.

I had a chat with JPB today, who spotted the very high memory consumption of the dpm daemon - he thinks there's probably a memory leak and that this might be slowing things down. He also said it might be worth running more dpns daemons, as these also do connection authentication.

So, to get ready for tomorrow I have:
  1. Allowed core dumps for the DPM and DPNS daemons.
  2. Increased the number of threads in the DPNS daemon to 60 (this and the core dump change are sysconfig tweaks, sketched below).
  3. Restarted all the daemons.
That last operation freed up about 3GB of memory!
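
For anyone following along, items 1 and 2 are just sysconfig tweaks, roughly as follows (a sketch; check the variable names against the installed init scripts), followed by the restarts in item 3 (service dpnsdaemon restart, service dpm restart and friends):

# /etc/sysconfig/dpnsdaemon (sketch)
ALLOW_COREDUMP="yes"
NB_THREADS=60

# /etc/sysconfig/dpm (sketch)
ALLOW_COREDUMP="yes"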

If we still see problems tomorrow then at least we should have some good information for the developers to chew on.

Friday, December 05, 2008

You want processing power?



Well we've got it at Glasgow; over 2,900,000 SI2k worth now, since we commissioned our cluster extension this week.

Following a period of, ahem, rigorous stress-testing (see Graeme's posting), we started releasing nodes to the Grid on Wednesday, and our first job (an ATLAS production task) arrived almost immediately. To date, the new nodes alone have handled over 14,000 jobs.

This now means that Glasgow are currently top of the UKI leaderboard in terms of raw processing power and, according to Gridmap, only behind RAL-LCG2 for the number of job-slots available.

For my next trick, I will make 400TB of storage appear...as if by magic...

Wednesday, December 03, 2008

Farewell and godspeed! Welcome!

Last week we bid a fond farewell to Andrew, who has moved on to a job in the gLite team at CERN. He did a great power of work for us, tweaking networks, pioneering regional Nagios and generally being a smart and useful guy. Good luck to him... and we know where you live if it goes wrong :-) Doubtless we'll see you in R1 for a beer.

Andrew's replacement will be Sam, who moves along the M8 from Edinburgh to Glasgow and will start very soon (next week). We're very happy to have someone in the role who already knows grid so well.

Finally, I should bid welcome to Dug McNab, who started a few weeks ago as the ScotGrid EGEE T2 Co-ordinator. Welcome to him. Dug, among other things, has the task of teaching all the non-LHC people how to do data management properly!

All go at ScotGrid

This is a quick update to make up for the fact that we've been too busy to blog here in ScotGrid land - lack of activity in the blog rather indicates a frenzy of activity on the ground!

Glasgow:
  • The new Viglen hardware arrived, was installed and passed its acceptance test without any problems. However, we did have severe air conditioning issues in the new computer room which prevented us from actually switching on the new kit in anger (we didn't want it to cook itself!). These were cured at the end of last week, when a failover between the two chilled water pumps was installed. Since then Mike has been proving the new worker nodes in the batch system and we're on the point of bringing the new nodes online.
  • Meanwhile, in ATLAS land, I have been helping to organise the UK Distributed Analysis Challenge. This has been hammering our system with hundreds of ATLAS user analysis jobs. In the first round we had inherited a bad rfio readahead setting, so we delivered gigabytes of data to the jobs which they did not want. Second time around this was cured, but it looked like we had serious load issues on the DPM headnode and some files could not be opened by jobs. What's worrying here is that we peaked at about 110 user analysis jobs running simultaneously, yet DPM really struggled to keep up with the rate of opens - to be investigated later.
  • On the middleware front I installed a new CE (svr026) to provide redundant access to the batch system and a 'hot spare' DPM (svr025) which is there to (a) investigate peculiar client timeout errors we see with svr018 (do they repeat? initial answer seems to be no) and (b) provide a 'ready to go' DPM headnode if anything unfortunate happens to svr018.
Edinburgh:
  • ECDF has been working much better using mw05, the new SL4 SGE CE. Also, thanks to continual pressure from Phil, we nailed the last of the VSZ problems (the sgm accounts had low VSZ limits, which caused software installation to fall over in very peculiar ways). Since then ATLAS has run very well at ECDF.
  • Continuing the CE improvements, Sam and Steve hope to introduce a second ECDF CE and retire the old SL3 CE very soon.
Durham:
  • Durham's new kit (all 1 MSI2k of it) should arrive very soon now, so they will revamp the whole cluster and dump the old kit. They will be in downtime for a while as this happens. They are taking the ScotGrid lead on virtualising services, which we see as a really important step towards rapid recovery from equipment failures and lots of flexibility in deployment.
Finally, we have seen a welcome return of LHCb production jobs; had some serious gripes with biomed (I think they are disabled on all our SEs now); and seen some excellent SAM test figures for all the sites, despite generally being full to the gunnels with jobs.