Thursday, January 31, 2008

perl-TermReadKey - missing in action

We had a ticket from a Zeus user unable to get a file off our DPM, while her Zeus colleagues could. I spent a long time checking the pool accounts, which were all fine, and checking the Zeus VOMS setup, which was also fine.

Finally, I looked in the logs for the grid-mapfile, where the culprit lay:

"Can't locate Term/ReadKey.pm in @INC..."

On two of the servers, disk034 and disk036, the perl-TermReadKey RPM was missing, and the grid-mapfiles had not been rebuilt for a very long time - judging from the backups, it was October when they were last remade!

OK, nagios check: age of grid-mapfile and lcgdm-mapfile!
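
Something along these lines would do as a first cut (just a sketch, not a polished Nagios plugin - the lcgdm-mapfile path in particular is from memory, so adjust for your install):

#!/bin/sh
# Warn if either mapfile is missing or has not been regenerated recently.
MAXAGE=1440   # minutes
STATUS=0
for f in /etc/grid-security/grid-mapfile /opt/lcg/etc/lcgdm-mapfile; do
    if [ ! -f "$f" ] || [ -n "$(find "$f" -mmin +$MAXAGE)" ]; then
        echo "WARNING: $f missing or older than $MAXAGE minutes"
        STATUS=1
    fi
done
[ $STATUS -eq 0 ] && echo "OK: mapfiles fresh enough"
exit $STATUS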

Down yum! Down!

As reported previously, we discovered that the nightly yum update was enabled on the servers.

The magic cfengine snippet to disable this is:


groups:
    # we don't want auto yum update stuff
    nonightlyyum = ( `/usr/bin/test -f /var/lock/subsys/yum` )

shellcommands:
    nonightlyyum::
        "/sbin/chkconfig yum off" umask=022
        "/sbin/service yum stop" umask=022


The same check-for-an-enabled-subsys trick can be used to disable many of the periodic checks run on a standard SL install. (It's what 'service foo status' does for many of them.)
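
For the curious, a quick look at what is switched on and how the trick works (illustrative only):

# each file here corresponds to something an init script has started:
ls /var/lock/subsys/
# and for many SL services, "service foo status" is little more than
#   test -f /var/lock/subsys/foo
# which is exactly the condition the cfengine class above keys off.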

Wednesday, January 30, 2008

SRM2.2 Configuration for FDR/CCRC

I configured Glasgow's DPM for the SRM 2.2 space tokens required by ATLAS for FDR/CCRC:

svr018:~# dpm-reservespace --gspace 10T --lifetime Inf --group atlas --token_desc ATLASDATADISK
ab0f1a60-59d6-4099-82fa-a17711678860

Easy, eh!

I notice that there is no dpm-listspaces command, which means that one has to grub around inside the database to find out what spaces are currently defined.
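
In the meantime, something like this pulls the reservations straight out of MySQL (table and column names are from my memory of the 1.6.x schema, so double-check against your own dpm_db before relying on it):

svr018:~# mysql -u dpmmgr -p dpm_db -e "SELECT u_token, s_token, poolname, g_space, u_space FROM dpm_space_reserv"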

Two additional notes for other T2s:

  1. Transfers from DDM are done using a vanilla atlas proxy for now (belonging to Mario Lassnig), so make sure the token is writable by the atlas group, not, e.g., atlas/Role=production.
  2. All that is needed for CCRC is 3TB; however, this is based on a 1-week cleaning cycle. If, like Glasgow, you have lots of space, then making the area bigger means the cleaning is not critical. (The space can be updated later with dpm-updatespace - see the sketch below.)
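
A sketch of the later resize (I haven't actually needed this yet, so check dpm-updatespace --help for the exact flags - I believe it wants the token UUID that dpm-reservespace printed):

svr018:~# dpm-updatespace --space_token ab0f1a60-59d6-4099-82fa-a17711678860 --gspace 10T --lifetime Inf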

Tuesday, January 29, 2008

YAIM stale...

Another problem I found while configuring the new VOs was that the reconfiguration of the information system on the CE failed. Mike had run config_gip, as has been done forever, but it did nothing. So queues and access rights for the VOs were not published.

There are new YAIM functions, like config_gip_ce, as well as a whole different way of structuring site-info.def. I fiddled with various options in the site-info.def file and even in the YAIM function itself, but I couldn't get it to work properly at all.
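
For the record, the invocation for running a single function with the newer YAIM is supposed to be something like this (paths assume the standard /opt/glite layout, and your site-info.def may live elsewhere):

/opt/glite/yaim/bin/yaim -r -s /opt/glite/yaim/etc/site-info.def -n lcg-CE -f config_gip_ce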

In some desperation I hacked the ldif file in the end, which is not a long-term option; we really need to look now at the whole way we structure YAIM for the site (as well as plan migrations to SL4 for the BDII, CE and DPM...).

OPS goes dark...

After enabling other VOs yesterday, DPM broke for ops. The error messages were as cryptic as ever:

httpg://svr018.gla.scotgrid.ac.uk:8443/srm/managerv1: Unknown error

And in the SRM logs:

01/29 12:29:02 14830,3 srmv1: SRM02 - soap_serve error : Can't get req uniqueid
01/29 12:05:18 14830,0 srmv1: SRM02 - soap_serve error : CGSI-gSOAP: Could not find mapping for: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/CN=582979/CN=Judit Novak

The error seemed to coincide with re-running config_mkgridmap yesterday; however, as Judit (and other ops people) were in the grid-mapfile, I was very confused.

Eventually, staring at lcgdm-mkgridmap.conf, I realised that the ops VO was only configured to get VOMS information from the deprecated lcg-voms.cern.ch server. I reconfigured it to get the information from voms.cern.ch and it started to work.
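
The relevant lines in lcgdm-mkgridmap.conf are of roughly this form (syntax from memory of the edg-mkgridmap style, so treat it as illustrative rather than exact):

# before: ops members fetched only from the deprecated server
group vomss://lcg-voms.cern.ch:8443/voms/ops ops
# after: fetch them from voms.cern.ch instead
group vomss://voms.cern.ch:8443/voms/ops ops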

The thing I cannot fathom is how it kept working for so long - we have always had lcg-voms.cern.ch as the server for ops.

I updated the ops entry on the GridPP Wiki.

As usual, the things which made this much harder than it should have been were:

1. It only affected the ops VO (not dteam, atlas or pheno, which we can test).

2. The error message was, as usual, cryptic and unhelpful.

Keeping up with the Joneses

Well, we recently had an incident with our NFS server for the cluster (home / software) locking up and needing a cold power cycle. Due to $vendor's setup this takes aaaages (on the order of 20 mins) to go through the BIOS self-check (it hangs at 053C). $vendor would like to poke around the system and perhaps perform a BIOS upgrade. Hmm. Oh well, all 10 disk servers are identical, so we'll just drain one down and play - it also gives us a chance to upgrade (from 1.6.5) to the latest 1.6.7-mumble DPM.


... or so we thought.

disk032:~# rpm -qa | grep DPM
DPM-gridftp-server-1.6.7-1sec
DPM-rfio-server-1.6.7-2sec.slc3
DPM-client-1.6.7-2sec.slc3


"Thats odd - Graeme have you updated these?" nope - Turns out that yum.nightly cron was auto updating on both the disk servers and some of the grid servers... Gaaah. clickity click and we're all ready to play.

In the meantime, dpm-drain migrated most of the data off the server to the other stash of disks, but there were still 69 files that failed with 'Internal error' - I'm looking through the DB to try and see if I can pull any more info out.
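
The sort of thing I'm running against the name server database to find the stragglers (table and column names are from memory of the DPNS schema, and the hostname match depends on how your replicas are registered):

svr018:~# mysql -u dpmmgr -p cns_db -e "SELECT sfn, status FROM Cns_file_replica WHERE host LIKE 'disk032%'"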

Monday, January 28, 2008

Take Me To Your VO...

As noted previously, we have enabled mice, vo.scotgrid.ac.uk and vo.nanocmos.ac.uk at Glasgow. Mike worked his way through the documentation, which was a bit out of date because it hadn't incorporated the changes in VO pool account management (which has become much better) that Andrew implemented.

We also plan to set up VOs much more promptly now for local user groups. Now that we have a scotgrid VOMS server this will be much easier.

The wiki has all the updated details.

Resource Broker Blues

Our resource broker was down for the weekend as the network server stalled. The root cause turned out to be a bit of over-aggressive cleaning from cfengine. I had wanted to do a better job of cleaning up the /tmp area in the cluster - each worker node had hundreds of condor_g working directories lying around, with nothing in them. cfengine's "tidy" leaves directories alone by default and only cleans files. So I enabled the "rmdirs=sub" option - it works beautifully and gets rid of all the cruft in /tmp. So pleased was I that I disengaged my brain and set this option on for /home as well - good idea to clean up those old gass cache areas, isn't it? Well, almost - unfortunately /home has subdirectories which are the node pool account home areas, and unused pool accounts fall into the "clean me up" category. All the untouched pool areas then vanished.
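
For the record, the sort of tidy stanza involved looks roughly like this (cfengine 2 syntax from memory, and the class is illustrative) - the moral being that rmdirs=sub is fine for genuinely throwaway areas like /tmp, but lethal anywhere an empty directory is meaningful, like a pool account home:

tidy:
    any::
        # scratch space: removing empty condor_g directories here is what we want
        /tmp pattern=* age=7 recurse=inf rmdirs=sub
        # do NOT set rmdirs=sub on /home - empty directories there are just
        # pool accounts that haven't been used yet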

This caused a number of people to start getting "unspecified grid manager errors" on globus-job-runs, as well as wiping out the edguser home area on the RB, which caused the network server to go into crisis.

It didn't take long to work out what had happened, but fixing it took a while as the resource broker seemed to be quite huffy afterwards.

The only plus side was that I enabled the mice, scotgrid and nanocmos VOs on the RB.

Saturday, January 05, 2008

Happy New Year ScotGrid - now with added ECDF...


Well, we didn't get it quite as a Christmas present, but the combined efforts of the scotgrid team have managed to get ECDF green for New Year.

In the week before Christmas Greig and I went through a period of intensive investigation as to why normal jobs would run but SAM jobs would not. Finding that jobs which fork a lot, like SAM jobs, would fail was the first clue. However, it turned out to be not a fork or process limit but a limit on virtual memory size that was the problem. SGE can set a VSZ limit on jobs, and the ECDF team have set this to 2GB, which is the amount of memory they have per core. Alas, for jobs which fork, virtual memory is a huge over-estimate of their actual memory usage (my 100-child python fork job registers ~2.4GB of virtual memory, but uses only 60MB of resident memory). That's roughly a 40-fold over-estimate of memory usage!
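
For anyone wanting to check their own batch system, the SGE knob involved and a quick way to eyeball VSZ against RSS for a job look something like this (queue name and PID are placeholders, and this is a sketch rather than the exact python test I used):

# per-slot virtual memory limit on a queue (2G at ECDF):
qconf -sq <queue> | grep -E 's_vmem|h_vmem'
# virtual vs resident size for the immediate children of a running job:
ps -o pid,vsz,rss,args --ppid <job pid>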

As SAM jobs do fork a lot, they hit this 2GB limit and are killed by the batch system, leading to the failures we were plagued by.

A workaround, suggested by the systems team, was to submit ops jobs to the ngs queue, which is a special short-running test queue (15 min wall time) with no VSZ limit on it.

Greig modified the information system to publish the ngs queue and ops jobs started to be submitted to this queue on the last day before the holidays.

Alas, this was not quite enough to get us running. We didn't find out until after New Year that we also needed to specify a run time limit of 15 minutes on the jobs and submit them to a non-standard project. The last step required me to hack the job manager in a frightful manner, as I really couldn't fathom how the perl (yuk!) job manager was supposed to set the project - in fact, even though project methods existed, they didn't seem to emit anything into the job script.
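
For anyone else heading down this road, what the job manager needs to end up passing to SGE amounts to directives along these lines, whether as qsub options or as #$ lines in the generated script (the project name is ECDF-specific, so it's a placeholder here):

#$ -l h_rt=00:15:00   # 15 minute run time limit, to match the ngs queue
#$ -P <ecdf_project>  # the non-standard project the ops jobs must run under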

Finally, with that hack made this morning, ECDF started to pass SAM tests. A long time a-coming, that one.

The final question, however, is what to do about this VSZ limit. The various wrappers and accoutrements which grid jobs bring mean that before a line of user code runs there are about 10 processes running and 600MB of VSZ has been grabbed. This is proving to be a real problem for local LHCb users, because ganga forks a lot and also gets killed off. Expert opinion is that VSZ limits are just wrong.

We have a meeting with the ECDF team, I hope, in a week, and this will be our hot topic.

Big thanks go to Greig for a lot of hard work on this, as well as Steve Traylen, for getting us on the right track, and Kostas Georgiou, for advice about the perils of VSZ in SGE.