Friday, March 28, 2008

brain dead batch systems

Why oh why are some of the batch utilities so brain dead? A simple 'qstat -r' should show who's running jobs, right? Wrong, as it truncates usernames to a fixed 8 character width. Doh. So 'prdatlas' and biomed06 seem to be busy. Well, not quite: if I do a qstat -f | egrep " e(group|user) " | sort -u I see that it's actually prdatlas028 and several biomed06? users. Grr...
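For the record, a minimal sketch of the kind of one-liner that gets round the truncation by parsing the full qstat -f output instead, counting running jobs per (untruncated) username. This is an assumption about what works against a Torque qstat rather than the exact command we use:

qstat -f | awk '/euser = / {print $3}' | sort | uniq -c | sort -rn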

I may install Job Monarch from SARA, but in the meantime it'll be some hacky PHP to parse the output a bit more cleanly.

Also, despite having 493 running jobs at the moment (we're down on capacity as I'm still doing a rolling upgrade to SL4.6 and a new kernel), there are very few users on the system:

svr031:~# qstat -f | grep euser | sort -u | wc -l
14


Not good, especially if they decide to take a break.

Thursday, March 27, 2008

p p p pick up a pakiti


We've been using Pakiti at Glasgow for some time now for keeping an eye on which nodes are out of date. One minor niggle is that it doesn't keep track of the grub default kernel (i.e. what should come in on reboot) compared to the running kernel.

We already had a very simple shell script that did that:

pdsh -w node[001-140] chkkernel.sh | dshbak -c
----------------
node[001,005,007,014,016-020,022-023,025,028,031-061,063-085,087-090,092,095-096,098-101,103-104,106-107,109-110,113,115,118-120]
----------------
Running: 2.6.9-67.0.7.ELsmp, Grub: 2.6.9-67.0.7.ELsmp, Status OK
----------------
node[062,091,093-094,097,102,105,108,111-112,114,116-117,121-127,129,131,133-134,136-139]
----------------
Running: 2.6.9-55.0.9.ELsmp, Grub: 2.6.9-67.0.4.ELsmp, Status error
----------------
node[003,009,011,013,015,021,027,029]
----------------
Running: 2.6.9-55.0.12.ELsmp, Grub: 2.6.9-67.0.7.ELsmp, Status error
----------------
node[128,130,132]
----------------
Running: 2.6.9-55.0.12.ELsmp, Grub: 2.6.9-67.0.4.ELsmp, Status error
----------------
node[002,004,006,010,012,024,026,030,140]
----------------
Running: 2.6.9-55.0.9.ELsmp, Grub: 2.6.9-67.0.7.ELsmp, Status error
----------------
node[086,135]
----------------
Running: 2.6.9-67.0.4.ELsmp, Grub: 2.6.9-67.0.4.ELsmp, Status OK
----------------
node008
----------------
Running: 2.6.9-67.0.4.ELsmp, Grub: 2.6.9-67.0.7.ELsmp, Status error
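For anyone curious, a rough sketch of what a chkkernel.sh like this might look like, assuming the standard SL4 grub.conf layout (the real Glasgow script may differ in the details):

#!/bin/sh
# Compare the running kernel with the grub default kernel.
RUNNING=$(uname -r)

# Pick out the 'default' index from grub.conf, then the matching kernel line.
DEFAULT=$(awk -F= '/^default/ {print $2}' /boot/grub/grub.conf)
GRUB=$(grep '^[[:space:]]*kernel' /boot/grub/grub.conf \
        | sed -n "$((DEFAULT + 1))p" \
        | sed 's!.*vmlinuz-!!; s! .*!!')

if [ "$RUNNING" = "$GRUB" ]; then
    STATUS="OK"
else
    STATUS="error"
fi

echo "Running: $RUNNING, Grub: $GRUB, Status $STATUS"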



But I finally got it integrated into Pakiti with some patching - see http://www.scotgrid.ac.uk/wiki/index.php/Pakiti

Result: a pretty green/red status in the "default kernel" column.

The patches have been emailed to Romain, so they may well appear upstream eventually.

Wednesday, March 26, 2008

Edinburgh as an LHCb Tier-1?


I've just been accused (jokingly, I hope) of trying to turn Edinburgh into LHCb's 7th Tier-1. The attached plot shows the recent data transfers that I have been running into our dCache. The rates are good (~35MB/s), but not particularly special. However, against a background of zero, it certainly made LHCb jump up and take notice ;) Maybe this will convince them that Tier-2s really can be used for analysis jobs...

I should note that during these transfers one of the dCache pools was about to melt (see below). I've since reduced the maximum number of movers on each pool to something more reasonable. For the tests, I created a small application that spawned ~50 simultaneous lcg-cp's, all transferring files from CERN CASTOR to Edinburgh. Who needs FTS when you've got DIRAC and lcg_utils? Now all I need is someone else's proxy and I'll never be caught... ;) But, on a serious note, this does show that people can create tools to get round the official FTS channels and abuse the system, which could impact the service for other users.
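In case it's useful to anyone, the parallel-copy trick is roughly the following (the SURL prefixes here are hypothetical placeholders, and the real DIRAC-driven tool is a bit smarter than this):

#!/bin/bash
# Spawn ~50 lcg-cp transfers in parallel from CERN CASTOR to the Edinburgh SE.
# Both SURL prefixes below are made up for illustration.
SRC="srm://srm.cern.ch/castor/cern.ch/grid/lhcb/some/dataset"
DST="srm://srm.epcc.ed.ac.uk/pnfs/epcc.ed.ac.uk/data/lhcb/test"

for i in $(seq 1 50); do
    lcg-cp --vo lhcb "${SRC}/file_${i}.dst" "${DST}/file_${i}.dst" &
done
wait   # don't exit until every background transfer has finished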

The curse of 79...

Since the dawn of the Glasgow cluster we have been cursed with a low level of globus 79 errors. We did not understand these well, but always believed that they were caused by a confusion in the gatekeeper, where the X509 authentication seemed to suffer a race condition and get muddled between users.

However, since upgrading to an SL4 CE and installing it on a different machine we still get these cropping up (an example).

The GOC Wiki suggests this can be caused by firewall trouble or an incorrect GLOBUS_TCP_PORT_RANGE. Now, this is (and was) correctly defined on both machines to be the standard 20000-25000. However, I have decided to change it to 50000-55000 in case we are tripping some generic nasty filter somewhere else on campus.
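For reference, the change itself is a one-liner, roughly this in YAIM's site-info.def (the exact file and quoting have varied between releases, so treat this as a sketch), followed by a re-run of YAIM and a gatekeeper restart:

GLOBUS_TCP_PORT_RANGE="50000 55000"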

Since I did that, last night, we haven't had a 79 error - however, this proves nothing so far, as we can easily go for a week without one of these happening.

I also contacted the campus networking people to ask if there were any known port blocks in this range.

Data Management and MICE

I had a chat to one of our MICE PhD students a couple of weeks ago and I was explaining how to use EGEE data management (SRMs, LFCs, FTS, lcg utils, etc.). His comment afterwards was "I didn't know I was going to do a PhD in data management...".

The problem is that all these tools are very low level, so any user community has to build a higher level infrastructure on top of this. Obviously the LHC experiments have done this extensively, but it is frustrating that there is no simple generic data management service for smaller VOs who lack the resources of the larger VOs.

I wonder if this accounts for the popularity of SRB in other communities? It may have some limitations, but it clearly offers a higher level data storage, cataloging and metadata service which must be attractive for smaller user communities. Surely there is a potential project to try and tie all of the EGEE components into a sensible data management system?

Saturday, March 22, 2008

Durham - SL4 Install Success!



Durham took the plunge earlier this week and upgraded the CE, SE and all the nodes to SL4.6... with success! After our preparation was delayed slightly by a small UPS failure, we set about installing cfengine to handle the fabric management. This took a little longer than expected, but our patience has paid off: it eases the pain of setting up and configuring clusters. We use the normal Red Hat Kickstart to get a base install of SL4.6, then hand the rest of the setup to cfengine to work its magic (install extra packages, set up config files, run YAIM, etc.).

Firstly, installing a Worker Node was relatively straightforward. Then came the CE, along with Torque/PBS and the site BDII setup. Thanks to Graeme for help checking that our site was working and publishing as expected.

We unexpectedly hit a firewall issue because I had renamed the CE from the old "helmsley.dur.scotgrid.ac.uk" to "ce01.dur.scotgrid.ac.uk"... even though I had preserved the IP address. Not what I expected, but our network guys were able to fix the rules and we were operational again.

Then the SE followed very quickly afterwards, cfengine and YAIM working their magic very successfully. The procedure was as simple as 1) dump the database, 2) install SL4.6, 3) let cfengine do its stuff for a base install, 4) restore the database, 5) run YAIM. Simple!
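Spelled out, the sequence looks roughly like the following, assuming a MySQL-backed DPM SE (a sketch only; the exact database names and YAIM node type depend on the SE flavour and release):

# 1) dump the databases before the reinstall
mysqldump --opt -u root -p dpm_db > /backup/dpm_db.sql
mysqldump --opt -u root -p cns_db > /backup/cns_db.sql

# 2) kickstart SL4.6 and 3) let cfengine pull in packages and config files

# 4) recreate and restore the databases on the freshly built node
mysqladmin -u root -p create dpm_db
mysqladmin -u root -p create cns_db
mysql -u root -p dpm_db < /backup/dpm_db.sql
mysql -u root -p cns_db < /backup/cns_db.sql

# 5) run YAIM to reconfigure the middleware
/opt/glite/yaim/bin/yaim -c -s site-info.def -n SE_dpm_mysql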

The one gotcha was trying to change the NFS-mounted home directories to be local to the nodes. This fails with an error while trying to copy the globus-cache-export files. Due to time constraints we have re-enabled the NFS home dirs... but I'm sure this will be simple to fix and I'll look at it next week.

Fair shares and queue time will need reviewing, but all in all a busy and successful few days. We're passing SAM tests and I've seen Phenogrid, Atlas and Biomed running jobs. Still the UI and a disk server to do, but with cfengine in place this should be relatively straightforward and will require no downtime.

Wednesday, March 12, 2008

Another ECDF/Grid requirement mismatch.

While ECDF is, in principle, functional and capable of running jobs, this is a bit useless if no one can see that you're doing it. So, with APEL accounting still not working for the cluster, I had another look.

There were two problems:
Firstly, the SGE accounting parser was looking in the wrong directory for the SGE accounting logs. This fails silently with "no new records found", so I hadn't noticed before. The configured location actually was correct when I set the thing up, but the mount point had been moved since (as the CE is not on the same box as the SGE master, we export the SGE accounting directory over NFS to the CE so it can parse the logs), with no indication that anything was wrong.
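For the curious, the NFS arrangement is roughly the following (the hostnames and export options are made up for illustration):

# On the SGE master, /etc/exports contains something like:
#   /opt/sge/default/common   ce.example.ac.uk(ro,sync)
# and on the CE the directory is mounted where the APEL parser expects it:
mount -t nfs sgemaster.example.ac.uk:/opt/sge/default/common /opt/sge/default/common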

Secondly, after I fixed this... it turns out that the APEL java.lang.OutOfMemoryError strikes again for ECDF. The ECDF systems team configure SGE to roll its accounting logs over on a monthly basis. Unfortunately, this leads to rather large accounting files:
# ls --size /opt/sge/default/common/acc* --block-size=1M
1543 /opt/sge/default/common/accounting

(Yes, gentlemen and ladies, that is a one and a half gig accounting file... and we're only half-way through the month. The archived accounting logs tip the scales at around a quarter to half a gig compressed, but they compress rather efficiently, so the "true" size is much larger - up to 10x larger, in fact.)

I suspect the next step is to arrange some way of chopping the accounting file into bite-sized chunks that the APEL log parser is capable of swallowing.
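Something along these lines might do it, since the SGE accounting file is one record per line (the chunk size is plucked out of the air; the path is the one from the listing above):

# Split the accounting file into 100,000-record pieces for the parser to chew on.
split -l 100000 /opt/sge/default/common/accounting /tmp/accounting-chunk.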
The irony is that we already parse the accounting logs internally using a thing called ARCo - I've not seen any indication that it would be easy to get APEL to understand the resulting database, though.