Friday, January 29, 2010

ScotGrid's shrink wrapped UI

In an effort to reduce the overhead for new external users who wish to submit to Glasgow I have created a shrink-wrapped gLite UI. This comes in the form of a slimmed-down VirtualBox SL5 image with the gLite UI pre-installed.

The hope is that users from external institutions who wish to run jobs on the EGEE grid and more specifically at ScotGrid will be able to take advantage of this. This is of particular importance for external users of Lumerical's FDTD who are primarily engineers who just want to run the software rather than install an SL5 gLite UI first. The end goal is extending this to help all our users get up and running as quickly as possible.

This will come pre-installed with Glasgow's submission tools such as gqsub and other more specific user scripts. Those wishing a link to download the VM should drop us an email.

Details of the VM image and setting up the UI are available on the wiki.

I am still at a loss as to how CERN managed to get their VirtualBox image down to 500GB!

Tuesday, January 19, 2010

pick a torque, any torque

Since our seg-faulting mom issue during our SL5 upgrade (using the 2.3.6 server & mom) I have compiled a variety of Torque versions of late and trialled them. I have now come to some conclusions and am sticking with the 2.3.* series - 2.3.9 at the moment, well, until another bug is found!

2.3 Series

2.3.6 - seg-faulting mom during some unidentifiable race condition
2.3.7 - untested
2.3.8 - Operators/Managers Lists Bug
2.3.9 - Seems stable

2.4 Series - Beta

2.4.2 - OSC MPIEXEC Bug
2.4.3 - OSC MPIEXEC Bug Fixed & Operators/Managers Lists Bug
2.4.4 - OSC MPIEXEC Bug Back in

Tuesday, January 12, 2010

Leaks caused by frozen ICE

We had a rather quiet time here over winter - a slight hiccough with a disk server, but all rather stable. Other than that, the big freeze didn't result in much.

Except for the ice freezing up, and causing leaky pipes.

That's ICE - the WMS plugin that submits to CREAM. It turns out that it can break the pipes, and leak bits of past jobs. This resulted in an error message like:

Warning - Unable to submit the job to the service: https://svr022.gla.scotgrid.ac.uk:7443/glite_wms_wmproxy_server
System load is too high:
(Not all processes could be identified, non-owned process info will not be shown, you would have to be root to see it all.)
Threshold for ICE Input JobDir jobs: 1500 => Detected value for ICE Input JobDir jobs /var/glite/ice/jobdir : 1514
from both WMSen. In principle, this is reasonable: it's saying that the WMS is loaded up, so no more jobs for the moment - a decent way of ensuring that running jobs are not harmed by new submissions, in the event the system explodes.

Except that the system load was the lowest I've seen on them, under 1.0. I dug into the underlying Condor instance, which had only a few jobs in it, and the hunt commenced for the 1500 phantom jobs.

As the error message suggests, /var/glite/ice/jobdir/old had 1514 files in it, each one representing a past job. However, most of these were old - over a month old. Given that the WMS is supposed to purge the jobs after a month (if the user doesn't do it earlier), that shouldn't have been the case.

Derek down at RAL confirmed this - it's apparently a known bug, but I can't quite see it on the gLite Known Issues page. It looks like most of the UK's WMSes fell to this at the same time. I think that's due to the increased number of CREAM CEs (so the rate of use of ICE is climbing), and the failover on the clients if one WMS is down - resulting in a nice, even distribution of failure.

In the end, the fix was simple - I moved the files older than a month out of /var/glite/ice/jobdir/old. Deletion ought to be safe, but they're tiny. I'll need to automate that until such time as the bug is fixed - but I'll also need to watch in case the usage increases further and 1500 isn't enough to last out a month of use. In that case, I think I'd probably temporarily increase the limit on the WMS (I believe it's in a configuration file), knowing that most of them are stale phantoms.
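
For the automation, something along these lines ought to do it - a rough sketch, where the 30-day cutoff and the archive directory are my own choices rather than anything the WMS mandates:

mkdir -p /var/glite/ice/jobdir-archive
# move ICE job records older than ~30 days out of the way
# (deleting them instead should be equally safe - they're tiny)
find /var/glite/ice/jobdir/old -type f -mtime +30 \
    -exec mv {} /var/glite/ice/jobdir-archive/ \;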

The only discussion I can find related to that error message resulted in pointing the finger at the glite-wms-ice-safe process. ICE has two processes, and it appears that the ice-safe is the part responsible for cleaning up. However, as far as I can tell, both processes are running on each of our WMS's, so this appears to be a different case from the previous one. It might have been the case that the ice-safe process died, and when it's restarted it's not removing old jobs? I don't know - if I find out I'll update here.

The purpose of this post is to get the error message from the WMS into Google, on the same page as something that talks about the issue and its resolution - in case it freezes up on us again.

Monday, December 14, 2009

(Almost) 100% Success for Glasgow!


We noticed today that (apart from that pesky red mark for CMS over the past 6 months, and some yellow on the WMS tests) we're looking incredibly green and functional on the Glasgow Dashboard at the moment. So, we took a picture before that changed...

(The Glasgow Dashboard is Mike's mashup of all the useful metrics on the web concerning UKI-SCOTGRID-GLASGOW, now over two, alternating, pages. It's actually quite useful, and remarkably festive at this time of year.)

Thursday, December 10, 2009

openmpi magic

I have just rebuilt openmpi-1.3.4 for use with CASTEP. This is built into a useful /opt location with Torque support and F90 support via gfortran44.

gLite support for Open MPI is fairly generic, which means the stock openmpi RPM install does not have any useful batch or interconnect support. So anything out of the ordinary requires a custom build.

I will make the RPM available for download here in the next few days.

The magic for building from a src rpm:

cd /usr/src/redhat/SPECS/
rpmbuild -ba \
    --define '_prefix /opt/openmpi-1.3.4' \
    --define '_mandir %{_prefix}/share/man' \
    --define 'configure_options --prefix=/opt/openmpi-1.3.4 --with-tm=/usr/ FC=gfortran44 F77=gfortran44 CC=gcc44 CXX=g++44 FFLAGS=-O2 FCFLAGS=-O2 CFLAGS=-O2 CXXFLAGS=-O2' \
    openmpi-1.3.4.spec
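
A quick sanity check that the Torque (tm) support actually made it into the build - assuming the /opt prefix above - is to look for the tm entries under the ras and plm MCA components in ompi_info:

/opt/openmpi-1.3.4/bin/ompi_info | grep -i ' tm '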

Full Instructions are here.

Tuesday, December 08, 2009

issues with gfortran43/44 and mpich

I am finally getting to the bottom of what has been going wrong with recompiling MPICH for F90/F95 (required for CASTEP - a demanding Fortran code). I have now narrowed it down to one issue: recompiling MPICH with gfortran43/44 (needed for CASTEP on SL5) rather than plain old gfortran.

FC="gfortran44" ; export FC;
F90="gfortran44" ; export F90;
...
--enable-f90modules

The SAM MPICH test runs after recompiling with F90 support using gfortran and in fact it works fine on SL4 and SL5. So that was not the issue.

message size     transfer time    bandwidth
32 bytes         0.000000 sec     inf MB/s
2048 bytes       0.000117 sec     17.476267 MB/s
131072 bytes     0.001445 sec     90.687654 MB/s
8388608 bytes    0.078437 sec     106.946397 MB/s

It turns out that MPICH just doesn't work when compiled with gfortran43/44. This leaves me in a bit of a pickle, as CASTEP will not compile on SL5 with gfortran - you have to use gfortran43/44!

Time for the backup plan ..... openmpi.

Thursday, December 03, 2009

lightning testing of glexec with SCAS

Well, since it is looking increasingly likely that we will be moving to some form of identity switching at our sites - to give us more information about who is actually running jobs via the pilot frameworks - I thought I would give it a whirl.

So in some lightning tests - a phrase I am stealing from the lightning talks sometimes given at technical conferences - I am trialling glexec for identity switching coupled with SCAS for centralised allow/deny decisions.

Here is what was tested:

an install of SCAS
an install and test of GLEXEC with SCAS on the LCG-CE
an install and test of GLEXEC with SCAS on CREAM [1]
an install and test of GLEXEC on the WN (SL4)
an install and test of GLEXEC on the WN (SL5)

Detailed Instructions and Results can be found here

The long and short of it is that it is very easy to set up SCAS and use it on whatever service you want. So easy, in fact, that once your SCAS server is up and running you can direct calls to it from your CEs in a matter of minutes. glexec on the WN is just as easy; all that remains is for someone to use it.

We have not currently rolled any of this into production, but I am confident that it could be done quickly and safely. Since we are into real data taking, 'safely' is the keyword: we want no unnecessary downtime, which I think is achievable.

Thanks to Oscar at Nikhef for answering questions.

1: there appeared to be a certificate permission issue when calling SCAS from CREAM that prevented job submission. It looks like you need to copy the hostcert/key by hand to another cert owned by the tomcat user.


-rw-r--r-- 1 tomcat tomcat 2187 Dec 4 10:44 tomcathostcert.pem
-r-------- 1 tomcat tomcat 1863 Dec 4 10:44 tomcathostkey.pem
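
For reference, the copy boils down to something like this - a sketch, assuming the usual /etc/grid-security location for the host certificate and key:

cd /etc/grid-security
cp hostcert.pem tomcathostcert.pem
cp hostkey.pem tomcathostkey.pem
chown tomcat:tomcat tomcathostcert.pem tomcathostkey.pem
chmod 644 tomcathostcert.pem
chmod 400 tomcathostkey.pem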

Friday, November 27, 2009

Mysql binary logging revisited

After last time, I'd poked at the LB server's databases, so we were getting effectively lock-free backups on one of the servers.

In the intervening period, after it was seen to be stable, I did the same for the other server.

However, Mike noted that the disk space used for the logs was growing rapidly. (I blame those pesky LHC physicists. Running jobs on our systems - anyone would think there was data to analyse or something ...). Because we're running with LB servers on the same machines as the WMS, this means that the /var partition contains both the database files and the users' sandboxes - hence the old binary logs take space away from the users' stuff. (That's something to think about for the reinstall - it might be worth separating them.)

Time to automate log trimming. Firstly, the manual side: the statement

PURGE BINARY LOGS BEFORE '2009-11-01 00:00:00';

to the server trims out some of the older logs. You can also trim up to a given file.

Better than that, however, is to put

expire_logs_days=8

in the my.cnf. This tells mysql to retire logs older than 8 days at server start up, or when the logs are flushed.

So as long as we ensure that we flush the logs when we take a full backup, the logs are automatically trimmed to just over a week's worth. Add that parameter to the mysqldump script, and we're done.
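
As a rough sketch, the dump command in the script ends up looking something like this (credentials and paths here are invented; --flush-logs is the important addition, and --single-transaction is what gives the lock-free dump for InnoDB tables):

mysqldump --all-databases --single-transaction --flush-logs \
    > /var/backups/lb-dump-$(date +%F).sql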

The binary logs have value, independent of the backups - there's a tool to read them, and look at what was happening. Whether 8 days is the best level for us is something that we'll have to monitor - arguments for shorter time periods seem stronger than for longer.

Wednesday, November 25, 2009

Torque 2.4.2 to the rescue

I previously blogged about our Torque 2.3.6 moms on SL5 continually seg-faulting. At first we thought it was a 32/64 bitness issue between our SL4 and SL5 moms running through the same pbs_server. However, a quick test with the SL4 nodes removed proved that this was not the case. A trawl through the source proved unproductive.

Therefore, it was time to go to Plan B. To that end I built the latest Torque release, 2.4.2, and tested it on our pre-prod staging cluster. This worked well with a configuration of a 2.3.6 server and 2.4.2 moms. The next step was a test on a single node in production. This was successful and was running jobs fine when all the other moms seg-faulted again; the 2.4.2 mom survived and continued to run. So a full roll-out is under way, and we will think about upgrading the server at a later date. The only point to note is that we have to fully drain a node before doing the upgrade, which is a pain; it does attempt a job conversion, but these are unsuccessful as far as we can tell and you end up with dead jobs holding onto job slots. A sketch of the per-node procedure is below.
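
For each node, the drain-and-upgrade dance looks roughly like this (a sketch; the exact RPM name and the pbs_mom init script name are assumptions about your install):

pbsnodes -o node123               # mark the node offline so no new jobs land on it
# ... wait for the running jobs on node123 to drain, then on the node itself:
service pbs_mom stop
rpm -Uvh torque-mom-2.4.2-*.rpm   # package name assumed
service pbs_mom start
pbsnodes -c node123               # clear the offline flag and return it to service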

So the moral of the story is stay away from 2.3.6 and go to 2.4.2 instead.
It is pretty easy to build, but I have hosted our builds here for anyone who wants them.

Tuesday, November 24, 2009

A tale of two job managers

A while back I posted about supporting the SL4 and SL5 OSes through the same batch system. Our solution was to use a torque submit filter to add extra node properties to jobs as they passed through the job managers on the CEs (a sketch of the idea is below). This, coupled with the corresponding node properties on all nodes in the cluster, worked quite well until I noticed that we were leaking CREAM jobs: jobs that should have requested SL5 were running on the SL4 nodes.
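
For context, a minimal sketch of the sort of submit filter involved (the sl5 property name and the logic are illustrative only; the real filter is wired in via the SUBMITFILTER entry in torque.cfg). Note that it only rewrites explicit "#PBS -l nodes=" lines - which turns out to matter, because jobs with no nodes line at all slip through untouched:

#!/bin/bash
# submit filter sketch: read the job script on stdin, tag any explicit
# nodes request with an ":sl5" node property, echo everything else as-is
while IFS= read -r line; do
    case "$line" in
        "#PBS -l nodes="*) echo "${line}:sl5" ;;
        *)                 echo "$line" ;;
    esac
done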

After some investigation it appeared that when I was testing the filter and running the cream pbs qsub submit by hand I was always setting the number of nodes, even if I only required 1 i.e.
 
as a pool account ....
/opt/glite/bin/pbs_submit.sh -c /bin/hostname -q q1d -n 1

This meant that there was always a #PBS -l nodes= line in the submission script. However, if you call pbs_submit.sh without the -n, no #PBS -l nodes= line appears in the final submission script at all; it relies on the PBS default that, if no number of nodes is specified, you get 1 node. This meant that my submit filter did not catch the number of nodes and did not add the node property at all!

As it turns out, on deeper investigation into the CREAM pbs_submit.sh script, when only 1 node is required it uses the PBS default behaviour and does not specify a number of nodes; only when more than one is needed (i.e. MPI) does it specify this. This is a change from the lcg-CE job manager, which always specifies a number of nodes, be it 1 or more. Something to remember.

To get round this I have added an additional line to the CREAM pbs submit script to always default to 1 node if not MPI. Not the best solution, but it's a short-lived tweak until we get rid of our SL4 support, which should be very soon.

/opt/glite/bin/pbs_submit.sh

[ ! -z "$bls_opt_mpinodes" ] || echo "#PBS -l nodes=1" >> $bls_tmp_file

Thursday, November 19, 2009

CE Publishing

The Problem

Publishing an inhomogeneous site 'correctly' is not trivial. This is now required in order to pass the new gstat2 Nagios tests. Things to remember -

* Physical is sockets/CPUs and Logical is cores.
* Physical * cores-per-socket = Logical, in order to pass the new central Nagios tests (worked example below).
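
For example (an illustrative node, not necessarily one of ours): a dual-socket, quad-core box publishes Physical = 2 and Logical = 2 x 4 = 8.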

If your cluster is inhomogeneous then you need to be able to publish the sub-clusters separately, or as one, or come up with a fudged number. It is made harder because we have one batch system with multiple CEs submitting to it.

Some Solutions

* Sub-Clusters [ what we have implemented at Glasgow ]
* Publishing decimal for cores

Our implementation is discussed here.

Please let me know if anything is wrong with this and I will update.

Segfaulting PbsMoms

We have an issue with segfaulting moms that seems correlated with the server trying to ping its moms. The server version is torque-2.3.6-2cri.x86_64.
We are currently supporting two OSes through the same batch system using a submit filter and node properties; therefore, we have two different versions of the moms.
Nodes 1->295 have moms torque-2.3.6-2cri.x86_64 and 296->309 have moms torque-2.1.9-4cri.slc4.i386

When the moms segfault we see that the torque-2.1.9 moms stay up and only the torque-2.3.6 moms all die. I ran one of them through GDB and can see the call stack:

(gdb) where
#0 mom_server_find_by_ip (search_ipaddr=177078032) at mom_server.c:450
#1 0x000000000041965e in mom_server_valid_message_source (stream=0) at mom_server.c:2022
#2 0x0000000000419870 in is_request (stream=0, version=1, cmdp=0x7fffff542ae8) at mom_server.c:2125
#3 0x0000000000416997 in do_rpp (stream=0) at mom_main.c:5351
#4 0x0000000000416a52 in rpp_request (fd=) at mom_main.c:5408
#5 0x00002ae8ae9f3bc8 in wait_request (waittime=, SState=0x0) at ../Libnet/net_server.c:469
#6 0x0000000000416c1d in main_loop () at mom_main.c:8046
#7 0x0000000000416ee1 in main (argc=1, argv=0x7fffff5431d8) at mom_main.c:8148
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) n
Program not restarted.
(gdb) bt full
#0 mom_server_find_by_ip (search_ipaddr=177078032) at mom_server.c:450
__v =
pms = (mom_server *) 0x6cbb80
addr =
#1 0x000000000041965e in mom_server_valid_message_source (stream=0) at mom_server.c:2022
addr = (struct sockaddr_in *) 0x187ef434
pms = (mom_server *) 0x0
id = 0x43be08 "mom_server_valid_message_source"
#2 0x0000000000419870 in is_request (stream=0, version=1, cmdp=0x7fffff542ae8) at mom_server.c:2125
command =
ret = 0
pms =
ipaddr =
id = "is_request"


So it looks like it's time to dive through the source for mom_server_find_by_ip (search_ipaddr=177078032) at mom_server.c:450 - or install torque-2.4!

Tuesday, November 17, 2009

Arc, authorisation and LCMAPS

As a gLite site, it would be ideal if we could have the same mapping between certificate DNs and unix user names that is used with our existing CEs.

Which means using the gLite LCMAPS to make decisions about what username each user has.

This is supported in Arc, but not in the same fashion.

The best approach appears to be: have an initial mapping listed in the grid-mapfile (there are utilities to make this easy), which allows a first pass of authorisation. Then the mapping rules in the gridFTP server are applied - this is where LCMAPS comes in.

Interestingly, Arc makes it very easy to do the thing we found hard with LCMAPS - to have a small set of 'local' users with fixed permanent mappings (independent of VO), and VO-based pool accounts for other users.
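
Purely for illustration (the DNs and account names here are invented), the first-pass grid-mapfile is just the familiar one-DN-per-line format; the VO-based pool mapping for everyone else is then handled by the gridFTP/LCMAPS stage described above:

"/C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=some local user" ssp001
"/C=UK/O=eScience/OU=AnotherSite/L=Physics/CN=a n other" dteam001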

However, it's in the LCMAPS integration that things get a bit stuck.

It's a silly 32/64 bitness issue. On a 64 bit system, yum pulls out the 64bit Arc - as you might expect. Sadly, there's not a 64 bit version of LCMAPS in the repositories as yet.

So it's a case of hacking what I need out of ETICS. I'll post a recipe when I have one, but this is a pretty temporary situation - it looks like Oscar has LCAS/LCMAPS pretty much ready, but they're not a separate package, so they are waiting on the SCAS, CREAM or WMS SL5 64-bit packages.

Wednesday, November 11, 2009

nmon

Seeing Sam's post about NFS prompted me to mention 'nmon' - it's kinda like 'top' on steroids and does particularly useful trend plotting. Originally a hack for AIX, but ported to Linux once a certain vendor realised people weren't just buying PowerPC systems....

Anyway - go grab it from http://www.ibm.com/developerworks/aix/library/au-analyze_aix/ - the Linux version is now open source, I see.

NFS Load Tweaks: a Brief Guide for the Interested Enthusiast

I was asked about the mystery of NFS server tweaking in a dteam meeting, so I thought I'd compile this brief blog post.
As with all actions, there are two steps: first, gather your information, second, act on this information.

1) Determining your current NFS load statistics.

NFS logs useful information in its /proc entry...

so:

> cat /proc/net/rpc/nfsd

rc 0 28905480 1603148913
fh 133 0 0 0 0
io 3663786355 2268252
th 63 362541 16645.121 3156.556 747.974 280.920 148.129 100.155 61.480 42.249 40.829 90.461
ra 256 1069115586 4089582 3055815 2625032 2228952 2114496 1983622 1765372 1743563 1610465 89609536
net 1634942152 0 1634971040 2214677
rpc 1630024431 0 0 0 0
proc2 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
proc3 22 1573543 1535237104 8743056 1545350887 1532645717 29571823 1179900114 9214599 6691508 538717 366274 0 2801854 39816 505310 4298 2486034 62181794 53164 2414727 0 986878
proc4 2 0 0

This somewhat arcane-looking output is full of variously useful statistics about your NFS daemon.

The "rc" (read cache) field gives the fraction of cache hits, misses
and "nocache" (interactions which bypassed the cache) for read
operations.

The "fh" (file handle) field's most important entry is the first - the
number of stale file handles in the system. If you have flaky NFS, for
example, this will be non-zero.

The "io" field is simply cumulative I/O, in bytes (read first, then written).

The "th" (threads) field is the most interesting field for NFS load
optimisation. The first entry is the total number of threads currently
executing. The second is the number of seconds (?) all threads were in use
(which means your NFS was maxed out in active connections). The
remaining 10 entries are a histogram of NFS thread utilisation, in
seconds (it seems to be hard to get NFS to reset this; restarting the
daemon definitely doesn't). Plotting this gives you an idea of how
much time your NFS server spends in various load states.
Ideally, you want the last entry (90-100% use) to be comfortably in
the tail of your distribution...
If you have indications that your server spends a lot of its time with
all threads in use, you should increase the maximum number of threads
- powers of 2 are recommended.

The "ra" (read-ahead cache) field gives similar results, but for the
read-ahead cache. The first number is the size of the cache, the next
10 are a histogram showing how far into the cache entries were found
(so, the first number is the number of times an entry was read from
the first 10% of the cache), and the last is for cache misses.
Obviously, if you're getting a lot of cache misses *and* your cache
hits histogram is heavily right-skewed, it's worth increasing the
cache size. (Conversely, if you have a heavily left-skewed histogram,
and few cache misses, you may be able to manage with a smaller cache.)

The remaining fields are RPC process info fields, which are less relevant for our purposes.

2) Optimising your NFS.

The most important thing to ensure is that there are enough resources for the peak load on your NFS service. NFS will spawn new threads to handle new active connections, and if its max-threads limit is too low, you'll get brown-outs under high load. Starting at least four instances of nfsd per processor (and, on modern processors, up to 8 per core) is recommended as a sensible configuration. You can set this on the command line for the nfsd service by simply passing the bare number as an option.
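
Concretely, on an SL-style box that is either a one-off change at runtime or a setting the init script picks up (the figure of 64 threads is just an example):

# one-off, takes effect immediately:
rpc.nfsd 64
# to make it persistent, set the thread count in /etc/sysconfig/nfs
# so the init script picks it up at start-up:
#   RPCNFSDCOUNT=64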

And, of course, if you can bear the risk of data-loss (or silent data
corruption!) on sudden server loss, setting the export option "async"
trivially increases your network throughput by removing the need for
confirmation and syncing of writes between clients and server.
See the NFS config faq at:
http://nfs.sourceforge.net/#section_b
for more details.
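
For example, an /etc/exports line along these lines (the path and client network are placeholders for whatever you actually export), followed by an exportfs -ra to re-export:

# /etc/exports
/cluster/share   10.141.0.0/16(rw,async)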

You may also wish to do the standard setting of packet sizes with
respect to MTU that you would normally do for a network-based
protocol. The general process (and some more details) are covered at:
http://nfs.sourceforge.net/nfs-howto/ar01s05.html

Friday, November 06, 2009

Arc, and the installation

We've been fiddling with the NorduGrid Arc middleware a bit. Not just out of random curiosity, but more trying to get a handle on the workloads that it suits better than gLite, and vice versa. It does a number of things differently, and by running an Arc CE in parallel with an lcg-CE and CREAM, we can do some solid comparisons. Oh, and the name of the middleware is also much more amenable to puns, so expect a few groaners too.

So, consider this the first in a series. During this process, we expect to end up with a set of notes on how to install and run an Arc setup, for people already familiar with gLite.

Firstly, install. We took a blank SL5 box, added the NorduGrid repos, and then

yum groupinstall "ARC Server"
yum groupinstall "ARC Client"

Well, very nearly. There's one more thing needed, which is to add the EPEL dependencies (libVOMS is the key lib):

yum install yum-conf-epel

The next step is to configure it. That's all done in /etc/arc.conf, and is the subject for later posts.

There is a need for a filesystem shared between the CE and the worker nodes, so we fired up a spare disk server for NFS.

Startup is three services, already configured in /etc/init.d: gridftp, grid-infosys and grid-manager.
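
In other words, something like this (using the init script names above, with chkconfig to make them stick across reboots):

for svc in gridftp grid-infosys grid-manager; do
    service $svc start
    chkconfig $svc on
done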

Ta-da! A running Arc CE.

Ok, so there's a fair bit glossed over in the configuration step. Next time, I'll talk about how I configured it to work with our existing queues - and where the expectations for Arc differ from gLite.

Friday, October 30, 2009

worker node on demand

Virtualisation is a hot topic again for grid services and worker nodes on demand.

KVM, Xen, VMware - everyone is using a different one.
Virtualisation for the cloud - Nimbus, OpenNebula, Eucalyptus.

The future... ??
1. plain signed virtual images transported from site to site.
2. virtual images including experiment software.
3. connecting to pilot job frameworks, instantiated with virtual images.
4. pilot frameworks replaced by commercial domain schedulers: virtual clusters.

Monday, October 26, 2009

HEPIX is GO!

HEPIX Workshop

Site Reports Session

CERN:
Getting serious about ITIL. Solaris being phased out. Getting serious about 10GigE.
Lustre pilot project. New purchases discussed.

JLAB:
New LQCG Cluster "2009 Quad Infiniband - ARRA Cluster"
Storage - whitebox: 14 AMAX servers, Solaris w/ZFS or Lustre
Compute - Dell PowerEdge R410, 2 x 4 GHz, QDR Infiniband, 24GB RAM

Auger Cluster Upgraded
Nehalems - Intel X5530, dual CPU, quad core, 24GB RAM, 500GB SATA
(seeing I/O contention on disk when running 14/16 jobs)
OS switch from Fedora 8 32-bit to CentOS 5.3 64-bit

No real grid computing.
IBM TS3500 tape library installed; StorageTek Powderhorn silos replaced.
80 production VMs on VMware ESX 3.5, planned to move to vSphere 4.0.

GSI:
FAIR - new accelerator discussion. The futuristic talk!
The Cube DataCentre Building: 1000 19" water cooled racks held in 26x26x26 cube building. Lifts to reach the machines. Iron structure for racks to sit on.

CCIN2P3 LYON:
T1 for the 4 LHC experiments plus D0 and BaBar. SL5 migration in Q2 2010 for both the main cluster and the MPI cluster. New purchases and a new server building.

STORAGE Session

Your File System NexGen openAFS (Jeffery Altman):
YFS is now funded by the US Government to create a next-gen OpenAFS, with 2-year funding. Deliverables include an assessment of current AFS and a 2-year upgrade plan for client and server. Still open source.

StoRM and Lustre:
IOzone discussion, HammerCloud tests discussion, benchmarking summary. Good results, though performance was below the IOzone tests. WMS jobs and Panda jobs behave differently. The file:// protocol performs well but requires the VO to support it. Open questions: Lustre striping (yes or no?), performance (RAID config?), monitoring (still work to be done), support (kernel upgrades can take a while to be made available) and benchmarks (are they realistic?). Tuning still to do.

Lustre at GSI:
Users - ALICE analysis for Tier 2, GSI experiments, FAIR simulations. Still on 1.6.7.2; 1 PByte, >3000 nodes. Foundry RX32 Ethernet switch. MDS HA pair, one standby. 84 OSSes, 200 OSTs. MDS: 8-core 3GHz Xeon, 32GB RAM. Real throughput testing with the ALICE analysis train: 50Gbit/s using 2000 cores. Hardware and software issues; a complex system, and vulnerable to network communication problems. Using the Robin Hood filesystem monitor for audit and management - this protects the MDS by directing such requests (e.g. top ten users, file moves etc.) to a MySQL instance instead. Using this rather than e2scan.

Hadoop on your worker nodes using local hard drives & FUSE:
Hadoop compared against Lustre. Performed well with 8 jobs running. Replication of files provides redundancy. The cost and maintenance factors are very favourable for small sites. Deployed at some sites in the US. Not really a Tier 1 deployable solution. Name node redundancy exists (will lose at most one transaction) but requires additional software.

Virtualization Session

lxcloud at CERN:
CERN has developed a proof of concept for virtualised worker nodes: 'golden nodes' serving images to the Xen hypervisors using OpenNebula. Also looked at Platform's VMO. A production lxcloud is being built: 10 machines, 24GB RAM, 2TB disk, dual Nehalem. Starting with Xen. Production release by March 2010. Memory is an issue as the hypervisor itself requires some, i.e. with 16GB RAM you cannot run 8 x 2GB VMs.

Fermigrid:
Has moved much of its infrastructure to Xen HyperVisor. Looks like a solid infrastructure. Investigating KVM with the possibility of a move in the next few years if it proves to be better. INFN mentioned Xen vs KVM at Hepix Spring 2009 for discussion of differences.

Monday, October 19, 2009

Another new VO at Glasgow

Today I finally got time to create a new VO for our new users in Solid State Physics.
vo.ssp.ac.uk
This is now active across the cluster; users can sign up to the VO via our VOMS server on svr029, and it will be used to host users of CASTEP and other departmental SSP users.

Our local wiki page covers running CASTEP at Glasgow. Only the MPI version left to get working now.

Monday, October 12, 2009

CASTEP, A Test of True Grid

Along came another user with a requirement for MPI. Can we run it? Well, yes you can, but remember our interconnects are just plain old Ethernet and nothing fancy like Myrinet or Infiniband. We are not an HPC cluster but an HTC cluster.

So we have been building CASTEP, an F90 code that is heavy on the MPI scatter/gather - a test of true grid for any HTC cluster. First off, CASTEP requires a minimum of make 3.81 and gfortran43; handy that we moved to SL5, as these are now the standard. Coupled with making sure that the required libraries - fftw3, BLAS and LAPACK - are all built with the same compiler, gfortran43, this allowed the single-core version to be built and installed onto the grid.
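
As an aside, keeping the compilers consistent is just a matter of exporting them before each build; a rough sketch for fftw3 (the prefix and flags are illustrative, and the same idea applies to BLAS/LAPACK):

export CC=gcc43 FC=gfortran43 F77=gfortran43
./configure --prefix=/opt/fftw3 --enable-shared
make -j4 && make install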

An MPI version is turning out to be a bit more work. First off, the old, outdated and no-longer-developed MPICH libraries are not built with F90 support enabled by default, so we got hold of the source to do a recompile with F90 support enabled for gfortran43. There also appeared to be a bug in the gfortran support, so we had to patch the src RPM with a patch we located online. This finally allowed us to build the MPICH library, which has been tested by compiling an MPI job in C and in F90, both of which run successfully.

Unfortunately CASTEP still doesn't run using it so more digging required.