Thursday, May 31, 2007

Users and Stalled Jobs

I noted that I had qdeled quite a few jobs from the cluster two days ago. Well, the ILC user I contacted through the CIC portal responded, apologised and thanked me for the suggestion of adding a timeout to lcg-cp. However, the biomed user (110 jobs) didn't answer. And in fact they continued to submit jobs onto the cluster and I was forced to clear out another 42 jobs this afternoon. So, if nice doesn't work, try nasty - I have ticketed them and will ban them from the cluster if I don't hear back within 24 hours.

Wednesday, May 30, 2007

ScotLUG talk

Just an FYI - I shall be speaking at the Scottish Linux Users' Group on Thursday evening, giving a brief (3 slides) introduction to "The Grid" from a WLCG viewpoint and its computing demands, and then moving on to how we use cfengine and caffeine to keep sane (more than 3 slides).

Not that it's a rehash of my HEPiX talk or anything :-)
Details will be on http://www.scotlug.org.uk/wiki/2007-05-31

VOMS is a Good Thing (tm)

Spent some of yesterday playing with VOMS and the static grid-mapfile mappings we put into the system. The point of these was to support users who were not yet in any VO and might be using globus-job-* commands to submit jobs to the cluster. For these users it was obviously important that they mapped to the same account on the batch system as on the UI. (What actually happens is that the UI mirrors its grid-mapfile from the CE.)

However, we now have a different class of user, one who might use gsissh to access the cluster, but use the RB to submit jobs and be in a VO. For this user we need a local mapping in the grid-mapfile, but we need to ensure that job submission is made with a normal VO mapped pool account.

To test this I mapped myself to a local account (gla012) on the UI, using gsissh to log in. Then I initialised a VOMS proxy and submitted a job. The job correctly ran as dteam.

This means we can support our multiple classes of users, and also users in multiple VOs, in a relatively straightforward way.

One point about the local account, though: we will put it into the correct group for the user's expected VO. This means that job submission with a vanilla (non-VOMS) proxy will also work as expected.

(An outstanding problem might well be data write access - at the moment the data users put onto NFS using their local account is fine, but their jobs, running as pool accounts, cannot write back to that area. This may be an issue for some people.)

Tuesday, May 29, 2007

The Road To Hell Is Paved With Data Management...


After running rather full for more than a week, we had lots of jobs which seemed to be hanging. On investigation most of these were due to stuck data management commands - mostly lcg-cps. While the lcg-cp failed to exit, the job sat idle, slowly approaching its maximum wallclock time, at which point it would be killed.

As we had a queue of jobs which wanted to run it seemed absurd to leave these slackers until the batch system sent them to their fiery doom. Better to kill them off early and get some well behaved jobs in.

With a small minority of ilc jobs I killed the lcg-cp by hand, and a few of them managed to restart properly. However, when I discovered a biomed user with 110 jobs hung it was going to be far too tedious to try and script checkjob, ssh, grep and kill to attempt to jump start them - so I wielded my trusty qdel.

Unfortunately, qdeling 110 jobs in one go produced a rather fierce load spike on the CE. The GRIS plugin then timed out and we spasmed into 4444 waiting jobs and the usual 68 years of ERT. Whoops! Next time I'll delete them more slowly.
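
Something along these lines should do the trick next time - a minimal sketch, assuming the stuck job IDs have already been collected into a file (the filename and the 10 second pause are made up for illustration):

import os, time

# Hypothetical file of stuck job IDs, one per line, gathered beforehand
# (e.g. from qstat/checkjob output).
jobs = [line.strip() for line in open('stuck_jobs.txt') if line.strip()]

for job in jobs:
    os.system('qdel %s' % job)   # kill one job at a time
    time.sleep(10)               # let the CE (and the GRIS plugin) catch its breath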

I contacted the users via the CIC portal interface that Alessandra had suggested. For this purpose it seems to work rather well.

Finally, after all of that, we got 3 jobs that were sitting in "W" state. No ideas, nothing useful in the PBS logs and many other things to do, so finally I had to qdel them as well.

It takes a long time to clear out this stuff - a couple of hours at least. Fortunately it doesn't seem to happen too often.

Transfer Channels

Back to 'proper' work, and I realised that the transfer channel settings are still very conservative (with the exception of Imperial). Rather than just increasing them to a high value, I thought I'd plot some timings to show the difference it makes to a 0.5TB FTS session.

The 4 steps on the image show: seeding to RAL-T2 with lcg-cr from a local workstation, then 3 sets of transfers up to uki-scotgrid-glasgow with the files setting on the transfer channel set to 5, 10 and 15 respectively.

My script *should* have continued with larger steps, but some of us forgot to renew the voms-proxy...

Friday, May 25, 2007

Fiddly old VO configurations

Got an email from a Zeus VO member who couldn't use the SE. Poking around revealed that we hadn't updated the configuration to take the VO membership list from VOMS instead of the old LDAP system.

As YAIM is now in a mess, I had to modify /opt/edg/etc/edg-mkgridmap.conf and /opt/lcg/etc/lcgdm-mkgridmap.conf by hand and then rebuild the mapfiles.

That should fix things, but I so, so wish we had a one-stop shop for configuring VOs. It's too easy to leave VOs half-cocked on the system, and often it's months before anyone notices or complains.

Of DPM, MySQL and MonAMI... (part 2)

Paul put MonAMI back onto the DPM yesterday. We saw a very similar rise in the number of MySQL connections as before, but as we were on the ball with this we were able to look at who was connecting via SHOW PROCESSLIST. Turns out that all the extra connections were from DPM itself. MonAMI was not to blame.
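
For the record, counting the PROCESSLIST entries per user only takes a few lines of Python with MySQLdb (a rough sketch - the host and credentials here are placeholders):

import MySQLdb

# Placeholder credentials for the DPM database server.
conn = MySQLdb.connect(host='localhost', user='monitor', passwd='secret')
cur = conn.cursor()
cur.execute('SHOW PROCESSLIST')

counts = {}
for row in cur.fetchall():
    user = row[1]                          # columns are Id, User, Host, db, Command, ...
    counts[user] = counts.get(user, 0) + 1

for user, n in sorted(counts.items()):
    print user, n
conn.close()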

Early this morning the number of connections came back down again, which might indicate that under certain circumstances, DPM starts an extra connection to the database which it then does not let go of for some time (the 24 hour slot for any SRM transaction to complete?). I wonder if this might be the cause of the rare putDone failures we saw.

Thanks to MonAMI we'll be able to watch for this, and correlate any failures with how busy MySQL was.

Paul also did some pretty RRD aggregate plots, which are very much easier to read. Thanks! Note how MonAMI is able to distinguish between atlas and atlas/Role=production, which is incredibly useful.

My name is total (well, maybe...)

I heard from Hannah today that the problems using the LFC at RAL for totalep were down to DNS problems (globus being very picky about hostnames). She's managed to overcome them by hard-coding the LFC's hostname into /etc/hosts, or by using a different machine on a subnet with better DNS. Hopefully we'll see some jobs soon.

Speaking of jobs from minor VOs, I noticed that camont sent quite a few jobs in the last week. In total they've done nearly 500 jobs and consumed about 450 CPU hours (812 kSI2k hours).

Thursday, May 24, 2007

Users: The Good, The Bad and The Ugly...

Last Thursday I went to see a few potential ScotGrid users. There are a couple of users in Civil Engineering who'd like to use the cluster. One of them has Monte Carlo code and this should be no problem (the good!). Another has a real need for MPI code (the bad!), which is challenging, but we obviously hope to build on the good work done on this in GridIreland (see their wiki).

Finally, I spoke to a postgrad student in IBLS (Institute of Biomedical and Life Sciences) who has a lot of code to run on protein data (I don't really understand the problem, actually). Now, she has to run code she hasn't written herself, she doesn't have the source code, and the damn thing asks questions in an interactive mode before it runs. (The ugly!)

On the computing service cluster she has to run the code in PBS interactive mode, in order to answer the questions. Then she has to hope that the network stays up between her machine and the cluster, because if it goes down (and often it does) it kills her job.

The first thing I did was show her how to use screen, so that at least she can run in a detached terminal. (I found a good tutorial.)

Last night, I started to write a Python wrapper for the program which will take a defined set of default options, which can then be overridden on the command line, and pass them to the program. This turned out to be quite troublesome: using the Python Popen3 class (from the popen2 module) I just could not get the read() or readline() methods to behave properly with the select() call (even in non-blocking mode) - select() would return a filehandle with data only once - and neither did the write() call seem to push input properly into the program. However, when I switched to the os.read() and os.write() calls on the raw file descriptors, instead of the file object methods, things started to work exactly as I expected. Hopefully this will run the program properly in a non-interactive mode and open up the possibility of running on our cluster.
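
The working pattern looks roughly like this (a stripped-down sketch - the program name and the canned answers are made up):

import os, select, popen2

answers = ['default1\n', 'default2\n']        # hypothetical answers to the program's questions

p = popen2.Popen3('./interactive_program')    # hypothetical interactive binary
outfd = p.fromchild.fileno()
infd = p.tochild.fileno()

for answer in answers:
    ready, _, _ = select.select([outfd], [], [], 30.0)
    if not ready:
        break                                 # give up if the program goes quiet
    print os.read(outfd, 4096)                # os.read() behaves where fromchild.read() did not
    os.write(infd, answer)                    # likewise os.write() rather than tochild.write()

p.tochild.close()
print os.read(outfd, 65536)                   # collect whatever output is left
print 'exit status:', p.wait()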

Of course, there's a different track to this as well - we urgently need to get VOs set up for these users so that they can use edg-job-* commands or ganga to run their jobs. Watch this space...

Jabber Dabba Do!

OK, so in between taking the kids on an intellectual day out (the new Barnstormer's ace, BTW) I have registered a new Gmail account for uki.scotgrid.glasgow and played with Net::Jabber on svr031. It took quite a few Perl dependencies to get it working, as you need IO::Socket::SSL in place for when Google switches to TLS.

The supplied nagios "notify_via_jabber" doesn't work out of the box, but I have a simple test script by Thus0 I lifted from the web that works fine - Now all I need to do is rewrite the notify script with the correct incantations from the test one.

Plan is then to have Jabber notifications for certain classes of nagios alert.
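
The production script uses Perl's Net::Jabber, but just to show how little is involved, the same trick sketched in Python with the xmpppy module looks something like this (the account, password, recipient and alert text are placeholders):

import xmpp

# Placeholder Google Talk account details.
jid = xmpp.protocol.JID('uki.scotgrid.glasgow@gmail.com')
client = xmpp.Client(jid.getDomain(), debug=[])
client.connect(server=('talk.google.com', 5222))       # xmpppy negotiates TLS if the server asks
client.auth(jid.getNode(), 'not-the-real-password', resource='nagios')

alert = 'NAGIOS: disk001 - DISK CRITICAL'               # made-up alert text
client.send(xmpp.protocol.Message('oncall@example.org', alert))
client.disconnect()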

UPDATE: Success! I added a new test on the disk servers, as I knew there were some non-DPM boxes amongst them. Screenshot of the Gaim/Pidgin popup below.

Monday, May 21, 2007

Of DPM, MySQL and MonAMI...

Paul has installed MonAMI onto our DPM, which has been very useful (and will become more so when we get nagios running again). However, we started to report zero storage over the weekend, which was tracked down to MySQL running out of connections (as DPM doesn't have a monitoring API we have to query the db directly, which is not ideal). When I looked in detail I (eventually) found that MonAMI had eaten all of the MySQL connections by swallowing sockets.

Paul is investigating and seems to have found at least one place where connections could leak (although he's unclear why it was triggered).

However, even stopping MonAMI at 11pm last night didn't entirely resolve the situation. At some point in the early hours MySQL seemed to again run out of connections. This caused some of the DPM threads to go mad and write as fast as they could to the disk. By 6am there was a 2.5GB DPM log file and / was full. Yikes.

This morning I had to stop all of DPM and MySQL, move the giant logfile out of the way, and then do a restart.

Paul will try the fix soon, but this time we'll keep a much closer eye on things.

I believe we should also make sure that /var/log on the servers is a separate large partition in the future. Although we have enough space in / during normal running, clearly an abnormal situation can fill things up pretty quickly - and running out of space on the root file system is not desirable!

Friday, May 18, 2007

iperf on Fibre


Item 1 on the to-do list below was to retest bandwidth and check nothing's broken.

So, one quick repeat of the iperf tests that Greig and I did earlier and it all looks similar. Still got the asymmetry though. Need to find a chunky machine elsewhere and test.

March of the fibre...

The new ScotGrid cluster was connected directly by fibre from the Nortel switch stack down to the campus WAN router this morning. This cuts out 10m of copper and a hop through an old BayStack router. David reports there was ~2 minutes of network outage, which hopefully no one noticed.

This should open the way to getting a dedicated 1Gb link right to SJ5.

We now want to
  1. Retest bandwidth to check nothing's broken
  2. Move the gridmon box into the new cluster area
Right, come on LHC, where's that data...

Thursday, May 17, 2007

A Morning With Ganga

I (finally) downloaded and installed ganga onto one of the cluster UIs. It's a cinch to install, just run one python script, add an element to your PATH and you're away.

The next step is going through the User Guide. Although this was written for 4.2, there's nothing I found which didn't basically work in 4.3.1.

It's quite easy to setup and run simple jobs:

In [2]:j=Job(application=Executable(exe='/bin/echo',args=['Hello, World']))

In [3]:j.submit()
Ganga.GPIDev.Lib.Job : INFO submitting job 0
Ganga.GPIDev.Adapters : INFO submitting job 0 to Local backend
Ganga.GPIDev.Lib.Job : INFO job 0 status changed to "submitted"
Out[3]: 1

In [4]:outfile=file(j.outputdir+'stdout')
Ganga.GPIDev.Lib.Job : INFO job 0 status changed to "running"
Ganga.GPIDev.Lib.Job : INFO job 0 status changed to "completed"

In [5]:print outfile.read()
Hello, World

However, that's spawning a job which just runs on the local machine - how easy was it to run on the grid?

Answer: very easy!

In [39]:l1=Job(backend=LCG())
In [42]:l1.application=Executable(exe='/bin/echo',args=['Hello, World'])

In [44]:l1.submit()
Ganga.GPIDev.Lib.Job : INFO submitting job 3
Ganga.GPIDev.Adapters : INFO submitting job 3 to LCG backend
Ganga.GPIDev.Lib.Job : INFO job 3 status changed to "submitted"
Out[44]: 1

In [45]:l1.status
Out[45]: submitted

In [46]:l1.backend
Out[46]: LCG (
status = 'Ready' ,
reason = 'unavailable' ,
iocache = '' ,
CE = None ,
middleware = 'EDG' ,
actualCE = 'ce.ulakbim.gov.tr:2119/jobmanager-lcgpbs-dteam' ,
id = 'https://svr023.gla.scotgrid.ac.uk:9000/-wOjj5xKfjcrKIXiCEISNA' ,
jobtype = 'Normal' ,
exitcode = None ,
requirements = LCGRequirements (
other = [] ,
nodenumber = 1 ,
memory = None ,
software = [] ,
ipconnectivity = 0 ,
cputime = None ,
walltime = None
)
)
In [52]:
Ganga.GPIDev.Lib.Job : INFO job 3 status changed to "completing"
Ganga.GPIDev.Lib.Job : INFO job 3 status changed to "completed"

In [54]:print file(l1.outputdir+'stdout').read()
Hello, World

Wonderful! Didn't have to do any of that nasty edg-job-* stuff. And it ran in Turkey - pretty cool.

I have also now discovered how to define and submit batches of jobs to the grid. This snippet defines a set of 10 jobs:

a=list()
for i in range(10):
    a.append(Executable(exe='/bin/echo', args=[str(i)]))
s=ExeSplitter(apps=a)
j=Job(splitter=s,backend=LCG())
j.submit()

Submitted that and I'm running jobs in China, Italy, Greece, Pakistan, Russia, Austria, France, Spain and Switzerland.

I think this is the first time in a while I've thought "Hey! The grid is actually cool."

Urgent Updates Urges

The latest gLite update (r24) is labeled as urgent. Turns out that this is only because the VOMS certificate for lcg-voms.cern.ch is going to expire at the end of the month.

Fortunately, we have the lcg-vomscerts RPM directly controlled by cfengine, so it was a simple matter to update this part of the system.

Of course, while we're languishing at r20 the world is marching on, so we'll have to play catch-up sometime. Currently I've scheduled next Thursday as our site upgrade day. I'm glad we have the cluster independent of any other authentication system - it means finding all of those new UIDs for sgm and prd accounts will not be a problem.

(I'm also glad we didn't break our DPM with the current fiasco over gridmap file paths!)

Wednesday, May 16, 2007

Local Accounting Pages Ready


Billy's been doing a grand job knocking the local accounting pages into shape. This is based on Jamie's original work, but with some of the nastier hacks taken out and a lot of MySQL/PHP performance improvements from Andrew.

We can now see job numbers, CPU times, wall times and efficiencies for each group, plotted on a day, week or month basis.

There's still some work to be done - it would be nice to have a per-user plot, but the core is there and working well.

Oh, and it's checked into subversion finally. No more panics about losing the code.

It's probably in a good enough shape that other sites would find it useful now, actually.

Biomed get busy


The cluster got busy again from yesterday morning, with a whole pile of biomed jobs coming in. It was nice to see the resources being used.

Strangely, users are a bit like buses - you wait ages and then two come at once, because one of our local theorists submitted several hundred jobs last night too, so we had a job queue for the first time in ages.

Looks like everything ran successfully as well (and Steve's ATLAS jobs still got through) - I always worry that something subtle has broken which will only be revealed when the site gets busy.

I redid the fairshares on the fly though, because now that the theorists have decided to use the pheno VO, we have to reflect their 20% nominal fair-share in Maui (in fact everyone's getting 33% as CompSci and local Bio users are not yet active).

The biomed jobs are still coming in steadily - 2 or 3 a minute. Lovely jobs, actually. Run time is ~4 1/2 hours, so a good turnover rate - and they are 99.99% efficient!

Thursday, May 10, 2007

ScotGrid Review Documents Complete



The site responses for the ScotGrid T2 review have now been given to the reviewers. Inspired by Olivier I decided that some plots of CPU delivery per VO and per site would be useful.

This turned out to be surprisingly hard to do - the accounting portal only gives a summary for a time period, not a plot over the time period. So I had to download the last 12 months as individual CSV files and parse them. Of course, each file contains variable numbers of VOs and sites. As this is essentially data in 3 dimensions, i.e., cpuhours(month, vo, site) it's impossible for Excel to deal with it directly.

Time to bring Python out of the box to parse the data and print summary CSV files which Excel can handle. It took the best part of 3 hours - however, it's now done and any future work like this should be faster.
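
The script boils down to something like this (a sketch only - the file naming and the column headers are assumptions; the portal's real CSV dumps may differ):

import csv, glob

# cpuhours[(vo, site)][month], accumulated from one CSV file per month.
cpuhours = {}
months = []

for fname in sorted(glob.glob('accounting-*.csv')):    # hypothetical file naming
    month = fname[len('accounting-'):-len('.csv')]
    months.append(month)
    for row in csv.DictReader(open(fname)):
        key = (row['VO'], row['Site'])                 # assumed column headers
        cpuhours.setdefault(key, {})[month] = float(row['CPU hours'])

# One summary row per (VO, site) pair with a column per month - that Excel can plot.
out = csv.writer(open('summary.csv', 'wb'))
out.writerow(['VO', 'Site'] + months)
for (vo, site), per_month in sorted(cpuhours.items()):
    out.writerow([vo, site] + [per_month.get(m, 0.0) for m in months])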

Plots shown above, just so they get a wider audience.

Glasgow disk benchmarks

I repeated the Edinburgh benchmarks testing at Glasgow. Results can be found here:

http://www.ph.ed.ac.uk/~gcowan1/glasgow-disk-benchmarks-07-05.ps

The behaviour for random reads and writes is very similar to that observed for the Edinburgh disk, although the absolute rates at Glasgow are higher. The sequential behaviour is somewhat different: a single thread gives a substantial performance boost for both reads and writes. The filesystem was ext2 with default options; dumpe2fs reported the large_file feature.

Edinburgh and ATLAS FCR

Edinburgh got bumped by ATLAS for failing JS for quite a while, but as their 5 job slots were tied up with Zeus jobs the ATLAS SAM tests couldn't run, so they couldn't get back into the ATLAS BDII. Of course this meant that Steve Lloyd's jobs were aborting (no matching resources).

So, Sam has allowed the atlassgm user access to the reserved ops/dteam job slot. This will be fine as long as nothing happens to stick that job in the queue.

However, it's all getting a bit silly - more and more tweaks in the batch system to rush through the test jobs.

Of course, to some extent it's an Edinburgh problem - they are so strapped for CPU that getting anything to run through in a guaranteed time is hard.

Wednesday, May 09, 2007

I <heart> mod_include

Ah, the joys of Apache. More modules than you can shake a stick at. Trying to debug a really annoying masonic error message of "[error] Re-negotiation handshake failed: Not accepted by client!?" (SSL not working fully), I decided to set up a few "echo var" statements in the page to see what was defined.

Lo, one quick
<h1>Danger, Will Robinson! <!--#if expr="$SSL_CLIENT_S_DN_CN" --><!--#echo var="SSL_CLIENT_S_DN_CN" --> seen approaching!<!--#else -->This is <b>svr031</b><!--#endif --></h1>

and you end up with either

Danger, Will Robinson! This is svr031
or
Danger, Will Robinson! andrew elwell seen approaching!
:-)

Saturday, May 05, 2007

Edinburgh disk benchmarks

I ran some Edinburgh disk benchmarks over the past couple of days in order to obtain a comparison of the performance of the IBM disk to the university SAN. It's not an entirely fair comparison, as the IBM disk uses fibre channel to talk to pool1 (a beefy server with 16GB of RAM) while the SAN uses FC to talk to pool2 (2GB of RAM). Both systems are configured to use RAID5, but they use a different number of disks in each array. You can find the results here:

http://www.ph.ed.ac.uk/~gcowan1/edinburgh-disk-benchmark-07-05.ps

I used tiobench (threaded IO) to perform the testing. It is quite clear that there is a limit of ~100MB/s on sequential operations with the IBM disk, while for the SAN it appears to be ~50MB/s. I need to check if this is a limit of the FC connection, as I had thought we should see better performance. As expected, the rate for random IO increases as the block size increases. There does not appear to be any significant difference between the two sets of disks when looking at the random metrics.

I'll post some results from Glasgow soon.

Friday, May 04, 2007

I love my BDII ;-)

Glasgow survived the RAL BDII outage of last night, as our trusty svr019, which is our top level BDII, was not affected.

You can see all the red splurged across Steve's analysis test, with only a few sites unaffected.

Of course, the BDII is always a single point of failure - and one day ours will go wrong - but at the moment, for Glasgow, it's certainly the right choice.

Mark has switched Durham to use bdii.scotgrid.ac.uk and survived the mid-morning wobble.

Until we know for sure that RAL have sorted this I have encouraged Edinburgh to switch as well.

Scalpel, Mr Elwell please...

We started the re-install of svr031 this morning shortly before 11. So far the patient is doing well, coming around after surgery before 1pm.

cfengine is running again. ganglia is installed and Andrew's hacking in the RAM disk for the RRDs.

So far, so good...

Thursday, May 03, 2007

Total Checks Out

The problem with VOMS was that the DN of the gridpp VOMS server was wrong in the setup.

Once I realised that, it all started to work.

Results were that it all works! lcg-cr and lcg-del function perfectly, so it seems the problem is with Hannah's UI.

Graeme Becomes an Oil Man

To help Hannah at Total with LFC issues I have joined the totalep VO.

Unfortunately things then fell over at the first hurdle - voms-proxy-init fails for this VO:

ppepc62:~$ vpi -voms totalep
Enter GRID pass phrase:
Your identity: /C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=graeme stewart
Creating temporary proxy .................................... Done
Contacting voms.gridpp.ac.uk:15026 [/C=UK/O=eScience/OU=Manchester/CN=voms.gridpp.ac.uk/Email=hostmaster@hep.man.ac.uk] "totalep" Failed

Error: Could not establish authenticated connection with the server.
GSS Major Status: Unexpected Gatekeeper or Service Name
GSS Minor Status Error Chain:

an unknown error occurred

None of the contacted servers for totalep were capable of returning a valid AC for the user.

I have put a ticket in (Footprints ISSUE=1207 PROJ=14), but no news yet.

Tuesday, May 01, 2007

HEPiX Summary

This week I 'ave mostly been attending Spring HEPiX 2007

I've summarised below some of the things I picked up on from the talks - Both slides and the video streams are available in the DESY Indico system.

Day 1 - Site Reports
LAL
WNs running SL4.4 64-bit with the ability to "fix" a VO to SL3 nodes
Evaluating LUSTRE
Lemon/Quattor for monitoring
Sun Thumper good.

LAPP (French T3)
Use GPFS/Woodcrest Blades (HP)
Cacti / Nagios / Ganglia
Developing own accounting tools for parsing Torque / Maui logs

PSI.ch
FreeNX desktops
SL5 Live CD
v12n - Use VMware / Eval Xen.
Use Lustre (on Cray) and GPFS (Linux clusters)

CASPUR
Mix of systems - IBM P5 / 2*Opteron clusters (Infiniband / Qsnet) / NEC SX6 / HP EV7.
GPFS / AFS / NFS
Xen / VMware

BNL
2*8500 robots using HPSS
Linux farm is big (>4700 CPUs, 1.3PB Local disk)
Nagios / Ganglia / Cacti
Temp Monitors distributed in datacentre - alert to RT Tickets.
Use both RT (helpdesk) and AT (assets) - also linked to OSG Footprints

GSI
OTRS Ticketing
Debian
1U boxes with 16 cores/32G RAM / 4 SATA Hot Swap disks for 8K Euro

RAL
2nd 8500 silo, perhaps with tape-passing interface
CASTOR problems
Disk problems fixed
Network upgraded

GridKa
Procurement via benchmark
How to connect 10GE to storage.

PDSF
HPSS
New Building
Blades

ScotGrid
Hey, We're great.

TRIUMF
Using BGP within the ATLAS network!
In-Row cooling
Dirvish / Amanda backups
Funky Videoconference room

INFN-T1
GPFS
RedEye (DIY) Monitoring System
Disk0Tape1 is CASTOR-2, D1T0 - GPFS / StoRM, D1T1 probably CASTOR-2
Nagios / MRTG / ntop

IN2P3
8500 robot
HPSS

CERN
Magnet issues
Firewall changes (no UDP)
Procurement via benchmark
Clampdown on Skype / p2p
Silent Data corruption
SLC3 support. SLC5 tests.

DESY
New Machine room
Nagios
8500 robot
Request Tracker / Zope

SLAC
Sun BlackBox, Thumpers, 8500


Phew. I'll write up the other days ASAP.