Friday, November 27, 2009

Mysql binary logging revisited

After last time, I'd poked at the LB server's databases, so that we were getting effectively lock-free backups on one of the servers.

In the intervening period, after it was seen to be stable, I did the same for the other server.

However, Mike noted that the disk space used for the logs was growing rapidly. (I blame those pesky LHC physicists. Running jobs on our systems - anyone would think there was data to analyse or something ...). Because we're running the LB servers on the same machines as the WMS, the /var partition contains both the database files and the users' sandboxes - hence the old binary logs take space away from the users' stuff. (That's something to think about for the reinstall - it might be worth separating them).

Time to automate log trimming. Firstly, the manual side: the statement

PURGE BINARY LOGS BEFORE '2009-11-01 00:00:00';

sent to the server trims out the older logs. You can also trim up to a given file, as sketched below.
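
For example, a minimal sketch of trimming up to a named log file from the shell (the log file name here is hypothetical - pick one from the SHOW BINARY LOGS output):

mysql -u root -p -e "SHOW BINARY LOGS"                          # list the current binary log files
mysql -u root -p -e "PURGE BINARY LOGS TO 'mysql-bin.000042'"   # drop everything older than this (hypothetical) file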

Better than that, however, is to put

expire_logs_days=8

in the my.cnf. This tells mysql to expire logs older than 8 days at server startup, or whenever the logs are flushed.

So as long as we ensure that the logs are flushed whenever we take a full backup, the logs are automatically trimmed to just over a week's worth. Adding that flush to the mysqldump script, and we're done.
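
For reference, a hedged sketch of what that backup invocation might look like (paths and credentials are placeholders; --single-transaction is what gives the lock-free dump assuming InnoDB tables, and --flush-logs rotates the binary logs so the expiry kicks in):

mysqldump --all-databases --single-transaction --flush-logs \
          -u backup -p > /backup/mysql-$(date +%F).sql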

The binary logs have value independent of the backups - there's a tool to read them and look at what was happening. Whether 8 days is the best level for us is something that we'll have to monitor - the arguments for shorter time periods seem stronger than those for longer.
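
For the record, the tool in question is mysqlbinlog; a hedged example of pulling out a window of activity (the log file name and times are placeholders):

mysqlbinlog --start-datetime="2009-11-20 00:00:00" \
            --stop-datetime="2009-11-21 00:00:00" mysql-bin.000042 | less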

Wednesday, November 25, 2009

Torque 2.4.2 to the rescue

I previously blogged about our Torque 2.3.6 moms on SL5 continually seg-faulting. At first we thought it was a 32/64-bit issue between our SL4 and SL5 moms running through the same pbs_server. However, a quick test with the SL4 nodes removed proved that this was not the case. A trawl through the source proved unproductive.

Therefore, it was time to go to Plan B. To that end I built the latest Torque release, 2.4.2, and tested it on our pre-prod staging cluster. This worked well with a configuration of a 2.3.6 server and 2.4.2 moms. The next step was a test on a single node in production. This was successful and was running jobs fine when all the other moms seg-faulted again. The 2.4.2 mom survived this and continued to run, so a full roll-out is under way. We will think about upgrading the server at a later date. The only point to note is that we have to fully drain a node before doing the upgrade, which is a pain: the upgrade does attempt a job conversion, but as far as we can tell these are unsuccessful and you end up with dead jobs holding onto job slots. A rough sketch of the drain dance is below.
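
A hedged sketch of draining and upgrading a node (the node name and package file are placeholders, not our exact procedure):

pbsnodes -o node123              # mark the node offline so no new jobs land on it
qstat -rn | grep node123         # wait until no running jobs list this node
rpm -Uvh torque-mom-2.4.2*.rpm   # upgrade the mom (package name is a placeholder)
service pbs_mom restart
pbsnodes -c node123              # clear the offline flag again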

So the moral of the story is stay away from 2.3.6 and go to 2.4.2 instead.
It is pretty easy to build, but I have hosted our build here for anyone who wants it.

Tuesday, November 24, 2009

A tale of two job managers

A while back I posted about supporting SL4 and SL5 OSes through the same batch system. Our solution was to use torque submit filters to add additional node properties to the jobs as they passed through the job managers on the CEs. This, coupled with specific node properties on all nodes in the cluster, worked quite well until I noticed that CREAM jobs which should have requested SL5 were leaking onto the SL4 nodes. (A sketch of the sort of filter involved is just below.)
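
For context, a minimal sketch of the kind of submit filter involved (the ':sl5' property and the sed expression are illustrative assumptions, not our exact filter; torque runs the script named by SUBMITFILTER in torque.cfg with the job script on stdin and takes the rewritten script from stdout):

#!/bin/bash
# Append an (assumed) ':sl5' node property to any explicit nodes request;
# every other line is passed through unchanged. Note that a script with no
# "#PBS -l nodes=" line sails through untouched - which is exactly the hole
# described below.
sed -e 's/^#PBS -l nodes=\([^ ]*\)/#PBS -l nodes=\1:sl5/'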

After some investigation it appeared that when I was testing the filter and running the CREAM pbs qsub submit by hand, I was always setting the number of nodes, even if I only required 1, e.g.
 
as a pool account ....
/opt/glite/bin/pbs_submit.sh -c /bin/hostname -q q1d -n 1

This meant that there was always a #PBS -l nodes= line in the submission script. However, if you call pbs_submit.sh without the -n, no #PBS -l nodes= line appears in the final submission script at all; it relies instead on the PBS default behaviour that, if no number of nodes is specified, you get one node. This meant that my pbs filter did not catch the number of nodes and did not add the node property at all!

As it turns out, on deeper investigation into the CREAM pbs_submit.sh script, when only one node is required it relies on the PBS default behaviour and does not specify a number of nodes at all; only when there is more than one (i.e. MPI) does it specify this. This is a change from the lcg-CE job manager, which always specifies a number of nodes, be it 1 or more. Something to remember.

To get round this I have added an additional line to the CREAM pbs_submit.sh script to always default to 1 node if the job is not MPI. Not the best solution, but it's a short-lived tweak until we get rid of our SL4 support, which should be very soon.

/opt/glite/bin/pbs_submit.sh

[ ! -z "$bls_opt_mpinodes" ] || echo "#PBS -l nodes=1" >> $bls_tmp_file
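
In words: if $bls_opt_mpinodes is empty (no MPI node count was requested), append an explicit single-node request to the generated submission script, so the submit filter always has a #PBS -l nodes= line to act on.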

Thursday, November 19, 2009

CE Publishing

The Problem

Publishing an inhomogeneous site 'correctly' is not trivial. This is now required in order to pass the new gstat2 Nagios tests. Things to remember -

* Physical is sockets/CPUs and Logical is cores.
* Physical * cores-per-CPU = Logical in order to pass the new central Nagios tests (worked example below).
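
As a worked example (hypothetical homogeneous numbers): 100 worker nodes, each with 2 sockets of 4 cores, would publish Physical = 100 * 2 = 200 and Logical = 200 * 4 = 800.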

If your cluster is inhomogeneous then you need to be able to publish the sub-clusters separately, or as one, or come up with a fudged number. It is made harder because we have one batch system with multiple CEs submitting to it.

Some Solutions

* Sub-Clusters [ what we have implemented at Glasgow ]
* Publishing decimal for cores

Our implementation is discussed here.

Please let me know if anything is wrong with this and I will update.

Segfaulting PbsMoms

We have an issue with segfaulting moms that seems correlated with the server trying to ping its moms. The server version is torque-2.3.6-2cri.x86_64.
We are currently supporting two OSes through the same batch system using a submit filter and node properties. Therefore, we have two different versions of moms:
nodes 1-295 have torque-2.3.6-2cri.x86_64 moms and nodes 296-309 have torque-2.1.9-4cri.slc4.i386 moms.

When the moms segfault we see that the torque-2.1.9 moms stay up and it is only the torque-2.3.6 moms that all die. I ran one of them through gdb and can see the call stack:

(gdb) where
#0 mom_server_find_by_ip (search_ipaddr=177078032) at mom_server.c:450
#1 0x000000000041965e in mom_server_valid_message_source (stream=0) at mom_server.c:2022
#2 0x0000000000419870 in is_request (stream=0, version=1, cmdp=0x7fffff542ae8) at mom_server.c:2125
#3 0x0000000000416997 in do_rpp (stream=0) at mom_main.c:5351
#4 0x0000000000416a52 in rpp_request (fd=) at mom_main.c:5408
#5 0x00002ae8ae9f3bc8 in wait_request (waittime=, SState=0x0) at ../Libnet/net_server.c:469
#6 0x0000000000416c1d in main_loop () at mom_main.c:8046
#7 0x0000000000416ee1 in main (argc=1, argv=0x7fffff5431d8) at mom_main.c:8148
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) n
Program not restarted.
(gdb) bt full
#0 mom_server_find_by_ip (search_ipaddr=177078032) at mom_server.c:450
__v =
pms = (mom_server *) 0x6cbb80
addr =
#1 0x000000000041965e in mom_server_valid_message_source (stream=0) at mom_server.c:2022
addr = (struct sockaddr_in *) 0x187ef434
pms = (mom_server *) 0x0
id = 0x43be08 "mom_server_valid_message_source"
#2 0x0000000000419870 in is_request (stream=0, version=1, cmdp=0x7fffff542ae8) at mom_server.c:2125
command =
ret = 0
pms =
ipaddr =
id = "is_request"


So it looks like it's time to dive through the source for mom_server_find_by_ip (search_ipaddr=177078032) at mom_server.c:450 - or install torque-2.4!

Tuesday, November 17, 2009

Arc, authorisation and LCMAPS

As a gLite site, it would be ideal if we could have the same mapping between certificate DNs and unix user names as is used with our existing CEs.

Which means using the gLite LCMAPS to make decisions about what username each user has.

This is supported in Arc, but not in quite the same fashion.

The best approach appears to be: have an initial mapping listed in the grid-mapfile (there are utilities to make this easy), which allows a first pass of authorisation. Then, in the gridFTP server, the mapping rules there are applied next - this is where LCMAPS comes in.
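
For reference, a hedged example of what that first-pass grid-mapfile might contain (the DNs, account names and the leading-dot pool convention are illustrative placeholders; real entries come from the usual generation utilities):

cat /etc/grid-security/grid-mapfile
"/C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=A User" localuser01
"/C=UK/O=eScience/OU=SomeSite/L=SomeDept/CN=Another User" .dteam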

Interestingly, Arc makes it very easy to do the thing we found hard with LCMAPS - to have a small set of 'local' users with fixed permanent mappings (independent of VO), and VO-based pool accounts for other users.

However, it's in the LCMAPS integration that things get a bit stuck.

It's a silly 32/64-bit issue. On a 64-bit system, yum pulls in the 64-bit Arc - as you might expect. Sadly, there's no 64-bit version of LCMAPS in the repositories as yet.

So it's a case of hacking what I need out of etics. I'll post a recipe when I have one, but this is a pretty temporary situation - it looks like Oscar is pretty much LCAS/LCMAPS ready, but they're not a separate package yet, so we're waiting on the SCAS, CREAM or WMS SL5 64-bit packages.

Wednesday, November 11, 2009

nmon

Seeing Sam's post about NFS prompted me to mention 'nmon' - it's kind of like 'top' on steroids and does particularly useful trend plotting. Originally a hack for AIX, but ported to linux once a certain vendor realised people weren't just buying PowerPC systems....

Anyway - go grab it from http://www.ibm.com/developerworks/aix/library/au-analyze_aix/ - the linux version is now open source, I see.

NFS Load Tweaks: a Brief Guide for the Interested Enthusiast

I was asked about the mystery of NFS server tweaking in a dteam meeting, so I thought I'd compile this brief blog post.
As with all actions, there are two steps: first, gather your information, second, act on this information.

1) Determining your current NFS load statistics.

NFS logs useful information in its /proc entry...

so:

> cat /proc/net/rpc/nfsd

rc 0 28905480 1603148913
fh 133 0 0 0 0
io 3663786355 2268252
th 63 362541 16645.121 3156.556 747.974 280.920 148.129 100.155 61.480 42.249 40.829 90.461
ra 256 1069115586 4089582 3055815 2625032 2228952 2114496 1983622 1765372 1743563 1610465 89609536
net 1634942152 0 1634971040 2214677
rpc 1630024431 0 0 0 0
proc2 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
proc3 22 1573543 1535237104 8743056 1545350887 1532645717 29571823 1179900114 9214599 6691508 538717 366274 0 2801854 39816 505310 4298 2486034 62181794 53164 2414727 0 986878
proc4 2 0 0

This somewhat arcane looking output is full of variously useful
statistics about your nfs daemon.

The "rc" (read cache) field gives the fraction of cache hits, misses
and "nocache" (interactions which bypassed the cache) for read
operations.

The "fh" (file handle) field's most important entry is the first - the
number of stale file handles in the system. If you have flaky NFS, for
example, this will be non-zero.

The io field is simply cumulative io (bytes read, then bytes written).

The "th" (threads) field is the most interesting field for NFS load
optimisation. The first entry is the total number of threads currently
executing. The second is the number of seconds (?) all threads were in use
(which means your NFS was maxed out in active connections). The
remaining 10 entries are a histogram of NFS thread utilisation, in
seconds (it seems to be hard to get NFS to reset this; restarting the
daemon definitely doesn't). Plotting this gives you an idea of how
much time your NFS server spends in various load states.
Ideally, you want the last entry (90-100% use) to be comfortably in
the tail of your distribution...
If you have indications that your server spends a lot of its time with
all threads in use, you should increase the maximum number of threads
- powers of 2 are recommended.
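
A hedged one-liner for watching that histogram evolve (the interval is arbitrary):

watch -n 60 "grep '^th' /proc/net/rpc/nfsd"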

The "ra" (read-ahead cache) field gives similar results, but for the
read-ahead cache. The first number is the size of the cache, the next
10 are a histogram showing how far into the cache entries were found
(so, the first number is the number of times an entry was read from
the first 10% of the cache), and the last is for cache misses.
Obviously, if you're getting a lot of cache misses *and* your cache
hits histogram is heavily right-skewed, it's worth increasing the
cache size. (Conversely, if you have a heavily left-skewed histogram,
and few cache misses, you may be able to manage with a smaller cache.)

The remaining fields are rpc process info fields, which are less
relevant for our purposes.

2) Optimising your NFS.

The most important thing to ensure is that there are enough
resources for the peak load on your NFS service. NFS will spawn new
threads to handle new active connections, and if its max-threads limit
is too low, you'll get brown-outs under high load.
Starting at least four instances of nfsd per processor (and, on modern
processors, up to 8 per core) is recommended as a sensible
configuration. You can set this on the command line for the nfsd
service by simply passing the thread count as an option; see the sketch below.
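
A hedged sketch of doing that on an SL-style box (the count of 32 is just an example):

rpc.nfsd 32    # bump the number of nfsd threads on the fly
# ...and, on SL/RHEL-style boxes, set RPCNFSDCOUNT=32 in /etc/sysconfig/nfs to make it persistent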

And, of course, if you can bear the risk of data-loss (or silent data
corruption!) on sudden server loss, setting the export option "async"
trivially increases your network throughput by removing the need for
confirmation and syncing of writes between clients and server.
See the NFS config faq at:
http://nfs.sourceforge.net/#section_b
for more details.
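
For illustration, a hedged /etc/exports line using async (the path and client network are placeholders), followed by re-exporting from the shell:

/export/scratch  10.0.0.0/16(rw,async,no_subtree_check)
exportfs -ra    # re-read /etc/exports and apply the change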

You may also wish to do the standard tuning of packet sizes with
respect to MTU that you would normally do for a network-based
protocol. The general process (and some more details) is covered at:
http://nfs.sourceforge.net/nfs-howto/ar01s05.html

Friday, November 06, 2009

Arc, and the installation

We've been fiddling with the NorduGrid Arc middleware a bit. Not just out of random curiosity, but more trying to get a handle on the workloads that it suits better than gLite, and vice versa. It does a number of things differently, and by running an Arc CE in parallel with an lcg-CE and CREAM, we can do some solid comparisons. Oh, and the name of the middleware is also much more amenable to puns, so expect a few groaners too.

So, consider this the first in a series. During this process, we expect to end up with a set of notes on how to install and run an Arc setup, for people already familiar with gLite.

Firstly, the install. We took a blank SL5 box, added the NorduGrid repos, and then

yum groupinstall "ARC Server"
yum groupinstall "ARC Client"

Well, very nearly. There's one more thing needed, which is to add the EPEL dependencies (libVOMS is the key lib):

yum install yum-conf-epel

The next step is to configure it. That's all done in /etc/arc.conf, and is the subject for later posts.

There is a need for a filesystem shared between the CE and the worker nodes, so we fired up a spare disk server for NFS.

Startup is three services, already configured in /etc/init.d: gridftp, grid-infosys and grid-manager.
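
Something along these lines (service names as listed above; the chkconfig lines assume the stock init scripts register themselves):

for svc in gridftp grid-infosys grid-manager; do
    service $svc start    # bring the service up now
    chkconfig $svc on     # and have it start at boot
done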

Ta-da! A running Arc CE.

Ok, so there's a fair bit glossed over in the configuration step. Next time, I'll talk about how I configured it to work with our existing queues - and where the expectations for Arc differ from gLite.