Friday, March 26, 2010

'EventRecords' is full

Our accounting database appears to be full.
org.glite.apel.core.ApelException: java.sql.SQLException: The table 'EventRecords' is full
Hmmm, what to do. Increase or archive?

You can see what is currently set with:

SHOW TABLE STATUS FROM accounting LIKE 'EventRecords';

and if you want to increase the limit you can use:

ALTER TABLE accounting.EventRecords MAX_ROWS=1000000000 AVG_ROW_LENGTH=338;

But surely the correct thing to do is to archive. Handily, the archival procedure is documented on the APEL wiki.
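
That procedure is the authoritative reference, but the general shape of an archive-then-prune in MySQL is roughly the following; the archive table name, the cut-off date and the EventDate column are illustrative assumptions rather than the documented APEL recipe:

CREATE TABLE EventRecordsArchive LIKE EventRecords;
INSERT INTO EventRecordsArchive SELECT * FROM EventRecords WHERE EventDate < '2009-01-01';
DELETE FROM EventRecords WHERE EventDate < '2009-01-01';
OPTIMIZE TABLE EventRecords;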

It is useful to know that the default maximum size of MyISAM tables in MySQL 4 is 4GB. Luckily, in MySQL 5 and above the table limit is much higher. I wonder if the new SL5 APEL will ship with InnoDB tables?

Thursday, March 18, 2010

Corralling jobs in Maui.

Sometimes, when testing new hardware or software in a limited way, it is important to be able to arrange lightweight, temporary partitions of a cluster for only a given user.
Now, you could repartition the cluster nodes between a "normal" partition and a "testing" partition, but for most PBS/Maui clusters (which don't have anything other than the default 'ALL' partition set up), this means changing the configuration for all the nodes rather than just the nodes we care about (and then changing it all back when you're finished).

You might also consider doing this with reservations. Indeed, the Maui manual suggests that a reservation whose user ACL is specified with an '&' prefix will force precisely the behaviour we want, locking the reservation and the user together. Empirically, however, this does not appear to work.

Instead, the solution we've found to work is (all in maui.cfg):

  1. Create a reservation for the user only.
    SRCFG[ssdnodes] PERIOD=INFINITY
    SRCFG[ssdnodes] STARTTIME=00:00:00 ENDTIME=24:00:00
    SRCFG[ssdnodes] HOSTLIST=node30[0-9]
    SRCFG[ssdnodes] USERLIST=ssp001
  2. Create a quality of service class with the property that it only runs on that reservation.
    QOSCFG[ssd] QFLAGS=USERESERVED:ssdnodes
  3. Make the user a member of that quality of service class only.
    USERCFG[ssp001] QDEF=ssd QLIST=ssd
(In this case, the configuration mutually restricts the user ssp001 and the nodes node300 to node309 to each other.)
This has the benefit that it also generalises to any number of users, as long as you add them to the reservation and the QoS class.
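
For example, adding a second tester (an illustrative ssp002 here) only needs the same stanzas extended:

    SRCFG[ssdnodes] USERLIST=ssp001,ssp002
    USERCFG[ssp002] QDEF=ssd QLIST=ssd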

Friday, March 12, 2010

SSDs - the testing begins!

This Monday (finally!) we received (half of) the SSDs we ordered for our storage testing plans.
These are the Intel X25 G2s, which are intended to represent the mid-range of SSDs currently available (the low-end drives are still due to arrive, and our high-end card is being tested differently).

Just as a sneak preview, we had a chance to run iozone against one of the X25s, in the same configuration as I've previously run against our newer disk servers (in RAID6 mode). As you can see from the graphs below, the SSDs behave exactly as we'd expect: throughput is almost identical for random and sequential reads, whilst the RAID array suffers significantly from having to seek. Indeed, although the 22 drives in the array give it much better read performance when not seeking, the single X25 seems to equal the RAID array's performance once seeking is needed...

[Graphs: iozone read throughput for the single X25 SSD versus the 22-disk RAID6 array, sequential and random workloads.]
Next thing on the list is testing them in Worker nodes against Analysis and Production workloads.
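
Going back to the quick iozone comparison above, a run along the following lines exercises the same sequential versus random behaviour; the file size, record size and target path are illustrative assumptions rather than the exact parameters used:

# write/rewrite, read/reread and random read/write tests against the SSD mount
iozone -i 0 -i 1 -i 2 -s 8g -r 1m -f /mnt/ssd/iozone.tmp -R -b x25-results.xls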

LHCb Production Failures

Over the last week we have been investigating why we are seeing around a 50% failure rate for LHCb jobs. All of them seem to fail with the same issue: being unable to copy their results back to the Tier-0 or to the subsequent fail-over Tier-1 sites. This is not strictly just a Glasgow issue; it has also affected Sheffield and Brunel, although the problem appears to have gone away at Brunel.

We have tried pretty much everything: simple lcg-ls and lcg-cp commands actually work from the worker nodes, so it's not a certificate issue. The failures are not particular to a single CE. Nothing changed at our site prior to the failures, and LHCb say nothing changed at their end. In fact they have other UK sites, such as Manchester, working fine.
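
By way of illustration, the kind of manual check that works fine from a worker node looks like this; the local file and the SURL are made-up examples, not the real LHCb destination:

lcg-cp --vo lhcb file:/tmp/upload-test srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/test/upload-test
lcg-ls --vo lhcb srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/test/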

The failures do not correspond to any particular set of worker nodes, which might otherwise have indicated a NAT issue, as we split our odd and even nodes across separate NATs. However, it does look like network contention at some point in the process, as we see either broken pipes or timeouts in the logs coming directly from Globus, for example:


2010-03-04 04:04:56 UTC dirac-jobexec.py ERROR: SRM2Storage.__putFile: Failed to put file to storage. file:/tmp/8230840/CREAM603030715/7472318/00005987_00009161_3.dst: globus_xio: System error in writev: Broken pipe
2010-03-04 04:04:56 UTC dirac-jobexec.py ERROR: globus_xio: A system call failed: Broken pipe


The only constant so far is that there appears to be a roughly 50% rate of failed uploads, which happens consistently for submissions from DIRAC.

It's certainly a puzzler and we are fast running out of ideas!

NATs Maxing Out

During our investigation of the LHCb failures we noticed that the conntrack tables on our two NAT hosts were in fact completely full, i.e. all 43200 entries were in use! Looking at /proc/net/ip_conntrack, most of the connections were UDP DNS lookups from Camont jobs. We had also never changed the default timeouts (32768 seconds for established TCP connections and 3600 for UDP), which was probably why the entries were being used up. So we have tweaked the timeouts and increased the maximum.
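
A quick way to see the state of the table on the NAT boxes (a minimal sketch, nothing here is site-specific):

# entries currently tracked versus the configured maximum
wc -l /proc/net/ip_conntrack
cat /proc/sys/net/ipv4/netfilter/ip_conntrack_max
# break the entries down by protocol, and count the DNS lookups specifically
awk '{print $1}' /proc/net/ip_conntrack | sort | uniq -c
grep -c 'dport=53' /proc/net/ip_conntrack
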
So our new NAT settings look like this:

/etc/sysctl.conf
# original values were 32768, 43200 and 3600 respectively
net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 21600
net.ipv4.netfilter.ip_conntrack_max = 65536
net.ipv4.netfilter.ip_conntrack_udp_timeout = 30
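
These are picked up on the running NAT hosts with:

sysctl -p /etc/sysctl.conf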

Now our NATs look much healthier. The only problem is that it didn't help with the LHCb production jobs failing to upload their results back to CERN. Back to the drawing board.

Monday, March 01, 2010

local users before pool users

Further to Graeme's original post, 'to voms or not to voms': the Nikhef documentation has been thoroughly overhauled, and I have now been able to switch lcmaps in CREAM and SCAS over to use local Unix account mappings before pool accounts, where they exist.

The main change is to point the localaccount module at the Glasgow-centric grid-mapfile:

localaccount = "lcmaps_localaccount.mod"
" -gridmapfile /usr/local/etc/grid-mapfile-local"
# " -gridmapfile /etc/grid-security/grid-mapfile"

Some small tweaks to the policies are then required to move localaccount from the last check to the first. If the local mapping succeeds, that account is used; otherwise evaluation falls through to the VOMS and pool account checks.

glexec_get_account:
proxycheck -> localaccount
localaccount -> good | vomslocalgroup
#proxycheck -> vomslocalgroup
vomslocalgroup -> vomspoolaccount | poolaccount
vomspoolaccount -> good | vomslocalaccount
vomslocalaccount -> good | poolaccount
poolaccount -> good #| localaccount

glexec_verify_account:
proxycheck -> localaccount
localaccount -> good | vomslocalgroup
#proxycheck -> vomslocalgroup
vomslocalgroup -> vomspoolaccount | poolaccount
vomspoolaccount -> good | vomslocalaccount
vomslocalaccount -> good | poolaccount
poolaccount -> good #| localaccount

SCAS works in the same way: all that is required is to change the localaccount setting to pull in our Glasgow local grid-mapfile, à la

localaccount = "lcmaps_localaccount.mod"
" -gridmapfile /usr/local/etc/grid-mapfile-local"
# " -gridmapfile /etc/grid-security/grid-mapfile"


Job done. I can now flit between gla and pool accounts depending on whether I appear in /usr/local/etc/grid-mapfile-local:

Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
2013.svr008 cream_441636610 ssp001 0 R q2d
2014.svr008 cream_963867097 gla057 0 Q q2d
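
For the record, this sort of check can be driven with a plain CREAM CLI submission, once with a DN that appears in the local grid-mapfile and once with one that does not; something along the lines of the following, where the CE endpoint and JDL file are illustrative assumptions:

glite-ce-job-submit -a -r svr008.gla.scotgrid.ac.uk:8443/cream-pbs-q2d hello.jdl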

VMware Web admin vs SL5.4: fight!

Recently, we've acquired some hefty servers for the purpose of running virtual machines (initially for testing and cheap dev boxes, but potentially for service hosting, depending on how well it goes). We're using VMware Server, which, although it comes with some command line tools, very much wants you to use the fancy web interface that it runs on non-standard ports.

This was fine, except that it seemed extremely flaky on all our test servers - randomly crashing, sometimes taking out a running VM with it.

It turns out that this is all the fault of our running an up-to-date version of SL: SL5.4 (actually, anything based on RHEL 5.4, one assumes) ships a version of glibc that VMware really doesn't get on with.
Once we copied the 5.3 release of libc.so.6 from a 5.3 server into a suitable place and pointed VMware's LD_LIBRARY_PATH at it, things have been much more stable.
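
For reference, the workaround looks roughly like this; the directory layout and wrapper script differ between VMware Server releases, so treat the paths below as assumptions and see the bug report for the exact recipe:

# libc.so.6 copied across from an SL5.3 machine beforehand
mkdir -p /usr/lib/vmware/lib/libc.so.6
cp /root/libc.so.6.sl53 /usr/lib/vmware/lib/libc.so.6/libc.so.6
# then, in the wrapper that starts the management service (e.g. /usr/sbin/vmware-hostd),
# point LD_LIBRARY_PATH at the older libc before anything else runs:
export LD_LIBRARY_PATH=/usr/lib/vmware/lib/libc.so.6:$LD_LIBRARY_PATH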

(The relevant bug report, including fix suggestions is:
http://bugs.centos.org/view.php?id=3884 )