Monday, February 27, 2012

LSC files and emailAddress redux

This post involves a very complicated journey to get to a simple place.

The fundamental problem is around the catchy titled OID 1.2.840.113549.1.9.1

No, wait, let me take a step back. On the Grid, we use certificates for authentication. An X509 certificate is, as with most certificates, a signed set of assertions, and a public key. As with the rest of the X500 standards, its native language is something called ASN.1 (Abstract Syntax Notation 1) (aka X208, and the later revision X680), held in files encoded by the DER (Distinguished Encoding Rules).

The fundamental takeaway from that tech-dump is that X509 certificates are not in plain text, and there are multiple standards required in order to understand their contents.

So when someone says their certificate Distinguished Name is '/O=SomeUni/OU=SomeDept/L=group/CN=JohnSmith' ... that's not quite accurate. What they really mean is that their certificate DN is some set of objects that can be unambiguously matched to that ASCII text.

That works because there are universally agreed mappings between the OIDs that are actually stored and their text representations (e.g. CN is OID 2.5.4.3).

Unfortunately, the agreement breaks down a bit for the emailAddress field; with some software mapping it to Email, and others to emailAddress. By the PKCS#9 standard, one could argue that it should be emailAddress - but that doesn't help us get software working.
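
If all you need is to compare two DN strings that differ only in how this one attribute is spelt, a crude workaround is to rewrite one spelling into the other before comparing. A minimal sketch, assuming the attribute name is the only difference; normalise_dn is a made-up helper for illustration, not part of any grid middleware:

```sh
# normalise_dn DN: print DN with the "Email" attribute name rewritten
# to "emailAddress". Hypothetical helper; only the attribute name is
# touched, the value is left alone.
normalise_dn() {
    printf '%s\n' "$1" | sed 's#/Email=#/emailAddress=#g'
}
```

Two DNs can then be compared with something like [ "$(normalise_dn "$a")" = "$(normalise_dn "$b")" ]. That only helps in your own scripts, of course; it does nothing for software that does its own matching.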

Fortunately, all of this is not a problem unless we want to store certificate DNs in ASCII, _and_ want to have email addresses in the DN.

Yeah, you can see where this is going, can't you?

In the UK, Jens has been working to allow us to not have them in DNs. However, in the short term, they are present.

One particular case where ASCII representations of the DN are used is in LSC files - which are used to authenticate VOMS servers. What happens is if the VOMS server DN matches the DN in the LSC file, and the cert was signed by the CA DN in the LSC file, _and_ the certificate chain is signed by a trusted root, then it's valid. This process means that we don't need to distribute lots of VOMS server certs, just the root CAs, and a small note (that shouldn't change over renewals) of the server DN.

I've been tidying up our ARC install here, and during the process managed to break things. Not unusual for me (one of the reasons I avoid tidying at all costs!), but this one was quirky. I'd put the vomsdir under CFEngine control, so that it was sync'd with all the other servers, and suddenly it stopped accepting the scotgrid VO.

Root cause, as if you can't guess by now, LSC file, and the emailAddress. Looks like the gLite stack expects it one way, and ARC the other. Of course, by the time you read this, that's probably been fixed somewhere, but not in the version we had installed.

It turns out that there's one trick in LSC files that saves this case. Let me put the LSC file in here:

/C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=svr029.gla.scotgrid.ac.uk/Email=grid-certificate@physics.gla.ac.uk
/C=UK/O=eScienceCA/OU=Authority/CN=UK e-Science CA
------ NEXT CHAIN ------
/C=UK/O=eScience/OU=Glasgow/L=Compserv/CN=svr029.gla.scotgrid.ac.uk/emailAddress=grid-certificate@physics.gla.ac.uk
/C=UK/O=eScienceCA/OU=Authority/CN=UK e-Science CA



The 'NEXT CHAIN' line lets one put multiple entries in the file. However, it appears that ARC isn't reading multiple entries, only the first one. So, in this case, I put the ARC friendly one first, so it matches fine - and the gLite stack tries again, finds the second, and thus succeeds.
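
To make the matching rule concrete, here is a rough sketch of "does any chain in this LSC file match this server DN and CA DN". It is my own illustration of the behaviour described above, not how the gLite or ARC code actually does it; lsc_matches is a made-up name, and it assumes each chain is exactly a server DN line followed by a CA DN line:

```sh
# lsc_matches FILE SERVER_DN CA_DN
# Succeeds if any chain in FILE has SERVER_DN as its first line and
# CA_DN as its second. Illustration only; the real middleware does its
# own parsing and also validates the certificate chain itself.
lsc_matches() {
    SERVER_DN="$2" CA_DN="$3" awk '
        /^-+ NEXT CHAIN -+$/ { n = 0; next }   # separator starts a new chain
        /^[[:space:]]*$/     { next }          # skip blank lines
        { n++ }
        n == 1 { first_ok = ($0 == ENVIRON["SERVER_DN"]) }
        n == 2 && first_ok && $0 == ENVIRON["CA_DN"] { found = 1 }
        END { exit(found ? 0 : 1) }
    ' "$1"
}
```

A reader that only looks at the first chain, which is what ARC appeared to be doing, would behave as if everything after the first separator line were missing.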

Important notes: I can't find anyone else with a field report of NEXT CHAIN working in the gLite stack. This is such a field report. It doesn't appear to work with ARC.

Wednesday, December 21, 2011

Batch system juggling

We've been a bit quiet up here recently. This is normally a sign of either nothing interesting happening, or entirely too many interesting things happening. Opinions on that may divide, but I think it's closer to the latter...

One of the recent bits of fun that occurred was with our batch server. This story actually starts a long time ago; about this time last year. At that point, we started to get intermittent memory errors from the Torque server - corrected by ECC - but that's generally a sign that the RAM's about to fail. Given that the batch server is a single point of failure for a site, that's not a good thing.

So I spent some time preparing a spare box, and being ready to move the batch system over, in case it failed over the winter break. Which, after all that prep, it didn't, and the errors stopped. On the expectation that the current hardware was nearing end of life, we ordered a new box early this year, and have had it sitting in a machine room for a while.

Unfortunately we didn't get time to have it running a tested batch system until our power supply started to ... well, insert colourful metaphor here, describing the 8 months where we were affected by lack of power.

Power got to stable supply in September, and so to catch up on things. One of the things we got around to was software versions. Whilst we didn't intend to update the Torque version, and managed to avoid it for a bit, the gLite developers eventually managed to sneak the update past us as part of an ordinary gLite update. Strictly, this didn't affect the batch server, just all the CEs, making them incompatible with the previous version of Torque.

Whilst a clever manoeuvre, reminiscent of Odysseus' Pony, it did leave us with a conundrum of either reverting the gLite update, or running forward with it. Neither was an option of good character, but running forward did have some actual documentation; hence it was full speed ahead.

Which worked out well enough. The Torque 2.5.7 packages were set to use Munge, so getting that installed and tested as a first step helped it go smoothly. To preserve compatibility in file locations, we used /etc/sysconfig/pbs_mom to put the pbs working directories in the same place as previously - meaning we didn't have to reconfigure any other tools.

What didn't go so smoothly was the memory leak in the server.

Which gave it a runtime of around 36 hours between crashes. Actually, not even crashes - we found that the pbs_server process hit either


12/05/2011 10:19:12;0080;PBS_Server;Req;req_reject;Reject reply code=15012(PBS_Server System error: No child processes MSG=could not unmunge credentials), aux=0, type=AlternateUserAuthentication, from tomcat@svr021.gla.scotgrid.ac.uk

or

10/29/2011 18:11:24;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Cannot allocate memory (12) in send_job, fork failed


and then sat around moaning. Had it crashed hard, then the auto-restart would have caught it. Ho, hum, one for the Fast Fail philosophy there.
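
One way to nudge it towards failing fast would be to watch the server log for those two signatures and treat either as fatal. A rough sketch, not something we actually ran; server_wedged is a made-up name, and looking only at the last 200 lines is an arbitrary choice:

```sh
# server_wedged LOGFILE [LINES]
# Succeeds if either failure signature appears in the last LINES
# (default 200) lines of LOGFILE. Hypothetical helper for illustration.
server_wedged() {
    tail -n "${2:-200}" "$1" | grep -qF \
        -e 'could not unmunge credentials' \
        -e 'Cannot allocate memory (12) in send_job, fork failed'
}
```

Run from cron, a wrapper could stop the server process whenever this succeeds, turning the moaning into the hard failure that the auto-restart already knows how to deal with.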


By this point, my proof reader is pointing out that I started off talking hardware, and am now talking software. Punchline is that the new server that we never got a chance to use has a lot more RAM than the old server. Therefore we wanted to move the server from the old hardware to the new, to give it a lot more RAM space. That won't fix the memory leak, but will mitigate the problem a bit.

Conventionally, this would involve draining the cluster, repositioning the CEs and then starting up everything again. Had we done that, this blog post would be over now.

Instead, we did a rolling update. This let us move things over without having to do a full drain. The biggest problem with a full drain is that, while most of the jobs finish in a shorter period of time than the limit, there are always some that take the full duration. This leaves us with an empty cluster, doing nothing, for 24 hours or so, waiting on a couple of jobs to finish.

So, instead, by moving things in small batches, we can keep most of the nodes working, and thus get more work out of things. Step zero is to disable cfengine, otherwise it tends to try and 'fix' things part way through.

Step one is to drain a CE, which we did over a weekend, and a small number of nodes, which we put offline on the Sunday morning.

Come Monday, I set up and tested basic operations with the new batch server, and then moved the freed up nodes across to it. Once those were tested (which shook out a couple of issues about versioning of some libs), point the CE at the new batch server, and then run a test job through it. (It turns out that Atlas are fast enough to sneak some pilots through a 2 minute window for a test job. However, only a few, so they actually functioned as effective tests, without compromising the site if they failed).

After that, it's time to offline another CE, and then some more nodes, and start moving nodes over when they were empty. In the end I scripted this:


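#!/bin/sh
# Move one empty worker node from the old batch server to the new one.

NODE=$1
if [ -z "${NODE}" ]
then
    echo "Usage: $0 <node>" >&2
    exit 1
fi

# Count the running jobs that list this node among their exec hosts
RUNNING=$(qstat -n -1 | grep -c -- "${NODE}")

if [ "${RUNNING}" -ne 0 ]
then
    echo "${NODE}: Still ${RUNNING} jobs going, skipping"
    exit 2
fi

# Core count as configured on the old server (spaces around the value stripped)
CORES=$(qmgr -c "print node ${NODE}" | grep "np = " | cut -d= -f2 | tr -d ' ')

FROM=svr666
TO=svr999

echo "${NODE}: Moving to ${TO} with ${CORES} cores"

# Register the node on the new server first
ssh ${TO} "~/addNode.sh ${NODE} ${CORES}"

# Switch the mom config over while pbs_mom is stopped
ssh ${NODE} "service pbs_mom stop"
scp config.mom.svr666 ${NODE}:/var/spool/pbs/mom_priv/config
ssh ${NODE} "service pbs_mom start"

# Finally, remove the node from the old server
ssh ${FROM} "~/deleteNode.sh ${NODE}"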


In theory one can run qmgr remotely, rather than ssh-ing to the batch servers and running a script. In practice, with the different versions of Torque, I couldn't get that to work. Note the automation of the mom config switch as well; and that this script checks that the node is empty.
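
The addNode.sh and deleteNode.sh helpers aren't shown here, but they need do little more than feed qmgr a couple of commands. A rough sketch of what such helpers might generate; the function names are made up, and this is a guess at the shape rather than the scripts we actually ran:

```sh
# Hypothetical stand-ins for the addNode.sh / deleteNode.sh helpers.
# Each prints the qmgr commands for one node, one command per line.
add_node_cmds() {
    printf 'create node %s\n' "$1"
    printf 'set node %s np = %s\n' "$1" "$2"
}

delete_node_cmds() {
    printf 'delete node %s\n' "$1"
}
```

On the batch server itself, each printed line would then be handed to qmgr in turn, along the lines of add_node_cmds node001 8 | while read -r cmd; do qmgr -c "$cmd"; done (node001 and 8 being example values).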

This reduced the gradual move of nodes to a matter of running the script from cron, and offlining nodes occasionally.

The net result was that we were operating at around 80% capacity for 48 hours, and it was all rather uneventful - in a good way. The final step was to update cfengine config and re-enable it.

One of the plus points of the above script is that it should be simple to adapt to two distinct batch systems; which means if we end up moving away from Torque, we should be able to do that without downtime too.

Friday, September 23, 2011

Leaving Lyon

The EGI Tech Forum is winding down, with only a few talks remaining. It's been a great meeting, with a wide range of talks on all areas of Grid Computing. Lots to think about and new ideas to try out!

Wednesday, September 21, 2011

Scotgrid goes South

Last week we attended the bi-annual GridPP Collaboration meeting.
The venue this time was CERN itself and the meeting was, as ever, incredibly useful.

We were lucky enough to have presentations from the Experiments, the LHC, EGI and the WLCG community as well as presentations from across the UK collaboration.

A full programme of the meeting is available here:

http://www.gridpp.ac.uk/gridpp27/



Above is a picture of our own Dr Crooks presenting on the Glasgow Security Model

Monday, September 19, 2011

EGI Tech Forum 2011

Bonjour Lyon!

After last week's GridPP 27 meeting in CERN, this week we are in Lyon for the 2011 EGI Tech Forum, running from Monday until Friday this week. You can follow the Forum online using some of the links here.

More later - time now to find some coffee before the first session...

Thursday, August 25, 2011

Busy Disks

After checking a test 10 gig Disk Server deployment we uncovered an interesting pattern in storage network activity and how our 10 Gig switch copes with multiple connections at 10 Gigabit. The captures below were taken over a 5 minute window of operation and show just how bursty the traffic patterns from these devices can be.

The graphs show all interfaces on our Dell 8024F and the measurements are in Mbps. The order is top to bottom with the initial capture at the top.




While the Disk servers have been hammering away, the intra-room round trip time between devices has averaged 0.40 msec, and the CPU on the core Dell seems more than happy to handle these loads: its utilisation is approximately 20% at present.

We are planning to enable QOS metrics on disk server traffic shortly to test the response times on QOS and Non-QOS disk servers.


News Flash from ScotGrid Labs

In my last post, we were investigating deployments of IPv6 on the test cluster, the first of which used SLAAC to assign addresses to hosts. Interestingly enough it worked, first time out the tin.

An IPv6 Traceroute from the web is shown below:

traceroute to 2001:630:40:ef0:230:48ff:fe5a:4b7 (2001:630:40:ef0:230:48ff:fe5a:4b7), 30 hops max, 40 byte packets
 1  2001:1af8:4200:b000::1 (2001:1af8:4200:b000::1)  1.600 ms  1.813 ms  1.882 ms
 2  2001:1af8:4100::5 (2001:1af8:4100::5)  1.320 ms  1.392 ms  1.465 ms
 3  be11.crs.evo.leaseweb.net (2001:1af8::9)  2.587 ms  2.631 ms  2.619 ms
 4  linx-gw1.ja.net (2001:7f8:4::312:1)  8.475 ms  8.466 ms  8.453 ms
 5  ae1.lond-sbr4.ja.net (2001:630:0:10::151)  78.338 ms  78.388 ms  78.376 ms
 6  2001:630:0:10::109 (2001:630:0:10::109)  9.900 ms  9.479 ms  9.446 ms
 7  so-5-0-0.warr-sbr1.ja.net (2001:630:0:10::36)  13.320 ms  13.196 ms  13.317 ms
 8  2001:630:0:10::296 (2001:630:0:10::296)  18.705 ms  18.542 ms  18.793 ms
 9  clydenet.glas-sbr1.ja.net (2001:630:0:8044::206)  18.947 ms  18.931 ms  18.948 ms
10  2001:630:42:0:3e::9a (2001:630:42:0:3e::9a)  19.434 ms !X  18.214 ms !X  17.682 ms !X
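
As an aside, the last 64 bits of that address (230:48ff:fe5a:4b7) have the ff:fe in the middle that is typical of SLAAC forming the interface identifier from the MAC address with the modified EUI-64 rule. A small sketch of that rule; mac_to_iid is a made-up helper, and the MAC used in the example is inferred backwards from the address above rather than read off the host:

```sh
# mac_to_iid MAC: print the modified EUI-64 interface identifier for a
# colon-separated MAC address. Hypothetical helper for illustration.
mac_to_iid() {
    # Split the MAC into its six hex octets ($1 .. $6).
    old_ifs=$IFS
    IFS=:
    set -- $1
    IFS=$old_ifs
    # Flip the universal/local bit (0x02) in the first octet, then
    # insert ff:fe between the third and fourth octets.
    first=$(( 0x$1 ^ 0x02 ))
    printf '%x:%x:%x:%x\n' \
        $(( (first << 8) | 0x$2 )) \
        $(( (0x$3 << 8) | 0xff )) \
        $(( 0xfe00 | 0x$4 )) \
        $(( (0x$5 << 8) | 0x$6 ))
}
```

Calling mac_to_iid 00:30:48:5a:04:b7 prints 230:48ff:fe5a:4b7, which is the tail of the address in the traceroute.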


The next phase of testing will be to enable a webserver to speak both IPv4 and IPv6 using this access mechanism, and then move on to Grid services.


I will post up a more detailed explanation of the mechanisms used for this soon.

Tuesday, August 23, 2011

Two Stacks are better than one

Leading on from the last post, we have also re-introduced a new test cluster. This infrastructure is housed within the same rack as our old worker nodes but is completely independent of the production cluster. Supporting a Dell 8024F are 5 servers and a Dell 5000 series switch which are connected via an independent 1 gigabit fibre connection to the University's network.

The purpose of this cluster is to test IPv4/IPv6 dual stack connectivity for grid Services, the testing of switch based security mechanisms and SL6 NAT testing without fear of impacting the real cluster.

The IPv6 connectivity model testing will be in multiple phases which include:

* SLAAC
* IPv6 to IPv4 tunneling
* IPv6 Routing


This framework is designed to comply with the HEPIX IPv6 Project and to look at the possible connection models required by Tier-2s to utilise IPv6. Additionally, we will be testing a wide variety of Grid enabled applications and associated systems such as Nagios to investigate potential issues within a dual stack deployment.

More on this soon.

Night of the Return of the Living Worker Nodes

As Glasgow is currently being used as one of the sets for World War Z, we thought it only apt that we too resurrect the dead and get them to do our bidding. No, we haven't embraced "mad" science.

During the power work, we decided to alter the layout of 243d. Historically, the room had housed a mainframe including operators booths. One of these booths still existed within 243d, so we took down one of the walls and added a new cabinet.

While the work was being conducted to remove the wall we covered the cluster and powered it off to minimise dust ingestion. If you wish to gift wrap a cluster we have plenty of experience in this field. However, our wrapping is limited to blue plastic presently.



After the wall had been removed, we cleared out the computer room and re-organised the storage cabinets, cabling and computing cabinets. In 243d there were a pile of 6 year old disused worker nodes and racked worker nodes whose PDU had been damaged during one of our many power cuts over the last 12 months. In addition to this we found and rebuilt a Dell Rack and also we had a spare Nortel 5510 switch.




With the newly available space from the removal of the wall in 243d, we got a tile cut and deployed the rack. The rack connects back to the older Stack01 via a copper gigabit Ethernet connection. This deployment will give us up to approximately 100 job slots once they are fully configured.