Monday, September 29, 2008

First among equals...


We took delivery of a sample WN from Viglen a week or so ago. Andrew and Mike did the cluster magic to integrate it into the system, and last night I decided to open it up to some real jobs.

Results: 8/8 successful! (Snapshot from ATLAS panda.)

Unfortunately then ATLAS production dried up in the UK, but when the jobs come back, we're ready!

Wednesday, September 24, 2008

opendns to the rescue

Glasgow, Edinburgh and Durham suffered SAM failures today due to the scotgrid BDII going AWOL. Actually the BDII itself was OK, the problem was caused by the campus DNS servers taking ages to respond and the LDAP query timing out before they responded.
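
For the record, the quickest way to separate "slow DNS" from "slow BDII" is to time the same query by hostname and then by raw IP (the hostname and IP below are illustrative placeholders, not our real ones):

# by name - includes the DNS lookup
svr031:~# time ldapsearch -x -h bdii.scotgrid.ac.uk -p 2170 -b o=grid > /dev/null
# by address - skips DNS; if this is much faster, blame the resolvers
svr031:~# time ldapsearch -x -h <bdii-ip-address> -p 2170 -b o=grid > /dev/null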

Cue one quick switchover to OpenDNS servers instead.

Worth scribbling on a sticky note - the two nameserver IPs are 208.67.222.222 and 208.67.220.220.

Update to the above:
OpenDNS doesn't return NXDOMAIN for non-existent domains such as .beowulf.cluster; instead it answers with the address of its own 'guide' page. This can break your installer horribly (as we discovered at Glasgow) if you're relying on NXDOMAIN to work out which address is the right one.

However, as we're using dnsmasq, you can get round this by flagging the 'helpful' OpenDNS guide address as bogus:

i.e. set up your /etc/dnsmasq.conf like so:

# don't read /etc/resolv.conf; use only the upstream servers given here
no-resolv
server=208.67.222.222
server=208.67.220.220
# turn replies containing the OpenDNS 'guide' address back into NXDOMAIN
bogus-nxdomain=208.69.34.132
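
Then restart dnsmasq so it picks up the change (assuming the stock init script):

svr031:~# /etc/init.d/dnsmasq restart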


This then gives the expected results:

svr031:~# dig www.flarble.co.uk

; <<>> DiG 9.2.4 <<>> www.flarble.co.uk
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 10483
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.flarble.co.uk. IN A

;; Query time: 105 msec
;; SERVER: 10.141.255.254#53(10.141.255.254)
;; WHEN: Fri Oct 31 09:52:00 2008
;; MSG SIZE rcvd: 35


compared to...
svr031:~# dig www.flarble.co.uk @208.67.222.222

; <<>> DiG 9.2.4 <<>> www.flarble.co.uk @208.67.222.222
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 24219
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.flarble.co.uk. IN A

;; ANSWER SECTION:
www.flarble.co.uk. 0 IN A 208.69.34.132

;; Query time: 11 msec
;; SERVER: 208.67.222.222#53(208.67.222.222)
;; WHEN: Fri Oct 31 09:52:13 2008
;; MSG SIZE rcvd: 51

Wednesday, September 17, 2008

mmmm. Shiny!

Yesterday we took delivery of one of the new workernodes that we're purchasing for the 'Phase 2' expansion of Glasgow.

Basically - Supermicro 6015TW-T servers with dual motherboards, each with two quad-core 2.5GHz Intel CPUs. Oh, and 2GB of RAM per core and a 500GB HDD per motherboard.

Integration into the YPF installer was surprisingly painless: I generated a hundred or so new SSH keys and configs for the new boxes (still to do the cfengine ones - they're slightly fiddlier), updated the database of MAC addresses, wrote out the dnsmasq config and restarted the dnsmasq daemon. Did a 'setboot' and lo, up and running.
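
For anyone curious, the dnsmasq side of that is just a dhcp-host line per motherboard mapping MAC to name and fixed address - something like this (the MACs, names and IPs below are made up):

# /etc/dnsmasq.conf - one entry per node
dhcp-host=00:30:48:12:34:56,node201,10.141.0.201
dhcp-host=00:30:48:12:34:57,node202,10.141.0.202
# and hand PXE clients the bootloader
dhcp-boot=pxelinux.0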

Annoyingly I had to make some minor BIOS changes to these - we always want the nodes to power up after a 'power failure' (i.e. when we shoot them with the APC masterswitch), and there's no point in them asking for a PXE boot off the second NIC (it's not connected).

Oh, and the last bunch of workers had IDE disks rather than SATA, so the kickstart target changes from /dev/hda to /dev/sda.
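
In kickstart terms the disk section ends up something like this (partition sizes here are illustrative):

# SATA disks appear as sda, not hda
clearpart --all --initlabel --drives=sda
part /    --fstype ext3 --size=1 --grow --ondisk=sda
part swap --size=4096 --ondisk=sda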

So - status is that the two machines are up and configured; now to get them into Torque (which is still playing sillybuggers w.r.t. the gLite version - they package a pre-release 2.3.0 and it doesn't ship the libtorque.0 that monami needs). Oh, and diagnose -f truncates at 65k characters....
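
(A quick sanity check for the library problem, for anyone hitting the same thing:)

# monami wants libtorque.so.0; see whether the installed packages provide it
svr031:~# ldconfig -p | grep libtorque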

Friday, September 05, 2008

take that cfengine

We've had a long-running problem with cfengine at Glasgow - 2.2.3 (the latest DAG build) didn't expand out HostRange properly on the non-workernodes (i.e. where we need it most - the disksvr, gridsvr and natbox groups). Today I spent far too long battling with both 2.2.8 and the latest svn release (don't go there - it's far too fussy about the exact release of aclocal you use) and neither of them worked properly.
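
For reference, these are the sort of definitions that weren't expanding (the group names are ours, the ranges here are illustrative):

classes:
   # should expand to disk032 ... disk055 and so on
   disksvr = ( HostRange(disk,"032-055") )
   gridsvr = ( HostRange(svr,"016-031") )
   natbox  = ( HostRange(nat,"001-002") )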

I finally got a miniature test-case configuration file to work, then got *really* confused when our live config worked fine as a test-case file, but not via the normal incantation.

It turned out to be the fact that in update.conf we'd defined:

domain = ( beowulf.cluster )

However, setting this broke the way cfengine handles FQDNs on the dual-homed nodes (which sit in both gla.scotgrid.ac.uk and beowulf.cluster). Commenting it out, and leaving cfengine to guess the right thing to do, it all seems OK.

I have since upgraded uniformly to 2.2.3 across all the SL4 x86_64 machines, and everything tests OK.

While doing this I noticed we hadn't defined the WMS as a mysqld node, so we weren't monitoring it in nagios or backing up its database. Oops. Sorted.