Showing posts with label Outage. Show all posts

Wednesday, September 24, 2008

OpenDNS to the rescue

Glasgow, Edinburgh and Durham suffered SAM failures today due to the ScotGrid BDII going AWOL. Actually, the BDII itself was OK; the problem was the campus DNS servers taking so long to respond that the LDAP query timed out before they answered.

Cue one quick switchover to OpenDNS servers instead.

Worth scribbling on a sticky note: the two nameserver IPs are 208.67.222.222 and 208.67.220.220.

Update to the above:
OpenDNS don't return NXDOMAIN for non-existent domains, such as .beowulf.cluster. This can break your installer horribly (as we discovered at Glasgow) if it relies on NXDOMAIN lookups to work out which address is the right one.
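A quick way to spot this behaviour (a sketch, not from the original post — the test name below is made up) is to look up a name that cannot possibly exist; a hijacking resolver will hand back an address anyway:

```shell
# Look up a deliberately non-existent name. A well-behaved resolver fails
# the lookup (NXDOMAIN); a hijacking one returns its own "guide" address.
name="nonexistent-test.beowulf.cluster"
if getent hosts "$name" > /dev/null 2>&1; then
    echo "resolver returned an address for $name - NXDOMAIN is being hijacked"
else
    echo "resolver correctly reports $name as non-existent"
fi
```

An installer that keys off the lookup's success/failure, rather than the address returned, is exactly what breaks here.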

However, as we're using dnsmasq, you can get round this by flagging the 'helpful' OpenDNS guide address as bogus.

i.e. set up your /etc/dnsmasq.conf:

no-resolv
server=208.67.222.222
server=208.67.220.220
bogus-nxdomain=208.69.34.132


This then gives the expected results:

svr031:~# dig www.flarble.co.uk

; <<>> DiG 9.2.4 <<>> www.flarble.co.uk
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 10483
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.flarble.co.uk. IN A

;; Query time: 105 msec
;; SERVER: 10.141.255.254#53(10.141.255.254)
;; WHEN: Fri Oct 31 09:52:00 2008
;; MSG SIZE rcvd: 35


compared to...
svr031:~# dig www.flarble.co.uk @208.67.222.222

; <<>> DiG 9.2.4 <<>> www.flarble.co.uk @208.67.222.222
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 24219
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.flarble.co.uk. IN A

;; ANSWER SECTION:
www.flarble.co.uk. 0 IN A 208.69.34.132

;; Query time: 11 msec
;; SERVER: 208.67.222.222#53(208.67.222.222)
;; WHEN: Fri Oct 31 09:52:13 2008
;; MSG SIZE rcvd: 51

Thursday, February 21, 2008

Glasgow Downtime

As announced on GOCDB, we're taking UKI-SCOTGRID-GLASGOW down until 17:00 local time tomorrow (Friday) to bring forward the maintenance we'd planned for March. This was prompted by an unexpected CE failure last night that left the queues empty.

Thursday, October 25, 2007

... and we're back


OK - Estates and Buildings found enough spare change to last us through for another couple of hours. One worker node casualty (so far), but otherwise things seem to be OK. We'll monitor for sanity, then take us out of unscheduled downtime.

Glasgow site missing, presumed AWOL

This morning UKI-SCOTGRID-GLASGOW and other systems in the same building are offline. We're not sure of the cause yet and will investigate further once someone's on site.

UPDATE - 09:08 - Confirmed as a power cut affecting the building (see http://www.gla.ac.uk/services/it/helpdesk/#d.en.9346). No ETA yet for restoring services. EGEE broadcast sent.

Monday, October 22, 2007

Disk Servers and Power Outages

Your starter for 10TB: how long does it take to do an fsck on 10TB of disk? Answer: about 2 hours.

Which in theory was fine - in at 10am, out of downtime by 12. However, things didn't go quite according to plan.

The first problem was that disk038-41, which had been set up most recently, had the weird disk label problem, where the labels had been created with a "1" appended to them. Of course, this wouldn't have mattered, except that we'd told cfengine to control fstab based on the older servers (which have no such suffix), so the new systems couldn't find their partitions and sat awaiting root intervention. That put those servers back by about an hour.

Secondly, some of the servers were under the impression that they had not checked their disks for ~37 years (mke2fs at the start of the epoch?), so had to be coaxed into doing so by hand, which was another minor hold-up.
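For the record, tune2fs -l is where that bogus date shows up, and e2fsck -f forces the check by hand. A sketch (demoed on a small file-backed filesystem so it's runnable without a real disk; on a server you'd point at /dev/sdXN instead):

```shell
# Build a throwaway ext2 filesystem in a file to demo against.
dd if=/dev/zero of=/tmp/demo.img bs=1M count=8 2>/dev/null
mke2fs -F -q /tmp/demo.img

# "Last checked" is the field that can end up at the start of the epoch.
tune2fs -l /tmp/demo.img | grep -E 'Last checked|Mount count'

# Force a full check now, rather than waiting for a mount-count or
# interval trigger at the next (unscheduled) reboot.
e2fsck -f -p /tmp/demo.img
```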

Third, I had decided to convert all the file systems to ext3, to protect them in the case of power outages. It turns out that making a journal for a 2TB filesystem actually takes about 5 minutes - so 25 minutes for your set of 5.
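The conversion itself is just tune2fs -j, which adds the journal in place. A sketch on a file-backed filesystem (so nothing real gets touched; the 2TB partitions are where the ~5 minutes per filesystem goes):

```shell
# Throwaway ext2 filesystem in a file.
dd if=/dev/zero of=/tmp/demo.img bs=1M count=16 2>/dev/null
mke2fs -F -q /tmp/demo.img           # plain ext2, no journal

# Add the journal - this is the ext2 -> ext3 conversion step.
tune2fs -j /tmp/demo.img

# Confirm: the feature list now includes has_journal. Remember to change
# the fstab entry from ext2 to ext3 as well.
tune2fs -l /tmp/demo.img | grep 'has_journal'
```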

And, last but not least, the machines needed new kernels, so had to go through a final reboot cycle before they were ready.

The upshot was that we were 15 minutes late with the batch system (disk037 was already running ext3, fortunately), but an hour and 15 minutes late with the SRM. I almost did an EGEE broadcast, but in the end, who's listening Sunday lunchtime? It would just have been more mailbox noise for Monday, and irrelevant by that time anyway.

As the SRMs fill up, of course, disk checks will take longer, so next time I would probably allow 4 hours for fscking, if it's likely.

A few other details:

* The fstab for the disk servers now names partitions explicitly, so no dependence on disk labels.
* The BDII going down hit Durham and Edinburgh with RM failures. Ouch! We should have seen that one coming.
* All the large (now ext3) partitions have been set to check every 10 mounts or 180 days. The older servers actually had ~360 days of uptime, so if this is an annual event then doing an fsck once a year should be OK.
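The check-interval change above is a one-liner with tune2fs; here's a sketch on a file-backed filesystem (device and mount-point names are made up):

```shell
# Throwaway filesystem to demo against; a real server would use /dev/sdXN,
# named explicitly in /etc/fstab rather than via a LABEL= entry, e.g.:
#   /dev/sda3  /data  ext3  defaults  1 2
dd if=/dev/zero of=/tmp/demo.img bs=1M count=8 2>/dev/null
mke2fs -F -q /tmp/demo.img

# Check at most every 10 mounts or every 180 days, whichever comes first.
tune2fs -c 10 -i 180d /tmp/demo.img

# Confirm the new policy took.
tune2fs -l /tmp/demo.img | grep -E 'Maximum mount count|Check interval'
```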

Sunday, October 21, 2007

Glasgow power outage

Ho hum, it's that time of year for HV switchgear checking. A rolling programme of work across campus meant that the building housing the Glasgow cluster was due for an outage at ungodly o'clock in the morning. We arranged the outage in advance and booked scheduled downtime. All OK. Then, after G had taken some well-deserved hols, I discovered how dreadful the CIC portal is for sending an EGEE broadcast. I want to tell users of the site it's going down. Any chance of this in English? "RC management"? What is "RC"? It doesn't explain. Then, who should I notify? Again, no simple descriptions... Grr. Rant.

OK - the system went down cleanly enough (pdsh -a poweroff or similar) - bringing it back up? Hmm. First off, the LV switchboard needed resetting manually, so the UPS had a flat battery. Then one of the PDUs decided to restore all its sockets to 'On' without waiting to be told (so all the servers leapt into life before the disks were ready). Then the disk servers decided they needed to fsck (it'd been a year since the last one) - slooooow. Oh, and the disk labels on the system disks were screwed up (/1 and /tmp1 rather than / and /tmp, for example) - another manual workaround was needed.
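The label workaround amounts to reading the mangled label with e2label and setting it back, so that the LABEL= entries in fstab resolve again. A sketch on a file-backed filesystem (on the real machines this was /dev/sdXN):

```shell
# Simulate the broken state: a filesystem labelled "/1" instead of "/".
dd if=/dev/zero of=/tmp/demo.img bs=1M count=8 2>/dev/null
mke2fs -F -q -L "/1" /tmp/demo.img

e2label /tmp/demo.img        # shows the mangled label: /1
e2label /tmp/demo.img /      # set it back to what fstab expects
e2label /tmp/demo.img        # now shows: /
```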

Finally, we were ready to bring the worker nodes back - just on the 12:00 deadline. I left Graeme still hard at it, but there are a few things we'll need to pull out in a post mortem. I'm sure Graeme will blog some more.