Friday, June 27, 2008

Attack of the Clones

So, I finally got round to booting the other three new servers we'd recently purchased. These are nice, simple dual-core, dual-CPU boxes (Dual-Core AMD Opteron(tm) Processor 2216) on Tyan Thunder n3600M (S2932) motherboards.

I'd always been exceptionally suspicious of the PXE MAC address that the first one offered (60:50:40:30:20:10) rather than the one imprinted on the RJ45 socket (starting with 00:E0:81), but the fact that three out of the four machines all had the same MAC address meant a phone call to the vendor was in order.

The one that works has the 2.01 BIOS; the others shipped with 2.02, and downgrading didn't cure the amnesia. I can't see a utility to reburn the MAC address onto the machine, so I've left that in the capable hands of the vendor for now. Googling suggests that NVIDIA cards are somewhat prone to weirdness like this. Bother.
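For future reference, a quick sanity check along these lines (a rough Perl sketch; the interface name and the 00:E0:81 vendor prefix are just what applies to our boxes) would let us flag a cloned MAC at install time, before a box gets anywhere near production:

    #!/usr/bin/perl
    # Rough sanity check: warn if the interface's MAC is the bogus cloned one
    # (60:50:40:30:20:10) or doesn't carry the expected 00:E0:81 vendor prefix.
    use strict;
    use warnings;

    my $iface = shift || 'eth0';
    my $sysfs = "/sys/class/net/$iface/address";

    open my $fh, '<', $sysfs or die "Cannot read $sysfs: $!\n";
    chomp(my $mac = <$fh>);
    close $fh;

    if (lc($mac) eq '60:50:40:30:20:10') {
        print "WARNING: $iface has the bogus cloned MAC ($mac)\n";
    } elsif (lc($mac) !~ /^00:e0:81/) {
        print "WARNING: $iface MAC $mac lacks the expected 00:e0:81 prefix\n";
    } else {
        print "OK: $iface MAC is $mac\n";
    }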

Tuesday, June 24, 2008

Powercut at Durham

Ok, first blog post here...

On Sunday, our machine room's UPS caused a brief power failure, which unfortunately tripped some breakers, so we had to call out the electricians before we could start restoring service.

The UPS will take some time to repair, so Durham will be at risk for a while.

On the plus side, the changes involved with the SE rebuild have been proven to survive a reboot!

EDIT:
Just realised that I forgot to say that the site is back online, and has been since Sunday evening; it is just currently at the mercy of the power company.

Monday, June 23, 2008

ECDF - More VOs

Sam reports that:

"biomed, pheno, vo.scotgrid.ac.uk and vo.gridpp.ac.uk VOs have been enabled on the SL3 CE (ce.glite), the SRM (for storage), and the monitoring/accounting box (mon.glite)."

This seems to have produced a glitch in accounting record publication on the MON box, but a fix is in the pipeline.

ECDF SAM Tests Fixed

Sam managed to fix SAM last week. It turned out that the two CEs were interfering with each other in the samsgm account area and producing corrupted GASS caches. It's still a bit of a mystery as to why this happened - shared *sgm accounts have been normal (if not best) practice for ages. However, it's being worked around and the site is up and stable again.

Durham SE Issues

Durham suffered a complete SE failure last week. A RAID card failure took down the old SE, gallows, and then an LVM metadata corruption took out the new disk server on se01.

The list of lost ATLAS files has been reported (https://savannah.cern.ch/bugs/?38037) and we're waiting for the catalog to be cleaned up to restart production here (well, when there are any jobs to run).

We took the opportunity to retire gallows, so se01 is now the sole SE at Durham. It should suffice for ATLAS production, where we only need a few TB of cache anyway.

In the meantime there was a power outage in the Durham machine room over the weekend. David had to get the university to reset some breakers but things seem to be running well now.

Saturday, June 14, 2008

Publish and be damned...

It all went pear-shaped yesterday as information publishing fell over on the SE. It seems that when I quickly "fixed" the DPM information provider script to get the correct hostname, I forgot to chomp() the output from "hostname -f", so the hostname variable had a trailing newline which corrupted the information system. The BDII logs started throwing errors like "First line of LDIF entry does not begin with 'dn:' at /opt/glite/libexec/glite-info-generic line 17".
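For the record, the whole sorry episode boils down to one missing chomp(). A minimal sketch of the pattern and the fix (the LDIF lines below are illustrative, not copied from the actual DPM provider):

    #!/usr/bin/perl
    # Minimal illustration of the bug: backticks hand back the command output
    # complete with its trailing newline, so it must be chomp()ed before being
    # interpolated into LDIF; otherwise the dn: line gets split in two and the
    # BDII throws the "First line of LDIF entry does not begin with 'dn:'" error.
    use strict;
    use warnings;

    my $hostname = `hostname -f`;   # comes back as "some.host.name\n"
    chomp $hostname;                # <-- the line I forgot

    # Illustrative LDIF, not the exact output of the real DPM provider.
    print "dn: GlueSEUniqueID=$hostname,mds-vo-name=resource,o=grid\n";
    print "GlueSEUniqueID: $hostname\n";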

Unfortunately the BDII then considered the whole of the SE information package corrupt (rather than just that provider's output) and our SE promptly disappeared from the information system, with the attendant RM test failures.

This situation then persisted for most of the day until Andrew noticed it "by eye". So we had another failure - Nagios didn't send an alarm properly when we started to fail. If that had happened it would have been fixed within an hour, but instead we were failing for 8 hours.
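The obvious fix for the monitoring gap is a Nagios-style probe that simply asks the BDII whether the SE is still being published. A rough sketch (the BDII host, port and hostnames are placeholders for whatever we actually point it at):

    #!/usr/bin/perl
    # Nagios-style probe (sketch): go CRITICAL if the SE has vanished from the BDII.
    # The BDII host/port and the SE hostname below are placeholders.
    use strict;
    use warnings;

    my $bdii = 'site-bdii.example.org:2170';
    my $se   = 'se.example.org';

    my @out = `ldapsearch -x -LLL -H ldap://$bdii -b o=grid "(GlueSEUniqueID=$se)" GlueSEUniqueID 2>/dev/null`;

    if (grep { /^GlueSEUniqueID:\s*\Q$se\E/i } @out) {
        print "OK: $se is published by $bdii\n";
        exit 0;   # Nagios OK
    }
    print "CRITICAL: $se not found in $bdii\n";
    exit 2;       # Nagios CRITICAL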

From the dizzy heights of SAM perfection we fell to 98% for the month, 95% for the week. It wasn't quite hubris, but it was ironic that I was blogging about Glasgow's reliability at the very moment we were broken.

For the moment I have removed the info provider for tokens, and I will put it back, more carefully, on Tuesday.

Friday, June 13, 2008

Problems up north

We have two major problems in ScotGrid right now:

ECDF: has been failing SAM tests for over a week now. The symptom is that the SAM test is submitted successfully and runs correctly on the worker node, but the job outputs never seem to get back to the WMS, so eventually the job is timed out as a JS failure. As usual we cannot reproduce the problem with dteam or ATLAS jobs (in fact ATLAS Condor jobs are running fine), so we are hugely puzzled. Launching a manual SAM test through the CIC portal doesn't help because the test gets into the same state and hangs for 6 hours - so you cannot submit another one. Sam has asked for more network ports to be opened to allow a larger Globus port range, but the network people in Edinburgh seem to be really slow in doing this (and it seems it is not the root cause anyway).

Durham: has suffered a serious pair of problems on its two SE hosts. The RAID filesystem on the headnode (gallows) was lost last week and all the data is gone. Then this week the large se01 disk server suffered an LVM problem and we can no longer mount grid home areas or access data on the SRM. Unfortunately Phil is on holiday, David is now off sick and I will be away on Monday - hopefully we can cobble something together to get the site running on Tuesday.
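When someone is back on site, the first steps on se01 will be strictly read-only triage along these lines (the volume group name is a placeholder; nothing here writes to the disks):

    #!/usr/bin/perl
    # Read-only LVM triage sketch for a metadata problem like the one on se01.
    # "vg_data" is a placeholder volume group name, not necessarily what se01 uses.
    use strict;
    use warnings;

    # Do the physical volumes and volume groups still show up at all?
    system('pvscan');
    system('vgscan');

    # List the metadata backups LVM keeps under /etc/lvm/archive; if a good one
    # predates the corruption, vgcfgrestore can (very cautiously) roll back to it.
    system('vgcfgrestore', '--list', 'vg_data');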

Thankfully, dear old Glasgow T2 is running like a charm right now (minor info publishing and WMS problems aside). In fact our SAM status for the last month is 100%, head to head with the T1! Fingers crossed we keep it up.

Much improved CE



Our bugbear in the past was always the lcg-CE, a service that was easy to overload to the point of site meltdown (we have a lot of examples collected under http://scotgrid.blogspot.com/search/label/CE).

A few months ago a new daemon, the globus cache marshal, was introduced, promising to substantially reduce the load on the old CE.

Recently we have had a few job spikes from local ATLAS and pheno users, and I'm very happy to say that the CE seems much healthier than in the past. With more than 1.5k jobs running and queued, the load on the CE was modest and the CPU usage was < 20%.

This is a huge improvement over past performance and has removed a major source of site instability.

Who's crying now?


Our poor WMS was killed at the end of last week when a pheno user submitted about 20k jobs to it. Worse, they hadn't used a proxy renewal service, so the VOMS extension on their proxy expired and the jobs they had submitted suffered shallow failures, prompting further resubmission attempts and further load.
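One cheap defence we'll suggest to users is checking the remaining lifetime of the VOMS attributes before (re)submitting anything. A rough sketch using voms-proxy-info (the one-hour threshold is arbitrary):

    #!/usr/bin/perl
    # Sketch: refuse to submit if the proxy or its VOMS attribute certificate
    # has less than an hour of lifetime left. The threshold is arbitrary.
    use strict;
    use warnings;

    my $threshold = 3600;   # seconds

    chomp(my $proxy_left = `voms-proxy-info -timeleft 2>/dev/null` || 0);
    chomp(my $ac_left    = `voms-proxy-info -actimeleft 2>/dev/null` || 0);

    if ($proxy_left < $threshold || $ac_left < $threshold) {
        die "Proxy ($proxy_left s) or VOMS AC ($ac_left s) about to expire - renew it first\n";
    }
    print "Proxy OK: $proxy_left s left, VOMS AC: $ac_left s left\n";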

We have contacted the user, but they were having a great deal of trouble even cancelling the jobs. We've left things for 2 days now, but the situation is really not improving. I can't see much hope for the machine in its current state - we'll probably have to blow it away and start again next week.

For the moment we've asked our users to revert to using ye olde RB, which we were just about to switch off but hadn't actually decommissioned.

There's a serious question now about what level of WMS service we want to provide. It's a complex service and rather difficult to debug when it goes wrong. Certainly an upgrade to the SL4 version should be done, but do we want to have 2 WMS hosts and possibly even a separate LB service?

disinformation

We just got ticketed for a failing SE SAM test. Most odd, as Steve Lloyd's SAM results were all green. Re-reading the ticket, it turns out we were publishing info for svr018.beowulf.cluster rather than the external interface name. Despite this having been noted before, Graeme hacked around it and raised a Savannah ticket.

Monday, June 09, 2008

And don't do it again...

I re-enabled Heinz, our infamous RSA-cracking biomed user, on the cluster today.

Tony finally spoke to Cal and this seemed to clarify that:
  • VOs now know they have a more serious responsibility to discipline their users
  • Heinz knows he cannot run this work again under biomed
He'd been suspended for six months, which seems like an appropriate punishment.