Showing posts with label monitoring.

Tuesday, April 17, 2012

One XOS, Great Big Purple Packet Eater. Sure looks good to me.

So we haven't been blogging a great deal since December, and for good reason. We found ourselves in the exciting position of being given additional funding to enhance our network capability, and we also had additional equipment to install in the cluster.

First things first, however: as you may have read, we have had no end of issues with the older network equipment. We had a multi-vendor environment which, while adequate for 800 analysis jobs and 1200 production jobs, didn't quite cut the mustard as we couldn't expand from there.

The main reason was the 20 Gig link between the two computing rooms, which was having real capacity issues. Add in issues between the Dell and Nortel LAG and the associated backflow problems, sprinkled with a buffer memory issue on the 5510s, and you get the picture. On top of this we were running out of 10 Gig ports and so couldn't get much bigger without some investment.

Therefore, the grant award was a welcome chance to fix this issue. After going to tender we decided upon equipment from Extreme Networks. The proposed solution allowed for a vast 160 Gigabit interconnect between the rooms, broken into two resilient link bundles in the Core, and an 80 Gigabit Edge layer. In addition to this connection we also installed a 32 core OM4 grade fibre optic network for the cluster, which will carry us into the realms of 100 Gigabit connections when that becomes available and cheap enough to deploy sensibly.

We now have 40 x 40 Gigabit ports, 208 x 10 Gigabit ports and 576 x 1 Gigabit ports available for the Cluster.

 There is quick and clever and here it is

The new deployment utilises X670s in the Core and X460s at the Edge.

The magic of the new Extreme network is that it uses EAPS (so bye bye Spanning Tree, and good riddance) as well as MLAG, which allows us to load-share traffic across the two rooms, so having 10 Gigabit connections for disk servers in only one room is no longer an issue.

Then it got a bit better. Thanks to ExtremeXOS we can now write scripts to handle events within the network, which ties in with the longer-term plan for a Cluster Expert System (ARCTURUS) that we are currently designing for test deployment. More on this after August.

Finally, it even comes with its own event monitoring software, Ridgeline, which gives a GUI interface to the whole deployment.

We stripped out the old network, installed the new one and, after some initial problems with the configuration (which were fixed in a most awesome fashion by Extreme), got it up and running. What we can say is that the network isn't a problem any more, at all.

This has allowed us to start to concentrate upon other issues within the Cluster and look at the finalised deployment of the IPv6 test cluster, which has benefited in terms of hardware from the new network install. Again, more on this soon.

Right, so now to the rest of the upgrade: we have also extended our cold aisle enclosure to 12 racks, have a secondary 10 Gig link onto the Campus being installed, and have a UPS. In addition to this we refreshed our storage using Dell R510s and M1200s, as well as buying 5 Interlagos boxes to augment the worker node deployment.

 The TARDIS just keeps growing

We also invested in an experimental user access system with wi-fi and will be trying this out in the test cluster to see if a wi-fi mesh environment can support a limited number of grid jobs.  As you do.

In addition to this we improved connectivity for the research community in PPE at Glasgow and across the Campus as a whole, with part of the award being used to deliver the resilient second link and associated switching fabrics.

It hasn't been the most straightforward process, as the decommissioning and deployment work was complex and very time-consuming in our attempt to keep the cluster up and running for as long as possible and to minimise downtime.

We didn't quite manage this as well as expected, due to the configuration issues on the new network, but we have now upgraded the entire network and removed multiple older servers from the cluster, allowing us to enhance the entire batch system for the next 24-48 months.

As we continue to implement additional upgrades to the cluster we will keep you informed.
For now it is back to the computer rooms.

Thursday, August 19, 2010

Why, yes ... we were using that...

So .... remind me never to do a 'nothing much happening' post again. It looks like tempting fate results in Interesting Times.

Our cooling setup in one of the rooms is a bit quirky: it's based on a chilled water system (long story, but it was originally built for cooling a laser before we ended up with it). There have been a few blips with the water supply, so duly an engineer was dispatched to have a poke at it.
The 'poke' in this case involved switching it off until he could delve into the midst of the machine, resulting in the rather exciting peak in temperatures (these measured using the on-board thermal sensors in the worker nodes).

We were supposed to get a warning from the building systems when the chiller went offline, and again when the water supply temperature rose too high. (The air temperature lags behind the water temperature, so it's a good early warning.) As neither of those happened, our first warning was the air temperature in the room, followed by the nodes' internal sensor alarms.

The first course of action was to offline the nodes, and then find the cause of the problem. Once found, there was a short ... Explanation ... of why that was a Bad Time to switch off the chiller. We'll schedule some downtime to get it done later, at some point when we're not loaded with production jobs.

Still, little incidents like this are a good test for the procedures. Everything went pretty smoothly, from offlining nodes to stop them picking up new jobs, through to the defence in depth of multiple layers of monitoring systems.

Thankfully, we didn't need to do anything drastic (like hard powering off a rack); so we now know how long we have from a total failure of cooling until the effects kick in. Time to sit down and do some sums, to make sure we could handle a cooling failure at full load that occurs at 3am...

Update: 19/08/2010 by Mike


Never mind "sums", I took the physicist's approach a couple of years ago and got some real data:

Triangles (offset slightly along the x-axis for clarity) are the temperatures of worker nodes as reckoned by IPMI; stars are the input air temperatures to the three downflow units in room 141, and the squares are the flow/return water temperatures. I simulated a total loss of cooling by switching the chilled water pump off; all worker nodes were operating at their maximum nominal load. It took ~20 minutes for the worker node temperatures to reach 40 degrees, at which point I bottled it and restored cooling. So, for good reason, we now run a script that monitors node temperatures and has the ability to power them off once a temperature threshold is breached. Oh, and that has been tested in anger.
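
For the curious, a minimal sketch of that kind of temperature watchdog (not our actual script; the node list, the sensor parsing and the 40-degree threshold are purely illustrative) would look something like this:

#!/bin/bash
# Minimal temperature watchdog sketch: read each node's temperature over IPMI
# and shut it down if a threshold is breached. Node names, sensor parsing and
# the threshold below are illustrative.
THRESHOLD=40

for node in node001 node002; do
    # first temperature reading (degrees C) reported by the node's sensors
    temp=$(ssh "$node" ipmitool sdr type Temperature \
           | awk -F'|' '/degrees C/ {gsub(/[^0-9]/, "", $5); print $5; exit}')
    [ -z "$temp" ] && continue
    if [ "$temp" -ge "$THRESHOLD" ]; then
        echo "$(date): $node at ${temp}C - powering off" | logger -t temp-watchdog
        ssh "$node" shutdown -h now
    fi
done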

Tuesday, June 22, 2010

A baffling spot of localised cooling

How do you keep your cool in this sort of weather? Well, there are various options, but I'll bet one you've not tried is wrapping up in lots of insulating foam.

And yet, that's been just the ticket for some worker nodes up here, despite it being one of the warmer days (23 °C outside). Have a look at the temperature graph, and see if you can spot when something changed:
(The peak at midnight was due to a sneak-attack Hammercloud; it was just before 12 when I put in the insulation.)

I'd discovered that there's some empty headspace at the top of the racks. In those racks where there's a network switch at the top, this wasn't doing much, but where there were worker nodes, the top node was a lot hotter than the node two down from it. That's a much sharper change than I'd expected - it was noticeable by touching the metal cover on the front of the nodes. The theory was that hot air out of the back of the nodes was being sucked forward over the top of the highest node (through the headspace), and then recirculated round, getting hotter, until it reached a steady state about 5 K hotter than the others.

So, it was time to do something about that. The first couple of attempts at stopping up the gap didn't have much effect, until I dug out a few bits of packing foam (that the nodes were shipped in). Being, of course, the correct width, and just a bit taller than 1U, they fit snugly into the headspace.


And that foam baffle reduced the temperature, to the point that the nodes at the top of the racks are now at the lowest temperature since records began! (i.e. since they were installed.) Counterintuitive, but that's the way air/heat flow goes sometimes.

Although these worker nodes are due for replacement, we're going to be reusing the racks themselves, so little things like this are good to know. It may be that this won't be a problem with the new worker nodes - or it might be the case that it'd be worse. Either way, forewarned is forearmed (and cooler).

Friday, September 05, 2008

take that cfengine

We've had a long-running problem with cfengine at Glasgow - 2.2.3 (the latest DAG build) didn't expand out HostRange properly on the non-workernodes (i.e. where we need it most - the disksvr, gridsvr and natbox groups). Today I spent far too long battling with both 2.2.8 and the latest svn release (don't go there - it's far too fussy about the exact release of aclocal you use) and neither of them worked properly.
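
For anyone who hasn't met it, HostRange is just cfengine 2's shorthand for defining classes of numbered hosts; a hedged sketch of the kind of group definitions involved (the basenames and ranges here are illustrative, not our real config):

# cfengine 2 groups/classes section - a sketch, names and ranges illustrative
groups:
   workernodes = ( HostRange(node,"001-140") )
   disksvr     = ( HostRange(disk,"001-010") )
   gridsvr     = ( HostRange(svr,"001-031") )
   natbox      = ( HostRange(nat,"001-002") )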

I finally got a miniature testcase configuration file to work, then got *really* confused when our live config worked as a testcase file, but not via the normal incantation.

It turned out to be the fact that we'd defined

domain = ( beowulf.cluster )
in update.conf

However, setting this broke the way cfengine handles FQDNs on the dual-homed nodes (which sit in both gla.scotgrid.ac.uk and beowulf.cluster). I commented it out, leaving cfengine to guess the right thing to do, and it all seems OK.

I have since upgraded uniformly to 2.2.3 across all the SL4 x86_64 machines and tested OK.

While doing this I noticed we hadn't defined the WMS as a mysqld node so we weren't monitoring it in nagios or backing up the database. Oops. Sorted.

Wednesday, August 27, 2008

gatekeeper AWOL

Glasgow suffered a 3-4 hour CE outage this evening as the globus-gatekeeper on svr021 had gone AWOL. We suffered a few SAM test failures before I twigged that the 'connection refused' was coming from our end - 'service globus-gatekeeper restart' nobbled that, but not until we'd failed 7 SAM tests. Damn.

Monday, August 25, 2008

SAM Failures across scotgrid: Someone else's problem

All 3 ScotGrid sites have just failed the ATLAS SAM SE tests (atlas_cr, atlas_cp, atlas_del), as have quite a lot of the rest of the UKI-* sites.

Once again this isn't a Tier-2 issue but an upstream problem with the tests themselves:


ATLAS specific test launched from monb003.cern.ch
Checking if a file can be copied and registered to svr018.gla.scotgrid.ac.uk

------------------------- NEW ----------------
srm://svr018.gla.scotgrid.ac.uk/dpm/gla.scotgrid.ac.uk/home/atlas/
+ lcg-cr -v --vo atlas file:/home/samatlas/.same/SE/testFile.txt -l lfn:SE-lcg-cr-svr018.gla.scotgrid.ac.uk-1219649438 -d srm://svr018.gla.scotgrid.ac.uk/dpm/gla.scotgrid.ac.uk/home/atlas/SAM/SE-lcg-cr-svr018.gla.scotgrid.ac.uk-1219649438
Using grid catalog type: lfc
Using grid catalog : lfc0448.gridpp.rl.ac.uk
Using LFN : /grid/atlas/dq2/SAM/SE-lcg-cr-svr018.gla.scotgrid.ac.uk-1219649438
[BDII] sam-bdii.cern.ch:2170: Can't contact LDAP server
lcg_cr: Host is down
+ out_error=1
+ set +x
-------------------- Other endpoint same host -----------

Tuesday, July 15, 2008

Alert! Alert!

'twas the night before holidays, when all through the servers not a pager was stirring...


Hmm. Lulled into a false sense of security by the appearance of emails entitled [WLCG Nagios] alerting me about proxy expiry on the shared nagios system, I foolishly thought all was well. However, we weren't getting any 'real' alerts from the system to the individual sites.

It turned out to be a configuration issue in /etc/nagios/uki-scotgrid-*/contacts.cfg.

We had

service_notification_options n
host_notification_options n
meaning no notifications were sent - we changed this to

service_notification_options w,u,c,r,f
host_notification_options d,u,r,f,s

which means we get alerted on pretty much every state change - for more details see the manual.
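
For context, those options live inside the contact definition; a hedged sketch (the contact name, email address and notification command names are illustrative, following the stock sample config):

define contact{
        contact_name                    scotgrid-admin          ; illustrative
        alias                           ScotGrid Admins
        email                           admin@example.org       ; illustrative
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r,f
        host_notification_options       d,u,r,f,s
        service_notification_commands   notify-by-email
        host_notification_commands      host-notify-by-email
        }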

On a more annoying note - I left my MacBook PSU back in the UK and there's a limited number of Apple resellers here :-(

Saturday, April 12, 2008

Splunk / nagios / logrotate

Well, I upgraded to nagios3 this evening on the cluster and noticed it had a new enable_splunk_integration option in the cgi.cfg - I'd looked at Splunk before and thought 'hmm, nice idea, not sure it'll work with the grid stuff' but decided to give it a whirl.

First up - nagios gotchas. We had the DAG rpm installed, which hasn't been updated to the 3.0 let alone the 3.0.1 release, so we went for the manual compile option. I discovered that the (gd|libjpeg|libpng)-devel packages weren't installed - quickly fixed by yum.

I took the ./configure line from the spec as a guide - however it managed to splat the CGIs into /usr/sbin rather than /usr/lib64/nagios/cgi - thanks :-( I soon found them and moved them round. It seems to be working OK - I've not installed the newer WLCG monitors yet - that's the next task.

Splunk - looks flash, but is it any good? There's no sign of any educational pricing on their website and the 'free' version has one HUGE weakness - no user authorisation / login. A temporary workaround of some iptables rules reduced the risk, and I had a play. I defined /var/log on our central syslog server as a data source and watched it go.

Well, sort of... it promptly filled /opt/splunk, as it makes an indexed copy of anything it finds - I think for a real install we'd need some new space on a disk. Secondly, it quickly swallowed more than its 500M/day 'free' allowance - I grabbed a 30-day trial licence of the enterprise version and lo, it now complains that I've had 2 licence violations of over 5G/day indexed. Harumph.

I'm not sure if this would settle down once it got through the backlog of the archived logfiles - perhaps if I fed it only a syslog FIFO it'd be happier. Also we have the 'traditional' logrotate style of .1 .2 .3 etc rather than the more dirvish-friendly dateext option - we should really swap... if the RHEL logrotate supports it :-/
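
Switching is just a matter of setting the flag in /etc/logrotate.conf, assuming the installed logrotate knows about it; a hedged sketch:

# /etc/logrotate.conf (sketch) - date-stamped suffixes instead of .1 .2 .3
weekly
rotate 4
compress
dateext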

"rpm -q logrotate --changelog" doesn't mention it, although it's fixed in Fedora.

The other issue is that Splunk thrashes the box as it indexes, and it's just stopped as it's filled the disk again. Ho hum.

Thursday, March 27, 2008

p p p pick up a pakiti


We've been using pakiti at Glasgow for some time now for keeping an eye on which nodes are out of date. One minor niggle is that it doesn't keep track of the grub default kernel (i.e. what should come in on reboot) compared to the running kernel.

We already had a very simple shell script that did that:

pdsh -w node[001-140] chkkernel.sh | dshbak -c
----------------
node[001,005,007,014,016-020,022-023,025,028,031-061,063-085,087-090,092,095-096,098-101,103-104,106-107,109-110,113,115,118-120]
----------------
Running: 2.6.9-67.0.7.ELsmp, Grub: 2.6.9-67.0.7.ELsmp, Status OK
----------------
node[062,091,093-094,097,102,105,108,111-112,114,116-117,121-127,129,131,133-134,136-139]
----------------
Running: 2.6.9-55.0.9.ELsmp, Grub: 2.6.9-67.0.4.ELsmp, Status error
----------------
node[003,009,011,013,015,021,027,029]
----------------
Running: 2.6.9-55.0.12.ELsmp, Grub: 2.6.9-67.0.7.ELsmp, Status error
----------------
node[128,130,132]
----------------
Running: 2.6.9-55.0.12.ELsmp, Grub: 2.6.9-67.0.4.ELsmp, Status error
----------------
node[002,004,006,010,012,024,026,030,140]
----------------
Running: 2.6.9-55.0.9.ELsmp, Grub: 2.6.9-67.0.7.ELsmp, Status error
----------------
node[086,135]
----------------
Running: 2.6.9-67.0.4.ELsmp, Grub: 2.6.9-67.0.4.ELsmp, Status OK
----------------
node008
----------------
Running: 2.6.9-67.0.4.ELsmp, Grub: 2.6.9-67.0.7.ELsmp, Status error
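
For anyone curious, the check itself is tiny; a hedged sketch of the sort of thing chkkernel.sh does (not the exact script - it assumes a standard grub.conf with one kernel line per boot entry):

#!/bin/bash
# Compare the running kernel against the grub default entry (i.e. what would
# come back on a reboot). Assumes one kernel line per title in grub.conf.
GRUB_CONF=/boot/grub/grub.conf

RUNNING=$(uname -r)
DEFAULT=$(awk -F= '/^default/ {print $2}' $GRUB_CONF)
GRUB=$(grep '^[[:space:]]*kernel' $GRUB_CONF \
       | sed -n "$((DEFAULT + 1))p" \
       | sed 's#.*/vmlinuz-##; s/ .*//')

if [ "$RUNNING" = "$GRUB" ]; then
    echo "Running: $RUNNING, Grub: $GRUB, Status OK"
else
    echo "Running: $RUNNING, Grub: $GRUB, Status error"
fi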



But I finally got the check integrated into pakiti with some patching - see http://www.scotgrid.ac.uk/wiki/index.php/Pakiti

The result: pretty green/red status in the "default kernel" column.

The patches have been emailed to Romain, so they may well appear upstream eventually.

Monday, February 25, 2008

Dem info system blues

I fixed a problem on the CE information system tonight. YAIM had gone a little screwy and incorrectly written the lcg-info-dynamic-scheduler.conf file, so I had added the lrms_backend_cmd parameter myself as:

lrms_backend_cmd: /opt/lcg/libexec/lrmsinfo-pbs -h svr016.gla.scotgrid.ac.uk

Adding the host seemed sensible, as the CE and the batch system don't run on the same node, right? Wrong! The host parameter ends up being passed down to "qstat -f HOST", which is a broken command - we ended up with zeros everywhere for queued and running jobs and, consequently, a large stack of biomed jobs we are unlikely ever to run.

I raised the obligatory GGUS ticket: https://gus.fzk.de/pages/ticket_details.php?ticket=33313

Monday, February 04, 2008

cluster glue

hmm. Freudian? I originally typed 'cluster clue' as the title.

Regular readers will be aware that we run both ganglia and cfengine. However, even our wonderful rebuild system (YPF) doesn't quite close off all the holes in the fabric monitoring. Case in point: I reimaged a few machines and noticed that ganglia wasn't quite right. It had copied in the right gmond.conf for that group of machines, but hadn't checked that it was listed in the main gmetad.conf as a data_source.

Cue a short Perl script (soon to be available on the scotgrid wiki) to do a sanity check, but it's this sort of non-joined-up-ness of all the bits that really annoys me about clusters and distributed systems.
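
The actual script will go on the wiki; as a flavour of the check, here's a hedged shell sketch of the same idea (the paths and the cfengine file layout are assumptions):

#!/bin/bash
# Sanity check: every cluster name defined in a gmond.conf under the cfengine
# tree should also appear as a data_source in gmetad.conf. Paths illustrative.
GMETAD=/etc/gmetad.conf

for conf in /var/cfengine/masterfiles/ganglia/*/gmond.conf; do
    # pull the cluster name from 'name "Foo"' or 'name = "Foo"' style configs
    # (first 'name' line - good enough for the stock layout)
    cluster=$(awk -F'"' '/^[[:space:]]*name/ {print $2; exit}' "$conf")
    [ -z "$cluster" ] && continue
    if ! grep -q "^data_source \"$cluster\"" "$GMETAD"; then
        echo "WARNING: cluster '$cluster' ($conf) has no data_source in $GMETAD"
    fi
done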

Are there any better tools? (Is Quattor the saviour for this type of problem?)

/rant

Thursday, December 13, 2007

Grid-Monitoring Nagios tests


Finally (after a long interval) I reinvestigated getting the LCG Grid Service Monitoring Working Group nagios tests installed at Glasgow.

I had tried once before, but it needed nagios on our UI. This time I added a UI to our nagios host (nice 'n' simple - I just added the hostname into the relevant UI group in cfengine). It works fairly well - I've got it installed and polling the SAM tests via the sam-API and picking up the results. I still need to get certificate proxy renewals working, and merge the records together with our existing definitions for the hosts (we use non-qualified names, the wlcg.cfg uses FQDNs).

As the screenshot shows, we've already got a lot of green - and if we can nail the cert problems I'll switch it over to use the normal notification system.

Friday, October 05, 2007

knotty nat knowledge

Hmm, that's odd - why aren't the NAT boxes visible on ganglia?
It seemed a simple enough problem - they used to be there, but for some reason fell off the plots in late August.

Had the boxes restarted and failed to start gmond? Nope - good uptime. gmond running? Yep. Telnet to the gmond port? Yep - but all it returned was

<CLUSTER NAME="NAT Boxes" ... >
</CLUSTER>

and no HOST or METRIC lines between them. Most odd. After some discussion with Dr Millar it turned out to be a probable issue with the Linux multicast setup - the kernel wasn't choosing the same interface to listen and send on. Luckily this was addressed in a newer version of ganglia - the config file supports the mcast_if parameter to allow the interface to be set explicitly (in our case to the internal interfaces).
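
The relevant gmond.conf stanzas end up looking something like this (a hedged sketch in the newer 3.x-style syntax; the multicast group, port and interface name are illustrative):

/* pin ganglia multicast traffic to the internal interface */
udp_send_channel {
  mcast_join = 239.2.11.71
  port = 8649
  mcast_if = eth1
}

udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
  mcast_if = eth1
}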

Sadly of course the out-of-the-box RPM doesn't install on SL4 x86_64 - it has unmet dependencies (as normal....) so a quick compile on one of the worker nodes and some dirty-hackery-copying of the binary over worked a treat. We now have natbox stats again.

Tuesday, September 25, 2007

Batch System Goes on Holiday?

When I started to fiddle with the UI and RB on Saturday night, I discovered that the site was failing SAM tests with the (as usual) marvellously descriptive error "Unspecified gridmanager error".

Further investigation showed that the torque and maui servers were not running. When I restarted them the site recovered immediately. The very curious thing was, though, that torque logfile entries were still being written - so there was some part of torque running, but not enough to accept new jobs.

We need a nagios alarm on this. Paul tells me that there is a torque.available metric in the MonAMI sensor, so we should be able to passively monitor this - see the above graph, which shows the dropout on Saturday afternoon.
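
A passive nagios service for that is straightforward to define; a hedged sketch (the host and service names are illustrative, it assumes something - MonAMI or a small wrapper - submits the result into nagios as a passive check, and it assumes a check_dummy command wrapping the standard check_dummy plugin):

define service{
        use                     generic-service
        host_name               svr016                 ; illustrative
        service_description     torque_available
        active_checks_enabled   0                      ; results arrive passively
        passive_checks_enabled  1
        check_freshness         1
        freshness_threshold     600                    ; complain if nothing for 10 min
        check_command           check_dummy!2!"no passive torque result received"
        }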

Wednesday, September 12, 2007

Nagios Acknowledgements

We've been successfully using Nagios as one component of our site monitoring. The email and Jabber notification is great, but when someone acks a problem, we have to look at the webpage to see who did it and what comments they added.

One minor tweak to commands.cfg soon fixed that - thanks to the macros $SERVICEACKAUTHOR$, $HOSTACKAUTHOR$, $SERVICEACKCOMMENTS$ and $HOSTACKCOMMENTS$, described in more detail in the documentation.
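
The tweak amounts to including the ack macros in the notification commands; a hedged sketch of a service notification command along the lines of the stock sample (the exact message format is a matter of taste):

define command{
        command_name    notify-by-email
        command_line    /usr/bin/printf "%b" "Service: $SERVICEDESC$\nHost: $HOSTNAME$\nState: $SERVICESTATE$\nAcked by: $SERVICEACKAUTHOR$\nComment: $SERVICEACKCOMMENTS$\n" | /bin/mail -s "$NOTIFICATIONTYPE$: $HOSTNAME$/$SERVICEDESC$ is $SERVICESTATE$" $CONTACTEMAIL$
        }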

I'll post the snippets onto the HepSysMan wiki soon...

Thursday, July 05, 2007

LHCb Stuck Jobs

Coincidentally, just as we were drafting the stalled jobs document, we got 23 stalled LHCb jobs last Friday. These jobs had consumed about a minute of CPU and then just stopped.

I reported them to lhcb-production@cern.ch and the response from LHCb was very swift and helpful. We did quite a bit of debugging on them - although in the end we had to confess that exactly why these ones had stalled was something of a mystery. At first LHCb thought that NFS might have gone wobbly at our end, so the jobs got stuck reading the VO software. From what I could see this was unlikely, and when NIKHEF, RAL and IN2P3 reported similar problems we were off the hook.

Some useful tools for stuck jobs:
  • lsof - see what file handles are open
  • strace - what's the job doing right now
  • gdb - attach a debugger to the code
In fact, a lot of simple diagnostics also help: what's in the job's running directory, what STDOUT/STDERR has been produced so far, etc.
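
For a stuck process with a known PID, the incantations are along these lines (the PID is obviously illustrative):

# assuming the stalled job's process is PID 12345
lsof -p 12345      # open files, sockets and NFS handles
strace -p 12345    # attach and watch what system calls it's (not) making
gdb -p 12345       # attach a debugger; 'bt' gives a backtrace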

When these jobs have to be killed, it's more helpful to poke the stalled process itself - that way information gets back to the VO. A qdel will see the outputs all lost and the job resubmitted elsewhere, which is far less helpful.

In the end, whatever the bug is, it's down at the 10^-6 level!

Thanks to LHCb for being so responsive.

I also must take my hat off to Paul and his MonAMI torque plugin. His live efficiency plots for the batch system queues made spotting this very easy. In the past this sort of thing would have been noticed on a very hit or miss basis.

Friday, May 25, 2007

Of DPM, MySQL and MonAMI... (part 2)

Paul put MonAMI back onto the DPM yesterday. We saw a very similar rise in the number of MySQL connections to before, but as we were on the ball with this we were able to look at who was connecting via SHOW PROCESSLIST. It turns out that all the extra connections were from DPM itself. MonAMI was not to blame.
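
For reference, the quick checks when connections pile up are along these lines (run on the DPM head node with suitable MySQL credentials):

mysql -e "SHOW PROCESSLIST;"                       # who is connected, and doing what
mysql -e "SHOW STATUS LIKE 'Threads_connected';"   # how many connections right now
mysql -e "SHOW VARIABLES LIKE 'max_connections';"  # the configured ceiling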

Early this morning the number of connections came back down again, which might indicate that under certain circumstances, DPM starts an extra connection to the database which it then does not let go of for some time (the 24 hour slot for any SRM transaction to complete?). I wonder if this might be the cause of the rare putDone failures we saw.

Thanks to MonAMI we'll be able to watch for this, and correlate any failures with how busy MySQL was.

Paul also did some pretty RRD aggregate plots, which are very much easier to read. Thanks! Note how MonAMI is able to distinguish between atlas and atlas/Role=production, which is incredibly useful.

Thursday, May 24, 2007

Jabber Dabba Do!

OK, so in between taking the kids on an intellectual day out (the new Barnstormer's ace, BTW) I have registered a new gmail account for uki.scotgrid.glasgow and played with Net::Jabber on svr031. It took quite a few Perl dependencies to get it working, as you need the IO SSL support working for when Google switches to TLS.

The supplied nagios "notify_via_jabber" doesn't work out of the box, but I have a simple test script by Thus0 that I lifted from the web which works fine - now all I need to do is rewrite the notify script with the correct incantations from the test one.

Plan is then to have Jabber notifications for certain classes of nagios alert.

UPDATE: Success! I added a new test on the disk servers as I knew there were some non-dpm boxes amongst them. Screenshot of the Gaim/Pidgin popup below.

Monday, May 21, 2007

Of DPM, MySQL and MonAMI...

Paul has installed MonAMI onto our DPM, which has been very useful (and will become more so when we get nagios running again). However, we started to report zero storage over the weekend, which was tracked down to MySQL running out of connections (as DPM doesn't have a monitoring API we have to query the db directly, which is not ideal). When I looked in detail I (eventually) found that MonAMI had eaten all of the MySQL connections by swallowing sockets.

Paul is investigating and seems to have found at least one place where connections could leak (although he's unclear why it was triggered).

However, even stopping MonAMI at 11pm last night didn't entirely resolve the situation. At some point in the early hours MySQL seemed to again run out of connections. This caused some of the DPM threads to go mad and write as fast as they could to the disk. By 6am there was a 2.5GB DPM log file and / was full. Yikes.

This morning I had to stop all of DPM and MySQL, move the giant logfile out of the way, and then do a restart.

Paul will try the fix soon, but this time we'll keep a much closer eye on things.

I believe we should also make sure that /var/log on the servers is a separate large partition in the future. Although we have enough space in / during normal running, clearly an abnormal situation can fill things up pretty quickly - and running out of space on the root file system is not desirable!

Tuesday, April 24, 2007

Glasgow Worker Nodes Filled Up

Browsing the Glasgow ganglia plots on Saturday night I noticed a very weird situation, where the load was going over the number of job slots and an increasing amount of CPU was being consumed by the system.

It took a while to work out what was going on, but I eventually tracked it down to /tmp on certain worker nodes getting full - there was an out of control athena.log file in one ATLAS user's jobs which was reaching >50GB. Once /tmp was full it crippled the worker node and other jobs could not start properly - atlasprd jobs untarring into /tmp stalled and the system CPU went through the roof.

Recovering from this was a serious problem - it required the offending user's jobs to be cancelled, and a script to be written to clear out the /tmp space. After that the stalled jobs also had to be qdel'ed, because they could not recover.

This did work - the load comes back down under the red line and then fills back up as working jobs come in as can be seen from the ganglia plots.

This clearout was done between 2230 and 2400 on a Saturday night, which royally p***ed me off - but I knew that if I left it until Monday the whole site would be crippled.

I raised a GGUS ticket against the offending user. Naturally there wasn't a response until Monday; however, it did prove that it is possible to contact a VO user through GGUS.

Lessons to learn: we clearly need to monitor disk on the worker nodes, both /home and /tmp. The natural route to do this is through MonAMI, with trends monitored in ganglia and alarms in nagios. Of course, we need to get nagios working again on svr031 - the president's brain will be re-inserted next week! In addition, perhaps we want at least a group quota on /tmp, so that VOs can kill themselves but not other users.
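
In the meantime, even a crude sweep over the nodes would have caught this early; a hedged sketch (the node list and threshold are illustrative, and it only reports rather than killing anything):

#!/bin/bash
# Report any worker node where /tmp or /home is over the usage threshold.
THRESHOLD=90   # percent used

pdsh -w node[001-140] "df -P /tmp /home" 2>/dev/null |
while read node fs blocks used avail pct mount; do
    pct=${pct%\%}
    # skip pdsh-prefixed header lines and anything non-numeric
    case "$pct" in ''|*[!0-9]*) continue ;; esac
    if [ "$pct" -ge "$THRESHOLD" ]; then
        echo "${node%:} $mount at ${pct}% used"
    fi
done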