Friday, April 25, 2008

#include <documentation.h>

I quote from the DPM Developer Documentation

LFC/DPM Database schema
TO DO : describe non straight forward tables/fields....


So, with that in mind, I set about pulling out the number of SRM 2.2 requests vs the number of SRMv1 requests at the site. The expectation is that v1 should stay roughly constant (what with all the new users coming onboard) while SRM 2.2 shows a rapid increase since we enabled it. Well, it's not easy to grep from the logs, so I thought I'd poke the DB instead. First off: in dpm_db.dpm_req the r_type field, a char(1), normally holds 'g' (get?) and 'p' (put?), but we have just over 1500 rows where the type is 'B' (broken?). Hmm - all from Flavia's DN and a clienthost of lxdev25.
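
Roughly the sort of query involved - only a sketch, and the column names beyond dpm_db.dpm_req and r_type are guesses at the schema, which is rather the point about the documentation:

mysql -u dpmmgr -p dpm_db <<'EOF'
-- count requests by type
SELECT r_type, COUNT(*) AS requests
  FROM dpm_req
 GROUP BY r_type;

-- who is behind the mystery 'B' rows?
-- (client_dn and clienthost are assumed column names - check the schema first)
SELECT client_dn, clienthost, COUNT(*) AS requests
  FROM dpm_req
 WHERE r_type = 'B'
 GROUP BY client_dn, clienthost;
EOF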

My plots of the DPM usage are far too spiky to make sense of at the moment, but I'll work on presenting the info a bit more clearly.
In the meantime I discovered that it's pretty obvious from the plots when we set torque to fill the job slots in host order (made it easier to drain nodes) and when we send nodes away to vendors.


Thursday, April 24, 2008

assimilation



I'd noticed that over the last month the load on our DPM headnode had been higher than before we switched on MonAMI checking of the DPM. Of course I instantly blamed the developer of said product. However, I disabled MonAMI to prove that the load would go down and lo... no change. Hmm.

I then started working out how to optimise the MySQL memory usage - a mysqldump -A gives us roughly a 1.8G ASCII file, yet the InnoDB tablespace takes up a whopping 4.4G on disk, with tiny, constantly rolling, 5M transaction logs.
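
For reference, the back-of-the-envelope comparison looks something like this (paths are the stock RHEL/SL MySQL locations, and the my.cnf values are illustrative rather than a recommendation):

# logical size of everything vs what InnoDB holds on disk
mysqldump -A -u root -p | wc -c        # ~1.8G of ASCII here
ls -lh /var/lib/mysql/ibdata1          # the shared InnoDB tablespace, ~4.4G
ls -lh /var/lib/mysql/ib_logfile*      # the tiny 5M transaction logs

# candidate knobs in /etc/my.cnf:
#   innodb_buffer_pool_size = 512M
#   innodb_log_file_size    = 64M   # old ib_logfiles must be moved aside before restarting with this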

As Paul was here at CERN (it's the WLCG Workshop this week) we got together to hammer out some changes to our implementation. When we logged onto svr018 (the DPM headnode) I noticed that monami was running again. Turns out that cfengine was "helpfully" restarting the process for us. Grr.

So, an evening of infrastructure management changes:
- We had i386 monami rpms installed - we'd hard-coded the repo path rather than using the $basearch variable in our local mirror (a sketch of the fix is below).
- We had to ensure that we had backup=false in cfengine - where we had a config_dir directive (such as /etc/monami.d and /etc/nrpe.d) the applications were often trying to use both someconfigfile.cfg and someconfigfile.cfg.cfsaved - ditto cron.d etc.
- We were sometimes trying to run 64-bit binaries on 32-bit architectures as we'd copied them straight from cfagent (normally nrpe monitors). We're now using $(ostype) in cfagent, which expands to linux_x86_64 and linux_i686 on our machines. (cfengine does set hard classes of 32_bit and 64_bit, but you can't use those in a variable.)
- We now have the 'core' nrpe monitors (disk space, load, some process checks) installed on ALL servers, not just the worker nodes. Ahem. Thought we'd implemented that before.
- We've upgraded to the latest CVS rpm of monami on some nodes and we've got grooovy MySQL monitoring - oh, and the load's gone down too.
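
The repo fix is trivial once you spot it - something like this in the .repo file on each node (the mirror URL here is made up):

cat > /etc/yum.repos.d/monami-local.repo <<'EOF'
[monami-local]
name=Local MonAMI mirror
# $basearch expands to i386 or x86_64 per host, so no more hard-coded arch
baseurl=http://mirror.example.ac.uk/monami/sl4/$basearch/
enabled=1
gpgcheck=0
EOF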

Tuesday, April 15, 2008

Oh no, not again...

We went through a little rash of SAM test failures last night. This turned out to be an LHCb user who was submitting jobs which filled up the scratch area on the worker nodes and turned them into black holes.

Obligatory GGUS ticket was raised.

We do alarm on disk space filling up on the worker nodes, but it was still 4 hours before action was taken to set the nodes offline and clean them. In that time an awful lot of jobs were destroyed. Makes me think we might want to automate the offlining of nodes which run out of disk space, pending investigation.
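
Something along these lines, run from the torque server when the disk alarm fires, would probably do it - the scratch mount point, threshold and ssh access are assumptions about our setup, not a tested recipe:

#!/bin/bash
# offline a worker node whose scratch area has filled, pending investigation
NODE=$1
SCRATCH=/tmp
THRESHOLD=95

USED=$(ssh "$NODE" df -P "$SCRATCH" | awk 'NR==2 {sub(/%/,"",$5); print $5}')

if [ "$USED" -ge "$THRESHOLD" ]; then
    # stop new jobs landing on the node; running jobs are left alone
    pbsnodes -o "$NODE" -N "scratch ${USED}% full - auto-offlined $(date +%F)"
fi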

Saturday, April 12, 2008

Splunk / nagios / logrotate

Well, I upgraded to nagios3 this evening on the cluster and noticed it has a new enable_splunk_integration option in cgi.cfg. I'd looked at Splunk before and thought 'hmm, nice idea, not sure it'll work with the grid stuff', but decided to give it a whirl.

First up, nagios gotchas: we had the DAG rpm installed, which hasn't been updated to 3.0, let alone the 3.0.1 release, so we went for the manual compile option. We then discovered that the (gd|libjpeg|libpng)-devel packages weren't installed - quickly fixed by yum.

I took the ./configure line from the spec file as a guide - however it managed to splat the CGIs into /usr/sbin rather than /usr/lib64/nagios/cgi. Thanks :-( Soon found them and moved them around. Seems to be working OK - I've not installed the newer WLCG monitors yet; that's the next task.
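
In hindsight, forcing the paths at configure time would have saved the shuffling. A sketch of what we should have run - the exact flag set is an assumption, so check against the spec file:

yum install gd-devel libjpeg-devel libpng-devel

./configure \
    --prefix=/usr \
    --sysconfdir=/etc/nagios \
    --localstatedir=/var/log/nagios \
    --sbindir=/usr/lib64/nagios/cgi \
    --libexecdir=/usr/lib64/nagios/plugins \
    --with-command-user=nagios \
    --with-command-group=nagios
make all && make install && make install-commandmode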

Splunk - looks flash, but is it any good? There's no sign of any educational pricing on their website and the 'free' version has one HUGE weakness - no user authorisation / login. A temporary workaround of some iptables rules reduced the risk, and I had a play. I defined /var/log on our central syslog server as a data source and watched it go.
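
The iptables band-aid amounts to something like this in front of the Splunk web port (8000 by default; the admin subnet below is a placeholder):

# allow the admin subnet in, drop everyone else
iptables -A INPUT -p tcp --dport 8000 -s 192.168.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 8000 -j DROP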

Well, sort of... it promptly filled /opt/splunk, as it makes an indexed copy of anything it finds - I think for a real install we'd need some new space on a disk. Secondly, it quickly swallowed more than its 500M/day 'free' allowance - I grabbed a 30-day trial licence of the enterprise version and lo, it now complains that I've had 2 licence violations of over 5G/day indexed. Harumph.

I'm not sure if this would settle down once it gets through the backlog of archived logfiles - perhaps if I feed it only a syslog FIFO it'd be happier. Also, we have the 'traditional' logrotate style of .1 .2 .3 etc. rather than the more dirvish-friendly dateext option - we should really swap... if the RHEL logrotate supports it :-/

"rpm -q logrotate --changelog" doesnt mention it although its fixed in fedora

The other issue is that Splunk thrashes the box as it indexes, and it's just stopped as it's filled the disk again. Ho hum.

Wednesday, April 02, 2008

A long time coming: UKI-SCOTGRID-ECDF on APEL

So, yes, it's probably taken a little longer than it might have, but UKI-SCOTGRID-ECDF is now publishing all its accounting data back to early January.

Of course, ops has a disturbingly high share of the Grid usage at the moment, but hopefully we will start to get ATLAS (and maybe even LHCb) jobs filtering in in the near future...

ECDF running for ATLAS

ECDF have now passed ATLAS production validation. The last link in the chain was ensuring that their SRMv1 endpoint was enabled on the UK's DQ2 server at CERN - this allows the ATLAS data management infrastructure to move input data to ECDF.

After that problem was corrected this morning, the input data was moved from RAL, a production pilot picked up the job and ran it, and the output data was moved back to RAL.

I have asked the ATLAS production people to enable ECDF in the production system and I have turned up the pilot rate to pull in more jobs.

We had a problem with the software area not being group writable (for some reason Alessandro's account mapping changed), but this has now been corrected and an install of 14.0.0 has been started.

It's wonderful to now have the prospect of running significant amounts of grid work on the ECDF resource. Well done to everyone!

Only one bite at the cherry...

I have modified the default RetryCount on our UIs to set zero retries. Automatic retries actually worked quite well for us when we were losing a lot of nodes to MCE errors (in the days before the upgrade to SL4, x86_64) - users' jobs would automatically rerun if they got lost and there was no need for them to worry about failures. Recently, however, we've seen users submitting more problematic jobs to the cluster - some fail to start at all, some run off into wallclock limits, others stall half way through. Often we have to gut the batch system with our special spoon, and having to do it four times because the RB/WMS keeps resubmitting the job is less than helpful.

For once cfengine's editfiles stanza was useful and a simple:

ui::
   { /opt/glite/etc/glite_wmsui_cmd_var.conf
     ReplaceFirst "RetryCount\s+=\s+[1-9];" With "RetryCount = 0;"
   }
   { /opt/edg/etc/edg_wl_ui_cmd_var.conf
     ReplaceFirst "RetryCount\s+=\s+[1-9];" With "RetryCount = 0;"
   }

got the job done.

Tuesday, April 01, 2008

Emergency Outage

Due to a vulnerability relating to the security flag in the IPv4 header, we will be taking UKI-SCOTGRID-GLASGOW offline today for an upgrade. In order to minimise downtime we shall be rebooting all worker nodes simultaneously rather than draining queues.

We aim to have this work completed by 12:00 midday today.

Please see RFC 3514 for more details. We advise other sites to perform this upgrade asap.