Monday, October 27, 2008

Don't panic - it's only a test...

Hmm. We had a malicious user's DN on the Glasgow system this morning. I'm sure other UKI sites may be affected too. Be careful with your cleanup processes, as we missed something the first time round. Grr.

Tuesday, October 21, 2008

"Oh no! Not again..."

After being all enthusiastic that the gSOAP errors had been nailed, we failed two SE tests in the last 24 hours. Exactly the same issue as before.

As this error message is so vague, it looks like lcg-rollout is our only hope.

I note in passing that Glasgow has one of the most reliable SEs in the UK for ATLAS (2.1% job loss, only beaten by Oxford who have 0.8%; UK average in Q3 was 8% loss) so this is particularly galling.

Shouldn't the results as seen by our real customers count for rather more than a once-an-hour stab in the dark from ops?

Sunday, October 19, 2008

Death to gSOAP...

Even after the successful upgrade of DPM we started to get plagued again by SAM test failures with the generic failure message:
httpg://svr018.gla.scotgrid.ac.uk:8443/srm/managerv1:
CGSI-gSOAP: Error reading token data header: Connection closed

This time they came principally from the SE test, instead of from the CE-rm test.

For a while I wondered if there was a DNS problem, but this seemed unlikely for two reasons:
  1. Durham use the .scotgrid.ac.uk domain, but they don't see errors.
  2. We see the connection in the srmv1 logs, so the host can be resolved.
Then I started to wonder if there was a CRL problem, as we occasionally get CRL warnings from SAM WN tests. We have an optimised CRL download system at Glasgow - the CE downloads CRLs as normal, then the remaining nodes mirror the CRLs from the CE. This means we make 1 outbound connection every 6 hours, instead of 150, which seems eminently sensible on a large cluster. However, the default cron interval on the nodes for processing CRLs is also 6 hours, which means that in the worst case a client node's CRLs could be up to 12 hours old.

On this suspicion I changed the CE configuration to download CRLs every hour, and the clients to download these from the CE every 4 hours.
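For the record, the new schedule amounts to a pair of /etc/cron.d-style entries roughly like the ones below. The hostname is a placeholder, we may well mirror by a different mechanism than rsync, and the exact fetch-crl invocation depends on the version installed, so treat this as a sketch rather than our actual config:

# On the CE: fetch fresh CRLs from the CAs every hour
0 * * * * root /usr/sbin/fetch-crl >> /var/log/fetch-crl-cron.log 2>&1

# On the other nodes: mirror the CE's certificates directory every 4 hours
15 */4 * * * root rsync -a --delete ce.example.org:/etc/grid-security/certificates/ /etc/grid-security/certificates/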

I made this change on Friday and, so far, we haven't seen the error again.

My eternal complaint with X509/openssl is why the error is reported as "CGSI-gSOAP: Error reading token data header: Connection closed" and not "CGSI-gSOAP: Error reading token data header: Connection closed [CRL for DN BLAH out of date]".

Is that so very hard to do?

Saturday, October 18, 2008

ScotGrid Edinburgh progress

Finally we are green for the latest Atlas releases...

We've made a lot of progress this past week with ECDF. It all started on Friday 10th Oct when we were trying to solve some Atlas installation problems in a somewhat ad hoc fashion.
We then incorrectly tagged/published a production release as valid. This caused serious problems for the Atlas jobs, which resulted in us being taken out of UK production and missing out on a lot of CPU demand. This past week we've been working hard to solve the problem, and here are a few of the things we found:

1) First of all, a few of us had access problems on the servers, so it was hard to see what was actually going on with the mounted Atlas software area. Some of this has now been resolved.

2) The installer was taking ages and then timing out (the proxy expiring, and eventually SGE killing it off). strace on the nodes linked this to very slow performance when doing many small chmod/write operations to the file system. We solved this with a twofold approach:
- Alessandro modified the installer script to be more selective about which files need chmoding, but the system was still very slow.
- The NFS export was then changed to allow asynchronous writes, which sped up the tiny writes to the underlying LUN considerably (a sketch of the change is just after this list). There is now a worry of possible data corruption, which should be borne in mind if the server goes down and/or we see Edinburgh-specific segvs/problems with a release. Orlando may want to post more information about the NFS changes later.

3) The remover and installer used ~3 GB and ~4.5 GB of vmem respectively, and the 6 GB vmem limit had only been applied to prodatlas jobs. The 3 GB vmem default started causing serious problems for sgmatlas. This has now been changed to 6 GB.
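
For reference, the async change in point 2 boils down to a one-word edit in /etc/exports on the software server. The path and host pattern below are made up for illustration, not our real ones:

# before (sync, the default): the server commits every small write to the LUN before replying
/exports/atlassw  *.ecdf.ed.ac.uk(rw,sync,no_root_squash)

# after: writes are acknowledged before they reach disk, which is far faster for
# thousands of tiny chmods/writes but risks losing data if the server crashes
/exports/atlassw  *.ecdf.ed.ac.uk(rw,async,no_root_squash)

followed by an exportfs -ra to re-export.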

We're also planning to add the "qsub -m a -M" SGE options on the CE to let the middleware team better monitor the occurrence of vmem aborts. We might also add a flag to make the SGE accounting logs easier to parse for APEL. Note: the APEL monitoring problem has been fixed, but that's for another post (Sam?)...
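
As a sketch of what that submission change means (the address is made up), jobs would go in with something like:

qsub -m a -M middleware-team@example.ac.uk ... jobscript.sh

where -m a asks SGE to send mail only when a job is aborted (for example, killed for exceeding its vmem limit) and -M sets the recipient.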

Well done to Orlando, Alessandro, Graeme and Sam for helping us get to the bottom of this!

Saturday, October 11, 2008

Well Done Guys!

Well, I was waiting for Mike and Andrew to blog this, but they haven't. They very successfully upgraded Glasgow's DPM to the native 64-bit version on Monday last week (when we had upgraded to SL4, only the 32-bit version was available). This was a significant step forward, but it required the head node and all of the disk servers to have their OS rebuilt without losing data, and the database restored onto the head node.

It went very well and we were up and running again within 6 hours - no data lost!

We are also seeing an improvement in the SAM test results, with the spurious 'gSOAP' errors which were plaguing us now seemingly having gone (fingers crossed!).

It's terrible that the LHC is not running right now, but it does mean that interventions like this can be done.

Great work guys!

chew 'em up, spit 'em out...

Failed SAM tests all day. When I checked the logs they'd all run on
node006. Logged in and...

Oct 11 16:58:40 node006 smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Oct 11 17:28:40 node006 smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Oct 11 17:58:41 node006 smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Oct 11 18:00:14 node006 pbs_mom: Invalid argument (22) in mem_sum, 5754: get_proc_stat
Oct 11 18:13:23 node006 pbs_mom: Invalid argument (22) in resi_sum, 8121: get_proc_stat
Oct 11 18:28:40 node006 smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Oct 11 18:58:40 node006 smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Oct 11 19:28:40 node006 smartd[3163]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
Oct 11 19:44:32 node006 pbs_mom: Invalid argument (22) in resi_sum, 9482: get_proc_stat

Took it offline and immediately we're back.

It's just amazing that one bad node in 142 can kill off a whole site for SAM... it took out 3626 jobs in less than 12 hours.

This is really torque's fault - it should have a bad node sensor at the batch system level.
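
To be fair, pbs_mom can run a periodic node health-check script, which would have caught this. A minimal sketch (the paths and the script itself are my own invention, not something we currently run) would be to add to mom_priv/config on each node:

$node_check_script /var/spool/pbs/mom_priv/node_check.sh
$node_check_interval 10

with node_check.sh being something as crude as:

#!/bin/sh
# Report the node as bad if smartd has logged pending-sector errors
if grep -q "Currently unreadable (pending) sectors" /var/log/messages; then
    echo "ERROR pending sectors on local disk"
fi

As I understand it, any output starting with ERROR marks the node down and the message shows up in pbsnodes, so the scheduler stops sending jobs there.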

(As an aside it didn't affect ATLAS production at all, because if a node is so bad that the pilot doesn't start then it never pulls in a real job.)

Friday, October 10, 2008

NGS - Your software is here too!

A long-overdue action on me was to assist David in getting the NGS software published correctly. The WLCG software already has the architecture for .list files to be created by the SGM users, but the NGS relies on parsing the contents of /usr/ngs.

I'll admit to being totally confused by the interactions of the various BDII components - they are a horribly complex and interwoven collection of scripts / providers / plugins / programs. I understand that counselling is available for those who spend too long working with them.

Anyway - attempts to get a new plugin to simply provide the NGS software failed horribly, and I ended up patching /opt/lcg/libexec/lcg-info-dynamic-software:


--- lcg-info-dynamic-software.orig 2007-11-22 14:25:02.000000000 +0000
+++ lcg-info-dynamic-software 2008-10-10 22:02:15.000000000 +0100
@@ -1,8 +1,10 @@
 #!/usr/bin/perl -w
 
 use strict;
+use IO::Dir;
 
 my $path="/opt/edg/var/info";
+my $ngspath="/usr/ngs";
 my @output; # ldif output that is sent to std out.
 my @dirs; # The contents of the path
 my @ldif_file; # Content of the static ldif file
@@ -23,7 +25,7 @@
 exit 1
 }
 
-#Finds the installed software
+#Finds the installed software (glite)
 @dirs=`ls $path`;
 foreach(@dirs){
 chomp;
@@ -40,6 +42,13 @@
 }
 }
 
+# Do the same for the NGS software
+my @tags = sort grep { /^[A-Z0-9]+_?/ } ( IO::Dir->new($ngspath)->read );
+for my $t (@tags) {
+    push @exp_soft, "GlueHostApplicationSoftwareRunTimeEnvironment: NGS-$t\n";
+    push @exp_soft, "GlueHostApplicationSoftwareRunTimeEnvironment: $t\n";
+}
+
 #Produces the output from the static ldif file and the install software.
 for (@ldif_file){
 if(/dn:\s+GlueSubClusterUniqueID=/){
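
For what it's worth, with that patch in place a tag directory like /usr/ngs/GCC (an invented example - the real tags will be whatever NGS installs there) should end up advertised in the subcluster's GLUE entry as:

GlueHostApplicationSoftwareRunTimeEnvironment: NGS-GCC
GlueHostApplicationSoftwareRunTimeEnvironment: GCC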

Sunday, October 05, 2008

logs logs logs

Those of you who don't pore over the latest bug reports constantly may have missed that Red Hat have fixed bug 208538 (see http://rhn.redhat.com/errata/RHBA-2008-0703.html):

"logrotate in Red Hat Enterprise Linux 4 did not support the maxage and dateext configuration parameters. Usage of these parameters has been backported and is now available to users of Red Hat Enterprise Linux 4."


Basically, logrotate-3.7.1-10 now works as you'd expect from most other common non-stoneage Linuxes and allows logs to be saved with date-stamped (YYYYMMDD) extensions, thus avoiding the huge nightly renaming sessions that force hard-link based backup systems (dirvish) to back up the whole log directory each night.
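
By way of illustration (the file and the retention numbers are just an example, not a recommended policy), an entry can now look like this:

/var/log/messages {
    daily
    rotate 30
    # rotated copies get a date-stamped suffix rather than being renamed .1, .2, ...
    dateext
    # and anything older than 30 days is cleaned up
    maxage 30
    compress
    missingok
}

Because previously rotated files keep a fixed name, dirvish's hard links stay valid and only the newest log actually gets copied each night.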