Wednesday, May 07, 2008

CRHell...

So, even after fixing our CE disk space problems we still were failing SAM tests. The error was one I'd never seen before: "7 an authentication operation failed".

The GOC wiki hinted at a few things which can cause this, but as it only seemed to affect SAM tests (Steve Lloyd jobs and my ATLAS production were both running fine) there wasn't a lot that could really be debugged locally - we had even had a successful SAM test from Rafa's SAM Admin interface at 11am.

One of the classic X509 errors though, is CRLs being out of date. When I checked the CRL files on the CE is was clear something was amiss. A few were dated today, but most were approaching 5 days old. When I ran the CRL updater by hand I got the error:

fetch-crl[19144]: 20080507T154251+0100 updating CRL 'CERN Trusted Certification Authority (1d879c6c)'
fetch-crl[19144]: 20080507T154251+0100 File /etc/grid-security/certificates//1d879c6c.r0 valid: no
fetch-crl[19144]: 20080507T154251+0100 Attempt to overwrite /etc/grid-security/certificates//1d879c6c.r0 failed since the original is not a valid CRL file

What's worse is that not only did this CRL fail to update, it caused the updater script to bomb and no CRL after this point even attempted to update!

In the end I had to delete all of the CRLs on the CE and re-run the script to get fresh copies.

What exactly happened I do not know, but the relevant command in fetch-crl is:
openssl crl -hash -in CERT_REVOKE_FILE -noout -inform PEM -text
This should produce, as a first line, the hash for the CRL, which is the same as the file name. however, testing one of the bad files I got instead:
unable to load CRL
12697:error:0906D066:PEM routines:PEM_read_bio:bad end line:pem_lib.c:731:
So clearly they had gone evil in some way.

I think this is poor behaviour from the script, so I have submitted a GGUS ticket (https://gus.fzk.de/pages/ticket_details.php?ticket=36191).

X509 error messages just suck so badly.

2 comments:

Paul Millar said...

From the GGUS ticket attachment, it looks like the CRL is truncated.

Assuming this is so, perhaps the CERN server that provides the CRL was rebooted whilst the file was being transfered, or there was a network glitch (with wget using TCP KEEPALIVEs), or the process of writing the CRL on the server is not atomic (and Glasgow was unlucky), or ...

Cheers,
Paul

(who likes a good puzzle as much as the next person)

Graeme Stewart said...

The more prosaic explanation is that the CE ran out of disk space while updating the CRLs, so the file got corrupted then.

CERN spotted the same issue 8 months ago, but the fix is still not released to production. See https://savannah.cern.ch/bugs/index.php?29559 for the gory details...