The GOC wiki hinted at a few things which can cause this, but as it only seemed to affect SAM tests (Steve Lloyd jobs and my ATLAS production were both running fine) there wasn't a lot that could really be debugged locally - we had even had a successful SAM test from Rafa's SAM Admin interface at 11am.
One of the classic X509 errors though, is CRLs being out of date. When I checked the CRL files on the CE is was clear something was amiss. A few were dated today, but most were approaching 5 days old. When I ran the CRL updater by hand I got the error:
fetch-crl: 20080507T154251+0100 updating CRL 'CERN Trusted Certification Authority (1d879c6c)'
fetch-crl: 20080507T154251+0100 File /etc/grid-security/certificates//1d879c6c.r0 valid: no
fetch-crl: 20080507T154251+0100 Attempt to overwrite /etc/grid-security/certificates//1d879c6c.r0 failed since the original is not a valid CRL file
What's worse is that not only did this CRL fail to update, it caused the updater script to bomb and no CRL after this point even attempted to update!
In the end I had to delete all of the CRLs on the CE and re-run the script to get fresh copies.
What exactly happened I do not know, but the relevant command in fetch-crl is:
openssl crl -hash -in CERT_REVOKE_FILE -noout -inform PEM -textThis should produce, as a first line, the hash for the CRL, which is the same as the file name. however, testing one of the bad files I got instead:
unable to load CRLSo clearly they had gone evil in some way.
12697:error:0906D066:PEM routines:PEM_read_bio:bad end line:pem_lib.c:731:
I think this is poor behaviour from the script, so I have submitted a GGUS ticket (https://gus.fzk.de/pages/ticket_details.php?ticket=36191).
X509 error messages just suck so badly.