Sunday, October 19, 2008

Death to gSOAP...

Even after the successful DPM upgrade, we were again plagued by SAM test failures carrying the generic failure message:
httpg://svr018.gla.scotgrid.ac.uk:8443/srm/managerv1:
CGSI-gSOAP: Error reading token data header: Connection closed

This time they came principally from the SE test, instead of from the CE-rm test.

For a while I wondered if there was a DNS problem, but this seemed unlikely for two reasons:
  1. Durham also use the .scotgrid.ac.uk domain, but they don't see errors.
  2. We see the connection in the srmv1 logs, so the host can be resolved.
Then I started to wonder if there was a CRL problem, as we occasionally get CRL warnings from SAM WN tests. We have an optimised CRL download system at Glasgow: the CE downloads CRLs as normal, then the remaining nodes mirror the CRLs from the CE. This means we make 1 outbound connection every 6 hours, instead of 150, which seems eminently sensible on a large cluster. However, the default cron interval for processing CRLs on the nodes is also 6 hours, which means that, in the worst case, CRLs on client nodes could be up to 12 hours old.

On this suspicion I changed the CE configuration to download CRLs every hour, and the clients to download these from the CE every 4 hours.
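The worst-case arithmetic can be sketched as follows. This is a minimal illustration, assuming the CE and client crons are not synchronised with each other; the function name is hypothetical and not part of any grid tool.

```python
def worst_case_crl_age_hours(ce_interval_hours, client_interval_hours):
    """Upper bound on the age of a CRL held by a client node.

    Assumption: a client may mirror just before the CE refreshes, so it
    copies a CRL that is already almost ce_interval_hours old, and then
    keeps it for a further client_interval_hours until its next mirror run.
    """
    return ce_interval_hours + client_interval_hours


# Old setup: CE downloads every 6 hours, clients mirror every 6 hours.
old_worst_case = worst_case_crl_age_hours(6, 6)

# New setup: CE downloads every hour, clients mirror every 4 hours.
new_worst_case = worst_case_crl_age_hours(1, 4)

print(old_worst_case, new_worst_case)
```

Under that assumption the change cuts the worst case on a client from 12 hours to 5.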

I made this change on Friday and, so far, we haven't seen the error again.

My eternal complaint with X509/openssl is that the error is reported as "CGSI-gSOAP: Error reading token data header: Connection closed" rather than "CGSI-gSOAP: Error reading token data header: Connection closed [CRL for DN BLAH out of date]".

Is that so very hard to do?
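Until then, a local check can at least name a stale CRL explicitly. The sketch below is only an illustration: it assumes you already have the `nextUpdate=...` line that `openssl crl -in FILE -noout -nextupdate` prints, and the function names are hypothetical rather than part of gSOAP, CGSI or openssl.

```python
from datetime import datetime


def parse_next_update(line):
    """Parse a line such as 'nextUpdate=Oct 19 12:00:00 2008 GMT'."""
    key, _, value = line.strip().partition("=")
    if key != "nextUpdate":
        raise ValueError("not a nextUpdate line: %r" % line)
    # openssl reports the time in GMT; the result is a naive UTC datetime.
    return datetime.strptime(value, "%b %d %H:%M:%S %Y GMT")


def crl_status(line, now):
    """Return a readable status for a CRL, given 'now' as naive UTC."""
    next_update = parse_next_update(line)
    if now > next_update:
        return "CRL out of date since " + next_update.isoformat()
    return "CRL valid until " + next_update.isoformat()


# Example with made-up timestamps: the CRL expired an hour before 'now'.
status = crl_status("nextUpdate=Oct 19 12:00:00 2008 GMT",
                    datetime(2008, 10, 19, 13, 0, 0))
print(status)
```

Running something like this over each mirrored CRL file on a client would show which CA's CRL has gone stale, which is exactly the detail the gSOAP message leaves out.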
