Friday, June 15, 2007

A Less Reliable Week

After weeks of perfection, I'm now picky about even a small number of failures.

We had 4 BDII timeouts this week, which is worrying. My inclination is to give R-GMA a hard stare: its CLOSE_WAIT sockets have been misbehaving again, and we see load and network spikes which seem to be R-GMA related. However, the BDII was perfect even when R-GMA was holding more than 1000 sockets in CLOSE_WAIT, so the link is far from clear. When I looked with top, most of the CPU was actually being consumed by slapadd and slapd. Well, one to keep an eye on.
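
For keeping an eye on the CLOSE_WAIT count over time, something like the following rough sketch would do. It assumes a Linux host, where /proc/net/tcp lists one socket per line with the state as a hex code in the fourth column (08 is CLOSE_WAIT). The function name and the sample text are made up for illustration, and it counts every socket on the box, not only R-GMA's.

```python
from collections import Counter

# Hex state codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED",
    "06": "TIME_WAIT",
    "08": "CLOSE_WAIT",
    "0A": "LISTEN",
}

def count_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts

# Made-up sample in the same column layout (trailing columns omitted).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 0100007F:0A8D 00000000:0000 0A 00000000:00000000
   1: 0100007F:1F90 0100007F:C350 08 00000000:00000000
   2: 0100007F:1F90 0100007F:C351 08 00000000:00000000
   3: 0100007F:C352 0100007F:1F90 01 00000000:00000000
"""

sample_counts = count_states(SAMPLE)
print("CLOSE_WAIT sockets in sample:", sample_counts["CLOSE_WAIT"])
```

On the real machine the input would be the contents of /proc/net/tcp (and probably /proc/net/tcp6 as well, if the service listens on IPv6 sockets). Logging the count every few minutes would at least show whether the BDII timeouts line up with the spikes.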

We also had a couple of JS failures with the error message "Got a job held event, reason: Globus error 79: connecting to the job manager failed.". The GOC wiki suggests this is probably a networking problem, but that seems unlikely. Is this the gatekeeper deciding that the presented certificate does not match that of the submitted job? There's also a suggestion that GLOBUS_TCP_PORT_RANGE might be wrong, but we've never changed this from the default 20000-25000 range, so that also seems unlikely.
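
To rule out the port range properly rather than from memory, a small check along these lines could be run in the environment the job manager sees. This is only a sketch: I'm assuming the variable holds two numbers separated by a comma or a space, the helper names are my own, and the expected range is simply the 20000-25000 quoted above.

```python
import os

# The range we believe is configured on this site.
EXPECTED_RANGE = (20000, 25000)

def parse_port_range(value):
    """Parse 'min,max' or 'min max' into a (low, high) tuple of ints."""
    parts = value.replace(",", " ").split()
    if len(parts) != 2:
        raise ValueError("expected two port numbers, got %r" % value)
    low, high = int(parts[0]), int(parts[1])
    if not (0 < low <= high <= 65535):
        raise ValueError("invalid port range %r" % value)
    return low, high

def check_port_range(env=os.environ):
    """Return 'unset', 'ok' or 'differs' for GLOBUS_TCP_PORT_RANGE in env."""
    raw = env.get("GLOBUS_TCP_PORT_RANGE")
    if raw is None:
        return "unset"
    return "ok" if parse_port_range(raw) == EXPECTED_RANGE else "differs"

print("GLOBUS_TCP_PORT_RANGE:", check_port_range())
```

An "unset" or "differs" result would not by itself explain error 79, but it would put the port range back on the list of suspects.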

Again, this probably requires some detailed examination of the gatekeeper logs to see if the connection got through.

