Tuesday, March 27, 2007

Rare DPM srmPutDone Failures at Glasgow

Just a place holder really. We failed a SAM RM test this morning, and for the first time in I can recall it was a DPM failure, not an information system problem.

The error message was Setting SRM transfer to 'done' failed: Unregistering alias from catalog.+ result=1.

Looking in my SRM logs clearly something was badly wrong:
03/27 06:04:25 15848,0 put: request by /C=CH/O=CERN/OU=GRID/CN=Piotr Nyczyk 6217
from node139.beowulf.cluster
03/27 06:04:25 15848,0 put: SRM98 - put 389251 389251
03/27 06:04:25 15848,0 put: SRM98 - put 0 srm://svr018.gla.scotgrid.ac.uk/dpm/gl
a.scotgrid.ac.uk/home/ops/generated/2007-03-27/file92308f95-2e1e-43ba-b8bf-da8f1
c7c416e
03/27 06:04:27 15848,0 getRequestStatus: request by /C=CH/O=CERN/OU=GRID/CN=Piot
r Nyczyk 6217 from node139.beowulf.cluster
03/27 06:04:27 15848,0 getRequestStatus: SRM98 - getRequestStatus 389251
03/27 06:04:27 15848,0 getRequestStatus: returns 0
03/27 06:04:28 15848,0 setFileStatus: request by /C=CH/O=CERN/OU=GRID/CN=Piotr N
yczyk 6217 from node139.beowulf.cluster
03/27 06:04:28 15848,0 setFileStatus: SRM98 - setFileStatus 389251 0 Running
03/27 06:04:28 15848,0 setFileStatus: returns 0
03/27 06:04:29 15848,0 setFileStatus: request by /C=CH/O=CERN/OU=GRID/CN=Piotr N
yczyk 6217 from node139.beowulf.cluster
03/27 06:04:33 15848,1 setFileStatus: SRM98 - setFileStatus 389251 0 Done
03/27 06:04:33 15848,1 setFileStatus: dpm_putdone failed: Internal error
03/27 06:04:33 15848,1 setFileStatus: returns 12
03/27 06:04:33 15848,1 srmv1: SRM02 - soap_serve error : Internal error
So this is a bit worrying.

I looked back through the DPM logs and this is not the first time this has happened. The signature is the dpm_putdone failed: Internal error. There have been 10 of these in the last 90 days:

svr018:/var/log/srmv1# zgrep "dpm_putdone failed: Internal" *
log:03/27 06:04:33 15848,1 setFileStatus: dpm_putdone failed: Internal error
log.1:03/26 20:09:59 15848,0 setFileStatus: dpm_putdone failed: Internal error
log.20.gz:03/08 01:00:55 3442,0 setFileStatus: dpm_putdone failed: Internal error
log.21.gz:03/06 15:53:49 3442,1 setFileStatus: dpm_putdone failed: Internal error
log.21.gz:03/06 15:53:49 3442,5 setFileStatus: dpm_putdone failed: Internal error
log.21.gz:03/07 00:08:22 3442,2 setFileStatus: dpm_putdone failed: Internal error
log.53.gz:02/02 09:50:57 3442,0 setFileStatus: dpm_putdone failed: Internal error
log.59.gz:01/27 07:14:37 3442,0 setFileStatus: dpm_putdone failed: Internal error
log.62.gz:01/24 15:03:46 3442,0 setFileStatus: dpm_putdone failed: Internal error
log.83.gz:01/03 06:35:18 3442,1 setFileStatus: dpm_putdone failed: Internal error
The return code is always 12.

I shall raise this with the DPM people...

OK, done. See GGUS #20178. Just to put this in perspective, we've had ~500000 successful srmPutDone calls, so the failure rate is 2x10^-5.

2 comments:

Graeme Stewart said...

Discussing with Sohpie she said that an error code 12 is generic, and asked me to look at the DPM/DPNS logs.

When I did this the culprit was clear: DPM was losing contact with MySQL:

03/27 06:04:33 22088,25 dpm_srv_putdone: DP092 - putdone request by /C=CH/O=CERN/OU=GRID/CN=Piotr Nyczyk 6217 (103,115) from svr018.gla.scotgrid.ac.uk
03/27 06:04:33 22088,25 dpm_srv_putdone: DP098 - putdone 389251 1
03/27 06:04:33 22088,25 dpm_get_req_by_token: mysql_query error: Lost connection to MySQL server during query
03/27 06:04:33 22088,25 dpm_get_req_by_token: Trying to reconnect
03/27 06:04:33 22088,25 dpm_srv_putdone: returns 0, status=DPM_FAILED|1015

Looks like there needs to be a better reconnect strategy.

Hey, I've just realised I'm debugging a grid middleware component at the 10^-5 level! That is damn good. Love DPM!

Graeme Stewart said...

So, Jean-Philippe knows the reason. Here's the response to the ticket:

"Hi Graeme,

I'll try to summarize Jean-Philippe's explanations.

Seems that your maximum idle timeout is too low.

To prevent this in the LFC or the DPNS, we have a solution:
- the first thing done is always the virtual id check
- so, if a MySQL error occurs
- a reconnect is done
- and a retry starts

But for the DPM, we can't do that, as:
- there is no "first thing always done"
- the MySQL error can happen in the middle of several "tasks"
- so, we cannot do a retry from the beginning, because you might end up with inconsistencies.

We'll try to see how to have a "first thing always done", that always contacts the database (like getting the MySQL server version or querying DUAL in Oracle).

Until then, please increase your MySQL "maximum idle timeout".

Thanks, Sophie."

I shall work out a recipe for applying the fix for now. It's a bit annoying that the gLite MySQL 4.1 doesn't come with a working mysql configuration file, but just uses default settings. However, it should now be too hard to do this.