Tuesday, August 14, 2007

DPM Gridftp Resource Consumption

Durham was suffering from excessive resource consumption caused by "hung" dpm.gsiftp connections from ATLAS transfers. Because of the way that gridftp v1 servers work, huge buffers were being held in memory, leading to resource exhaustion on the machine and a subsequent crash.

Phil and I discussed this, and I noticed that the active network connections were to the RAL FTS server, not to the source SRM, so it looked like it was the control channel which was hung open, not the data channel.

Greig had a look on Glasgow's servers and discovered the same problem, but we were relatively unaffected due to the whopping 8GB of RAM we have in each disk server (and by having 9 disk servers, presumably). Cambridge also reported problems.

The issue is being looked at by the DPM developers, but for the moment Phil's had to write a cron script to kill off the hung ftps to keep gallow's head above water.
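
For anyone wanting to do something similar, here is a rough sketch of what such a cleanup job might look like. This is not Phil's actual script. The process name as it appears in `ps`, the age threshold, and the idea of treating "running for a long time" as "hung" are all assumptions for illustration; a real check would probably also want to look at the state of the connection.

```python
#!/usr/bin/env python3
"""Sketch: terminate gridftp processes that have been running suspiciously long."""
import os
import signal
import subprocess

PROCESS_NAME = "dpm.gsiftp"   # assumption: command name as shown by `ps`
MAX_AGE_SECONDS = 6 * 3600    # assumption: anything older is treated as hung


def find_stale_pids(ps_output, name=PROCESS_NAME, max_age=MAX_AGE_SECONDS):
    """Return PIDs whose command matches `name` and whose age exceeds `max_age`.

    Expects header-less lines of the form "<pid> <elapsed seconds> <command>".
    """
    stale = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pid, age, comm = fields
        if comm.strip() == name and int(age) > max_age:
            stale.append(int(pid))
    return stale


def kill_stale():
    """List processes with ps and send SIGTERM to the stale ones."""
    # Separate -o options with empty headers suppress the header line.
    ps_output = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etimes=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid in find_stale_pids(ps_output):
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # process exited between listing and kill
```

A cron entry would then run a small wrapper that calls `kill_stale()` every few minutes. Keeping the selection logic in `find_stale_pids` separate from the killing means it can be checked against captured `ps` output before being let loose on a production disk server.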