Monday, May 05, 2008

ECDF down for the moment

ECDF have been having real trouble with GPFS in the last week, which gave us some miserable results (a 23% pass rate on SAM, cf. the UK average of 75%). For the moment the systems team have suspended job submission, and the site went into downtime on Friday.

This may or may not be related to the problems we see with the Globus job wrapper code on ECDF, where the GPFS daemon consumes up to 300% CPU because of a strange file access pattern in the job home directory. Sam is working on installing an SL4 CE (based on the GT4 code) to see if this improves matters.

2 comments:

Ian Foster said...

Glad to hear that you are trying GT4. Please let us know if you have any questions. We've been having good success of late with GT4 on TeraGrid, and have also been making significant improvements to GRAM4 scalability and robustness.

Regards -- Ian Foster.

Graeme Stewart said...

Hello Ian

We are trying out GT4 now. However, there seems to be a generic mismatch between GPFS systems and access patterns that require lots of small files to be open; GPFS does not perform well in this environment. A couple of sites we spoke to have gone back to NFS for pool account areas for this reason.

Cheers

Graeme