Finally we are green for the latest Atlas releases...
We've made a lot of progress this past week with ECDF. It all started on Friday 10th Oct when we were trying to solve some Atlas installation problems in a somewhat ad hoc fashion.
We then incorrectly tagged/published the site as having a valid production release. This caused serious problems with the Atlas jobs, which resulted in us being taken out of UK production and missing out on a lot of CPU demand. This past week we've been working hard to solve the problem, and here are a few things we found:
1) First of all, a few of us had problems accessing the servers, so it was hard to see what was actually going on with the mounted Atlas software area. Some of this has now been resolved.
2) The installer was taking ages and then timing out (proxy, and also SGE eventually killing it off). strace on the nodes linked this to very slow performance when doing many chmod writes to the file system. We solved this with a two-fold approach:
- Alessandro modified the installer script to be more selective about which files need chmod-ing, but the system was still very slow.
- The NFS export was then changed to allow asynchronous writes, which considerably sped up the tiny writes to the underlying LUN. There is now a worry about possible data corruption, so this should be borne in mind if the server goes down and/or we see Edinburgh-specific segv/problems with a release. Orlando may want to post more information about the NFS changes later.
3) The remover and installer used ~3 GB and 4.5 GB of vmem respectively, but the 6 GB vmem limit had only been applied to prodatlas jobs. The 3 GB vmem default started causing serious problems for sgmatlas. This has now been changed to 6 GB.
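As a rough illustration of the "more selective chmod" idea from point 2, here is a minimal Python sketch. It is not Alessandro's actual change to the installer; the function name selective_chmod and the choice of permission bits are assumptions made for the example. The idea is simply to stat each entry first and only issue a chmod when the required bits are missing, so that files which are already fine do not generate a write to the NFS server.

```python
import os
import stat


def selective_chmod(root, required_bits=stat.S_IRGRP | stat.S_IROTH):
    """Add required_bits to entries under root that lack them.

    Hypothetical helper for illustration only. Returns the number of
    chmod calls actually issued.
    """
    changed = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                # os.chmod would follow the link, so leave symlinks alone.
                continue
            mode = stat.S_IMODE(os.lstat(path).st_mode)
            if mode & required_bits != required_bits:
                # Only write when the permissions really need to change.
                os.chmod(path, mode | required_bits)
                changed += 1
    return changed
```

Running it a second time over the same tree should issue no chmod calls at all, which is the behaviour you want when re-running an installer over a mostly correct software area.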
We're also planning to add the "qsub -m a -M" SGE options on the CE so that the middleware team can better monitor the occurrence of vmem aborts. We might also add a flag to make the SGE accounting logs easier to parse for APEL. Note: the APEL monitoring problem has been fixed. However, that's for another post (Sam?)...
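To make the vmem numbers from point 3 concrete, here is a small Python sketch of the comparison involved. It is only an illustration: the function would_abort is hypothetical, the peak figures are the approximate ones quoted above, and it assumes a job is killed once its peak vmem goes over the limit.

```python
GB = 1024 ** 3


def would_abort(peak_vmem_bytes, limit_bytes):
    """Return True if a job with this peak vmem would go over the limit."""
    return peak_vmem_bytes > limit_bytes


# Approximate peak usage quoted in this post; the remover at ~3 GB sits
# right on the old default, so in practice it is borderline.
peaks = {"remover": 3.0 * GB, "installer": 4.5 * GB}

for limit_gb in (3, 6):
    for job, peak in peaks.items():
        verdict = "over limit" if would_abort(peak, limit_gb * GB) else "fits"
        print(f"{job} with a {limit_gb} GB limit: {verdict}")
```

The installer clearly cannot fit under a 3 GB limit but has headroom under 6 GB, which matches what we saw for sgmatlas.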
Well done to Orlando, Alessandro, Graeme and Sam for helping us get to the bottom of this!