Saturday, October 18, 2008

ScotGrid Edinburgh progress

Finally we are green for the latest Atlas releases...

We've made a lot of progress this past week with ECDF. It all started on Friday 10th Oct when we were trying to solve some Atlas installation problems in a somewhat ad hoc fashion.
We then incorrectly tagged/published the site as having a valid production release. This caused serious problems with the Atlas jobs, which resulted in us being taken out of UK production and missing out on a lot of CPU demand. This past week we've been working hard to solve the problem, and here are a few things we found:

1) First of all, a few of us had access problems to the servers, so it was hard to see what was actually going on with the mounted Atlas software area. Some of this has now been resolved.

2) The installer was taking ages and then timing out (the proxy expired, and SGE eventually killed it off). strace on the worker nodes linked this to very slow performance when doing many small chmod/write operations on the file system. We solved this with a two-fold approach (a sketch of the export change follows this list):
- Alessandro modified the installer script to be more selective about which files need chmoding, but the system was still very slow.
- The NFS export was then changed to allow asynchronous writes, which sped up the tiny writes to the underlying LUN considerably. There is now a worry about possible data corruption, which should be borne in mind if the server goes down and/or we see Edinburgh-specific segfaults/problems with a release. Orlando may want to post more information about the NFS changes later.
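For reference, the export change boils down to swapping sync for async in /etc/exports and re-exporting. The path and client range below are placeholders rather than our actual config:

    # /etc/exports -- paths and hosts are illustrative only
    # before: synchronous writes (safe, but slow for many tiny metadata ops)
    /exports/atlas-sw  172.16.0.0/16(rw,sync,no_root_squash)
    # after: asynchronous writes (much faster, but data can be lost if the server dies)
    /exports/atlas-sw  172.16.0.0/16(rw,async,no_root_squash)

    # re-export without having to remount the clients
    exportfs -r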

3) The remover and installer used ~3 GB and ~4.5 GB of vmem respectively, but the 6 GB vmem limit had only been applied to prodatlas jobs. The 3 GB vmem default started causing serious problems for sgmatlas jobs. This has now been changed to 6 GB.
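For the curious, the vmem limit in Grid Engine is just the h_vmem resource; the change amounts to something like the following (the values and script name are illustrative, not our exact config):

    # request 6 GB of virtual memory for a single job at submit time
    qsub -l h_vmem=6G install_release.sh

    # or make 6 GB the cluster-wide default by adding this line to
    # $SGE_ROOT/default/common/sge_request
    -l h_vmem=6G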

We're also planning to add the "qsub -m a -M" SGE options on the CE to let the middleware team better monitor the occurrence of vmem aborts (sketched below). We might also add a flag to help better parse the SGE accounting logs for APEL. Note: the APEL monitoring problem has been fixed. However, that's for another post (Sam?)...
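Something along these lines is what we have in mind (the address and script name are placeholders):

    # mail the middleware team when a job aborts, e.g. killed for exceeding h_vmem
    # -m a : send mail on abort; -M : where to send it
    qsub -m a -M grid-support@example.ac.uk job_script.sh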

Well done to Orlando, Alessandro, Graeme and Sam for helping us get to the bottom of this!

1 comment:

Orlando Richards said...

On the NFS issues - we spotted the slow chmods with an strace attached to the parent process, along with watching "nfsstat" to see what was happening on NFS.
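Roughly speaking, the commands involved were of this form (the PID being that of the installer's parent process at the time):

    # attach to the running installer and watch the system calls with timestamps
    strace -f -tt -p <PID>

    # on the NFS server, watch the mix of NFS operations (setattr, write, ...)
    nfsstat -s
    # or the equivalent view from the client side
    nfsstat -c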

We have a SAN platform providing the storage for this system, and the performance of that is not the same as one might expect from local disks. It is also not necessarily the same as might be expected from a SAN ("Ooh - you've got a SAN, that must be high performance"), since in this case the storage platform is built from SATA disks. The NFS filesystem is built on a 4+1P RAID5 volume, which is part of a shared storage platform.

For synchronous NFS writes, the latency involved in writing to this SAN kit becomes obvious in the speed of the application. In this situation, the client sends a chmod command to the NFS server, which then sends a data write command to disk, which then writes the data and sends an acknowledgment back to the NFS server, which then sends an acknowledgment back to the NFS client that the operation has completed. The client then sends the next chmod command.

It is also possible that the NFS server waits for a short time to see if any other writes are about to arrive over the network that can be sent in the same command to the disks - we didn't look into this, but using no_wdelay would stop it doing that (possibly at the expense of less efficient writes to disk for data operations).

Using asynchronous writes allows the server -> disk communication to be skipped, allowing much faster turnover of commands. Of course, there is no wdelay in this case either. This, in turn, allows the server to also make far more efficient writes to disk when it does so.
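As an aside, the options an export is actually running with (sync/async, wdelay/no_wdelay) can be confirmed on the server, and the mount options on a worker node - the path here is illustrative:

    # on the NFS server: list active exports with their full option set
    exportfs -v

    # on a client: check how the software area is actually mounted
    grep atlas /proc/mounts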

Potential problems with using async centre on data corruption in the event of a problem with the server - and this includes the underlying SAN services. If, for example, the SAN volume became overloaded with I/O and started spitting out SCSI errors (this has sort of happened in the past), then the data could become corrupted. This ought to be picked up by the software validation process though, and we can do a re-install at that point if required.

We are also keen to look at how to implement some kind of resilience in the NFS services. Our systems are, in general, built from the ground up to provide a resilient service with no single points of failure (and we have almost achieved this) - the NFS services for Atlas break this model. Initial thoughts in this area involve having an rsync'd copy of the data on the other pool server, with (manual) failover controlled by changing a symbolic link that points to the relevant NFS mount point, or possibly adjusting the automounter client config. A resilient setup is desirable not just from a disaster-recovery point of view, but also to allow continuous service during maintenance operations.
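Purely as a sketch of that idea - the hostnames and paths below are made up - the manual failover could look something like:

    # keep a copy of the software area in step on the second pool server
    rsync -a --delete /exports/atlas-sw/ pool2:/exports/atlas-sw/

    # the worker nodes reach the software area via a symlink; failing over
    # means repointing it at the mount of the other server
    ln -sfn /mnt/nfs-pool2/atlas-sw /opt/atlas-sw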

Of course, ideally we would do away with NFS entirely and use the GPFS filesystem to provide the required resilient and high-performance storage services. However, we have seen that something in the atlas environment REALLY didn't like our GPFS file system.