Seeing Sam's post about NFS prompted me to mention 'nmon' - it's kinda like 'top' on steroids and does particularly useful trend plotting. Originally a hack for AIX, but ported to Linux once a certain vendor realised people weren't just buying PowerPC systems....
Anyway - go grab it from http://www.ibm.com/developerworks/aix/library/au-analyze_aix/ - the Linux version is now open source, I see.
Wednesday, November 11, 2009
NFS Load Tweaks: a Brief Guide for the Interested Enthusiast
I was asked about the mystery of NFS server tweaking in a dteam meeting, so I thought I'd compile this brief blog post.
As with all actions, there are two steps: first, gather your information; second, act on it.
1) Determining your current NFS load statistics.
NFS logs useful information in its /proc entry...
so:
> cat /proc/net/rpc/nfsd
rc 0 28905480 1603148913
fh 133 0 0 0 0
io 3663786355 2268252
th 63 362541 16645.121 3156.556 747.974 280.920 148.129 100.155 61.480 42.249 40.829 90.461
ra 256 1069115586 4089582 3055815 2625032 2228952 2114496 1983622 1765372 1743563 1610465 89609536
net 1634942152 0 1634971040 2214677
rpc 1630024431 0 0 0 0
proc2 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
proc3 22 1573543 1535237104 8743056 1545350887 1532645717 29571823 1179900114 9214599 6691508 538717 366274 0 2801854 39816 505310 4298 2486034 62181794 53164 2414727 0 986878
proc4 2 0 0
This somewhat arcane looking output is full of variously useful
statistics about your nfs daemon.
The "rc" (reply cache) field gives the number of cache hits, misses
and "nocache" (requests which bypassed the cache).
The "fh" (file handle) field's most important entry is the first - the
number of stale file handles in the system. If you have flaky NFS, for
example, this will be non-zero.
The "io" field is simply the cumulative I/O in bytes (read, and then written).
The "th" (threads) field is the most interesting field for NFS load
optimisation. The first entry is the total number of nfsd threads
currently running. The second is the number of times all threads were
in use at once (which means your NFS was maxed out in active
connections). The remaining 10 entries are a histogram of NFS thread
utilisation, in seconds (it seems to be hard to get NFS to reset this;
restarting the daemon definitely doesn't). Plotting this gives you an
idea of how much time your NFS server spends in various load states.
Ideally, you want the last entry (90-100% use) to be comfortably in
the tail of your distribution...
If you have indications that your server spends a lot of its time with
all threads in use, you should increase the maximum number of threads
- powers of 2 are recommended.
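If you'd rather not eyeball the histogram, here is a minimal Python sketch that turns the "th" line into fractions of time per utilisation band. The function names (parse_th, busy_fractions) are my own for illustration, and it assumes the 13-field layout shown in the sample output above; other kernels may lay this line out differently.

```python
def parse_th(line):
    """Split a 'th' line into (thread count, all-busy counter, histogram)."""
    fields = line.split()
    # Assumed layout: 'th', two integers, then 10 histogram buckets.
    if len(fields) != 13 or fields[0] != "th":
        raise ValueError("expected a 'th' line with 10 histogram buckets")
    threads = int(fields[1])
    all_busy = int(fields[2])  # second entry: all threads were in use
    histogram = [float(x) for x in fields[3:]]
    return threads, all_busy, histogram


def busy_fractions(histogram):
    """Normalise the histogram so the buckets sum to 1."""
    total = sum(histogram)
    if total == 0:
        return [0.0] * len(histogram)
    return [bucket / total for bucket in histogram]


# The 'th' line from the sample output, joined onto one line.
sample = ("th 63 362541 16645.121 3156.556 747.974 280.920 148.129 "
          "100.155 61.480 42.249 40.829 90.461")
threads, all_busy, histogram = parse_th(sample)
fractions = busy_fractions(histogram)
for i, frac in enumerate(fractions):
    # Bucket i covers roughly i*10% to (i+1)*10% thread utilisation.
    print(f"{i * 10:3d}-{(i + 1) * 10:3d}%: {frac:7.2%}")
```

On a live server you would read the line from /proc/net/rpc/nfsd instead of the hard-coded sample. If the last bucket holds a noticeable share, that is the cue to raise the thread count.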
The "ra" (read-ahead cache) field gives similar results, but for the
read-ahead cache. The first number is the size of the cache, the next
10 are a histogram showing how far into the cache entries were found
(so, the first number is the number of times an entry was read from
the first 10% of the cache), and the last is for cache misses.
Obviously, if you're getting a lot of cache misses *and* your cache
hits histogram is heavily right-skewed, it's worth increasing the
cache size. (Conversely, if you have a heavily left-skewed histogram,
and few cache misses, you may be able to manage with a smaller cache.)
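The same trick works for the "ra" line. This is a rough sketch, again with made-up function names (parse_ra, miss_fraction) and assuming the layout described above: cache size, ten depth buckets, then the miss count.

```python
def parse_ra(line):
    """Split an 'ra' line into (cache size, depth histogram, misses)."""
    fields = line.split()
    # Assumed layout: 'ra', cache size, 10 depth buckets, miss count.
    if len(fields) != 13 or fields[0] != "ra":
        raise ValueError("expected an 'ra' line with 10 depth buckets")
    cache_size = int(fields[1])
    depth_hits = [int(x) for x in fields[2:12]]
    misses = int(fields[12])
    return cache_size, depth_hits, misses


def miss_fraction(depth_hits, misses):
    """Fraction of read-ahead lookups that missed the cache entirely."""
    total = sum(depth_hits) + misses
    return misses / total if total else 0.0


# The 'ra' line from the sample output, joined onto one line.
sample = ("ra 256 1069115586 4089582 3055815 2625032 2228952 2114496 "
          "1983622 1765372 1743563 1610465 89609536")
cache_size, depth_hits, misses = parse_ra(sample)
miss = miss_fraction(depth_hits, misses)
# Share of all hits that were found in the first 10% of the cache.
first_decile = depth_hits[0] / sum(depth_hits)
print(f"cache size {cache_size}, miss fraction {miss:.2%}, "
      f"hits in first 10% of cache {first_decile:.2%}")
```

For the sample above, nearly all hits land in the first bucket, so this is the left-skewed case: the cache is not obviously too small.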
The remaining fields are rpc process info fields, which are less
relevant for our purposes.
2) Optimising your NFS.
The most important thing to ensure is that there are enough
resources for the peak load on your NFS service. NFS will spawn new
threads to handle new active connections, and if its max-threads limit
is too low, you'll get brown-outs under high load.
Starting at least four instances of nfsd per processor (and, on modern
processors, up to 8 per core) is recommended as a sensible
configuration. You can set this on the command line for the nfsd
service by simply using the bare number as an option.
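As a rough illustration of that rule of thumb, this Python sketch multiplies the core count by a per-core factor and rounds up to a power of 2. The function name suggest_nfsd_threads and the rounding choice are my own assumptions, not anything the nfsd tools provide.

```python
import os


def suggest_nfsd_threads(cores, per_core=4):
    """Return cores * per_core, rounded up to the next power of two."""
    if cores < 1 or per_core < 1:
        raise ValueError("cores and per_core must be positive")
    target = cores * per_core
    power = 1
    while power < target:
        power *= 2
    return power


# os.cpu_count() can return None, so fall back to a single core.
cores = os.cpu_count() or 1
print(f"{cores} cores: between {suggest_nfsd_threads(cores, 4)} and "
      f"{suggest_nfsd_threads(cores, 8)} nfsd threads")
```

Treat the result as a starting point only, and check it against the thread histogram under real load.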
And, of course, if you can bear the risk of data-loss (or silent data
corruption!) on sudden server loss, setting the export option "async"
trivially increases your network throughput by removing the need for
confirmation and syncing of writes between clients and server.
See the NFS config faq at:
http://nfs.sourceforge.net/#section_b
for more details.
You may also wish to do the standard setting of packet sizes with
respect to MTU that you would normally do for a network-based
protocol. The general process (and some more details) are covered at:
http://nfs.sourceforge.net/nfs-howto/ar01s05.html
Friday, November 06, 2009
Arc, and the installation
We've been fiddling with the NorduGrid Arc middleware a bit. Not just out of random curiosity, but more trying to get a handle on the workloads that it suits better than gLite, and vice versa. It does a number of things differently, and by running an Arc CE in parallel with an lcg-CE and CREAM, we can do some solid comparisons. Oh, and the name of the middleware is also much more amenable to puns, so expect a few groaners too.
So, consider this the first in a series. During this process, we expect to end up with a set of notes on how to install and run an Arc setup, for people already familiar with gLite.
Firstly, install. We took a blank SL5 box, added the NorduGrid repos, and then
yum groupinstall "ARC Server"
yum groupinstall "ARC Client"
Well, very nearly. There's one more thing needed, which is to add the EPEL dependencies (libVOMS is the key lib)
yum install yum-conf-epel
The next step is to configure it. That's all done in /etc/arc.conf, and is the subject for later posts.
There is a need for a filesystem shared between the CE and the worker nodes, so we fired up a spare disk server for NFS.
Startup is three services, already configured in /etc/init.d: gridftp, grid-infosys and grid-manager.
Ta-da! A running Arc CE.
Ok, so there's a fair bit glossed over in the configuration step. Next time, I'll talk about how I configured it to work with our existing queues - and where the expectations for Arc differ from gLite.
Friday, October 30, 2009
worker node on demand
Virtualisation is a hot topic again for grid services and worker nodes on demand.
KVM, Xen, VMware - everyone is using different ones.
Virtualisation for cloud - Nimbus, OpenNebula, Eucalyptus
The future... ??
1. Plain signed virtual images transported from site to site.
2. Virtual images including experiment software.
3. Connecting to pilot job frameworks, instantiated with virtual images.
4. Pilot frameworks replaced by commercial domain schedulers. Virtual clusters.
Monday, October 26, 2009
HEPIX is GO!
HEPIX Workshop
Site Reports Session
CERN:
Getting serious about ITIL. Solaris being phased out. Getting serious about 10GigE.
Lustre pilot project. New purchases discussed.
JLAB:
New LQCD Cluster "2009 Quad Infiniband - ARRA Cluster"
Storage - Whitebox 14 AMAX servers Solaris w/ZFS or Lustre
Compute - Dell PowerEdge R4102 x4 Ghz QDR Infiniband, 24GB RAM
Auger Cluster Upgraded
Nehalems - Intel x5530 dual cpu, quad core, 24GB RAM, 500GB SATA
(seeing i/o contention on disk when running 14/16 jobs)
OS Switch from Fedora 8 32bit, to CentOS 5.3 64bit
No real Grid Computing
IBM TS3500 tape library installed. StorageTek Powderhorn silos replaced.
80 production VMs on VMware ESX 3.5, with a planned move to vSphere 4.0
GSI:
FAIR - new accelerator discussion. The futuristic talk!
The Cube DataCentre Building: 1000 19" water cooled racks held in 26x26x26 cube building. Lifts to reach the machines. Iron structure for racks to sit on.
CC-IN2P3 LYON:
T1 for LHC & D0, BaBar. SL5 migration in Q2 2010 for both Main Cluster and MPI Cluster. New purchases and new server building.
STORAGE Session
Your File System NexGen OpenAFS (Jeffrey Altman):
YFS is now funded by the US Gov to create a next-gen OpenAFS. 2 year funding. Deliverables include an assessment of current AFS and a 2 year upgrade plan for the client and server. Still open source.
Storm and Lustre:
IOzone discussion, HammerCloud tests discussion, benchmarking summary. Good results, though performance was below the IOzone tests. WMS jobs and Panda jobs behave differently. file:// protocol support performs well but requires the VO to support it. Open questions: Lustre striping (yes or no?), performance (RAID config?), monitoring (still work to be done), support (kernel upgrades can take a while to be made available) and benchmarks (are they realistic?). Tuning still to do.
Lustre at GSI:
Users - ALICE analysis for Tier2, GSI experiments, FAIR simulations. Still on 1.6.7.2, 1PByte, > 3000 nodes. Foundry RX32 ethernet switch. MDS HA pair, one standby. 84 OSS, 200 OSTs. MDS 8 core, 3GHz Xeon, 32GB RAM. Real throughput testing with the ALICE analysis train: 50Gbit/s using 2000 cores. Hardware and software issues. Complex system and vulnerable to network communications. Using Robin Hood Filesystem Monitor for audit and management. This protects the MDS by directing requests (e.g. top ten users, file moves etc.) to a MySQL instance. Using this rather than e2scan.
Hadoop on your worker nodes using local hard drives & Fuse:
Hadoop compared against Lustre. Performed well when 8 jobs ran. Replication of files provides redundancy. Cost and maintenance factor very favourable to small sites. Deployed at some sites in the US. Not really a Tier 1 deployable solution. Name node redundancy exists (will lose at most one transaction) - requires additional software.
Virtualization Session
lxcloud at CERN:
CERN has developed a proof of concept for virtualized worker nodes. 'Golden nodes' serve images to the Xen hypervisors using OpenNebula. Also looked at Platform's VMO. Production lxcloud being built. 10 machines, 24GB, 2TB disk, dual Nehalem. Starting with Xen. Production release by March 2010. Memory is an issue as the hypervisor itself requires some memory, i.e. with 16GB RAM you cannot run 8 2GB VMs.
Fermigrid:
Has moved much of its infrastructure to the Xen hypervisor. Looks like a solid infrastructure. Investigating KVM with the possibility of a move in the next few years if it proves to be better. INFN covered Xen vs KVM at HEPiX Spring 2009, for a discussion of the differences.
Monday, October 19, 2009
Another new VO at Glasgow
Today I finally got time to create a new VO for our new users in Solid State Physics.
Our local wiki page on running CASTEP at Glasgow. Only the MPI version to get working now.
vo.ssp.ac.uk
This is now active across the cluster, and users can sign up to the VO from our VOMS server on svr029. It will be used to host users of CASTEP and other departmental SSP users.
Monday, October 12, 2009
CASTEP, A Test of True Grid
Along came another user with a requirement for MPI. Can we run it? Well, yes you can, but remember our interconnects are just plain old Ethernet and nothing fancy like Myrinet or Infiniband. We are not an HPC cluster but an HTC cluster.
So we have been building CASTEP, an f90 code, heavy on the MPI scatter/gather. A test of true grid for any HTC cluster. First off, CASTEP requires a minimum of make 3.81 and gfortran43. Handy that we moved to SL5, as these are now the standard. We also had to make sure that the required libs fftw3, blas and lapack were all built with the same compiler, gfortran43. This allowed the single core version to be built and installed onto the grid.
An MPI version is turning out to be a bit more work. First off, the old, outdated and no longer developed MPICH libs are not built with .f90 support enabled by default. So we got hold of the source to do a recompile with .f90 support on for gfortran43. There also appeared to be a bug in the gfortran support, so we had to modify the src rpm to include a patch that we located online. This allowed us to finally build the mpich lib. It has been tested by compiling an MPI job in C and in f90, both of which run successfully.
Unfortunately CASTEP still doesn't run using it, so more digging is required.
Thursday, September 24, 2009
gqsub at EGEE09
Just a short note from the EGEE 09 conference. It's been very gratifying to have had so much interest in gqsub at the conference - I even had emails about it scant hours after the poster was put up (and before the official poster session!).
In response to the comments received, I've put a roadmap of planned features up on the gqsub page, which gives an idea of where it's headed.
In addition, v 1.2.0 is out, which implements automatic staging back of output. This means that in cases where there is no shared filesystem between the UI and the worker node, but there is a GridFTP server on the UI, gqsub will pull out the JDL tricks we used earlier with the Lumerical deployment. This results in the illusion of a shared filesystem - the job is submitted, and the output appears in the right places as if it had been done on a shared filesystem.