
Tuesday, January 29, 2008

Keeping up with the Jones'

Well, we recently had an incident with our NFS server for the cluster (home / software) locking up and needing a cold power cycle. Due to $vendor's setup this takes aaaages (on the order of 20 mins) to go through the BIOS self-check (it hangs at 053C). $vendor would like to poke around the system and perhaps perform a BIOS upgrade. Hmm. Oh well, all 10 disk servers are identical, so we'll just drain one down and play - it also gives us a chance to upgrade DPM (from 1.6.5) to the latest 1.6.7-mumble.


... or so we thought.

disk032:~# rpm -qa | grep DPM
DPM-gridftp-server-1.6.7-1sec
DPM-rfio-server-1.6.7-2sec.slc3
DPM-client-1.6.7-2sec.slc3


"Thats odd - Graeme have you updated these?" nope - Turns out that yum.nightly cron was auto updating on both the disk servers and some of the grid servers... Gaaah. clickity click and we're all ready to play.

In the meantime, dpm-drain migrated most of the data off the server to the other stash of disks, but there were still 69 files that failed with 'Internal error'. I'm looking through the DB to see if I can pull any more info out.
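For the record, a query along these lines against the DPNS MySQL database should list the replicas still recorded on the drained server; the table and column names here are from memory rather than checked against the schema, so take it as a sketch:

# List replicas the name server still thinks live on disk032
# (assumes the standard cns_db schema and a read-only account).
mysql -u dpm_ro -p cns_db -e \
  "SELECT fileid, status, sfn FROM Cns_file_replica WHERE host = 'disk032';"

Matching the fileids back against Cns_file_metadata should then show which files dpm-drain choked on.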

Monday, October 01, 2007

NFS does TCP (unexpectedly)


We got alerted by our users to the fact that svr020 was going very, very slowly. When I logged in it took more than a minute, and a simple ls on one of the nfs areas was taking ~20s.

Naturally these damn things always happen when you're away and sitting with a 15% battery in a session which is actually of interest!

Anyway, I couldn't see what was wrong on svr020 - the sluggishness seemed symptomatic rather than tied to any bad process or user. There were clear periods when we hit serious CPU Wait. When I checked on disk037 I found many, many error messages like:

nfsd: too many open TCP sockets, consider increasing the number of nfsd threads
nfsd: last TCP connect from 10.141.0.101:816

When I checked on the worker nodes, sure enough, we seemed to have inadvertently switched to using tcp for nfs - I later found out this is the default mount option on 2.6 kernels (which we've obviously just switched to).
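If we ever want to pin the transport rather than leave it to the kernel default, the proto mount option does it; something like this in /etc/fstab (server name and paths are made up for illustration):

# Force the pre-2.6 behaviour with proto=udp, or be explicit with
# proto=tcp; host and paths here are illustrative only.
disk037:/export/home  /nfs/home  nfs  rw,hard,intr,proto=udp  0 0

The transport a node is actually using shows up in its /proc/mounts entry.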

I decided to follow the advice in the error message and run more nfsd threads, so I created a file, /etc/sysconfig/nfs, with RPCNFSDCOUNT=24 (the default is 8). I then (rather nervously) restarted nfs on disk037.
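For anyone replicating this, the change boils down to something like the following; the init script on our disk servers reads RPCNFSDCOUNT, though the exact restart command depends on the distro:

# Bump the kernel nfsd thread count from the default of 8 to 24.
echo 'RPCNFSDCOUNT=24' > /etc/sysconfig/nfs
service nfs restart

# Alternatively, rpc.nfsd can change the thread count on the fly,
# which avoids the (nervous) full restart.
rpc.nfsd 24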

Having done this, the load on svr020 returned to normal and the kernel error messages stopped on disk037. Whew!

Relief could clearly be seen, with the cluster load dropping back down to the number of running processes, instead of being "overloaded" as it had been.

We should use MonAMI to monitor the number of nfs tcp connections on disk037. (Just out of interest, the number now is 343, which must have been too many for 8 nfs server daemons; scaling up, 24 threads should cope with ~900 mounts.)
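A quick and dirty version of that count, which is roughly what we would have MonAMI graph, can be had from the command line on the disk server:

# Count established TCP connections to the nfsd port (2049) on disk037.
netstat -tn | awk '$4 ~ /:2049$/ && $6 == "ESTABLISHED"' | wc -l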

Of course, we could shift back to udp nfs mounts if we wanted. Reading through the NFS FAQ might help us decide.