What can we do?


Apart from upgrading the server to a newer kernel, we have a few options to try and decrease the chance of this happening again.

Adjust overcommit settings
At this point, the system will still allow userspace programs to use more memory when 4.2GB of memory is in use (the size of 3GB swap plus 60% of 2GB RAM). By that time, it will be slow as a sloth in a tar pit, and a request for new slab stands no chance at all.
If we reduce the swap size to 512MB
root@nfssrv:/home/apprentice# swapoff /dev/mapper/nfssrv-lvswap && echo ok
ok
root@nfssrv:/home/apprentice# mkswap /dev/mapper/nfssrv-lvswap 512000 && echo ok
ok
root@nfssrv:/home/apprentice# swapon -a
... and set in /etc/sysctl.conf
```
# 0 = default, 1 = malloc always succeeds, 2 = strict overcommit
vm.overcommit_memory = 2
# commit no more virtual address space than swap + 25% of RAM
vm.overcommit_ratio = 25
```
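(And also do echo 2 > /proc/sys/vm/overcommit_memory and echo 25 > /proc/sys/vm/overcommit_ratio, or run sysctl -p, so the settings take effect without a reboot.)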
Then once user space has committed 1GB of memory (512MB swap plus 25% of 2GB RAM), processes won't get any more. That leaves at least 1GB of RAM for the kernel - read: slab.
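The arithmetic behind both figures can be checked with a small sketch. It assumes the strict overcommit limit is simply swap plus the given percentage of RAM, uses the round numbers from the text (2GB RAM, 3GB or 512MB swap), and the function name commit_limit_mb is made up for illustration.

```shell
#!/bin/sh
# commit_limit_mb RAM_MB SWAP_MB RATIO
# Strict-overcommit limit in MB: swap + RATIO% of RAM (integer, rounded down).
# Hypothetical helper; the sizes passed below are the figures from the text.
commit_limit_mb() {
    ram_mb=$1
    swap_mb=$2
    ratio=$3
    echo $(( $swap_mb + $ram_mb * $ratio / 100 ))
}

# Before: 3GB swap, ratio 60 -> 4300 MB, about 4.2GB
commit_limit_mb 2048 3072 60
# After: 512MB swap, ratio 25 -> 1024 MB, i.e. 1GB
commit_limit_mb 2048 512 25
```

On the running system, grep -i commit /proc/meminfo shows CommitLimit next to Committed_AS, which should tell us how close we are to the limit.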
Increase default number of server threads
After reading some more about NFS performance tuning, we could increase the default number of NFS server threads, although the 8 we have now aren't really busy most of the time. The surge of 50 clients was incidental, and we might need more than one CPU to make this useful. In /etc/default/nfs-kernel-server:
```
RPCNFSDCOUNT=30
```
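To check that the running server actually picked up the new thread count, something like the following sketch could help. It assumes the nfsd filesystem is mounted so that /proc/fs/nfsd/threads exists; the function names are invented here, and both file paths can be overridden for testing.

```shell
#!/bin/sh
# configured_nfsd_threads [FILE]
# Print the RPCNFSDCOUNT value from a defaults file (last assignment wins).
configured_nfsd_threads() {
    file=${1:-/etc/default/nfs-kernel-server}
    awk -F= '/^[ \t]*RPCNFSDCOUNT=/ { gsub(/[^0-9]/, "", $2); v = $2 }
             END { if (v != "") print v }' "$file"
}

# running_nfsd_threads [FILE]
# Print the number of nfsd threads the kernel reports.
running_nfsd_threads() {
    cat "${1:-/proc/fs/nfsd/threads}"
}

# check_nfsd_threads [DEFAULTS_FILE] [THREADS_FILE]
# Report whether the configured and running thread counts agree.
check_nfsd_threads() {
    want=$(configured_nfsd_threads "$1")
    have=$(running_nfsd_threads "$2")
    if [ "$want" = "$have" ]; then
        echo "ok: $have threads running"
    else
        echo "mismatch: configured=$want running=$have"
    fi
}
```

Running check_nfsd_threads without arguments after restarting nfs-kernel-server should report ok. The count can also be changed on the fly with rpc.nfsd 30.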
According to the same docs, we could also increase memory available to the request queue through /proc/sys/net/core/rmem_default and /proc/sys/net/core/rmem_max.
More resources
Bluntly adding RAM or CPUs to the machine doesn't seem to make much sense as long as the NFS bug is still using that up.
Tuning Memory
See tuning memory, particularly the part about reclaim ratios.
We can also set /proc/sys/vm/swappiness to e.g. 20 instead of the usual 60. This should lead to the kernel swapping pages out less easily.
Setting /proc/sys/vm/vfs_cache_pressure to 10000 or so instead of the usual 100 should lead to less persistent inode and dentry cache.
Setting /proc/sys/vm/min_free_kbytes to 57510 instead of 5751 would probably have saved us from the first OOM killer occurrence. Whether it would've postponed it by more than a couple of minutes is another matter.
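Before changing anything, it may be worth listing the current values next to the ones suggested above. This is only a sketch: show_vm_tunables is a made-up name, and the optional directory argument exists purely so the function can be pointed at something other than /proc/sys/vm.

```shell
#!/bin/sh
# show_vm_tunables [DIR]
# Print current vs. suggested values for the three settings from the text.
# DIR defaults to /proc/sys/vm; the suggested values are the ones quoted above.
show_vm_tunables() {
    dir=${1:-/proc/sys/vm}
    for pair in swappiness=20 vfs_cache_pressure=10000 min_free_kbytes=57510; do
        name=${pair%%=*}   # part before the first '='
        want=${pair#*=}    # part after the first '='
        have=$(cat "$dir/$name")
        echo "vm.$name: current=$have suggested=$want"
    done
}
```

To make a change permanent, the corresponding vm.* line would go into /etc/sysctl.conf, just like the overcommit settings.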
Also see vm.txt. There are various other parameters we could fiddle with. But I think it's likely that NFS is just behaving badly and hogging all slab. So it won't help us much.
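To keep an eye on how much of the RAM slab is taking, a rough sketch like this could be run from cron or by hand. It relies only on the MemTotal and Slab lines of /proc/meminfo; the function name slab_percent is an assumption of this example.

```shell
#!/bin/sh
# slab_percent: read /proc/meminfo-style text on stdin and print slab usage
# as a whole-number percentage of MemTotal (rounded down).
slab_percent() {
    awk '/^MemTotal:/ { total = $2 }
         /^Slab:/     { slab = $2 }
         END { if (total > 0) printf "%d\n", slab * 100 / total }'
}
```

Usage: slab_percent < /proc/meminfo. For a per-cache breakdown, slabtop or /proc/slabinfo gives more detail.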