What can we do?

Apart from upgrading the server to a newer kernel, we have a few options to try and decrease the chance of this happening again.

  1. Adjust overcommit settings

    At this point, the system will still allow userspace programs to use more memory when 4.2GB of memory is in use (the size of 3GB swap plus 60% of 2GB RAM). By that time, it will be slow as a sloth in a tar pit, and a request for new slab stands no chance at all.

    If we reduce the swap size to 512MB

    ... and set in /etc/sysctl.conf

    # 0 = default, 1 = malloc always succeeds, 2 = strict overcommit
    vm.overcommit_memory = 2
    # commit no more virtual address space than swap + 25% of RAM
    vm.overcommit_ratio = 25

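    (And also do echo 25 > /proc/sys/vm/overcommit_ratio to apply it right away, since /etc/sysctl.conf is only read at boot or on sysctl -p. The same goes for overcommit_memory.)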

    Then the commit limit is 512MB of swap plus 25% of 2GB RAM, i.e. 1GB. Once user space has committed that much, processes won't get any more. That reserves about 1GB for the kernel - read: slab.
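The arithmetic behind both figures can be checked with a small sketch. It assumes the strict-overcommit formula from the kernel documentation (swap plus overcommit_ratio percent of RAM, minus any huge pages); the function name commit_limit_mb is made up for illustration, and the 60% ratio is simply the figure quoted above.

```python
def commit_limit_mb(ram_mb, swap_mb, overcommit_ratio, hugetlb_mb=0.0):
    """Approximate commit limit in MB under vm.overcommit_memory = 2.

    Assumed formula: swap + (RAM - huge pages) * overcommit_ratio / 100.
    """
    return swap_mb + (ram_mb - hugetlb_mb) * overcommit_ratio / 100.0


if __name__ == "__main__":
    # Current setup described in the text: 2GB RAM, 3GB swap, 60% ratio.
    print(commit_limit_mb(2048, 3072, 60), "MB")
    # Proposed setup: 2GB RAM, 512MB swap, vm.overcommit_ratio = 25.
    print(commit_limit_mb(2048, 512, 25), "MB")
```

On a live system, the CommitLimit and Committed_AS lines in /proc/meminfo show the limit the kernel actually computed and how much of it is in use.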

  2. Increase default number of server threads

    After reading some more about NFS performance tuning, we could increase the default number of NFS server threads, although the 8 we have now aren't really busy most of the time. The surge of 50 clients was incidental, and we would probably need more than one CPU to make this useful. The setting lives in /etc/default/nfs-kernel-server:

    RPCNFSDCOUNT=30

    According to the same docs, we could also increase memory available to the request queue through /proc/sys/net/core/rmem_default and /proc/sys/net/core/rmem_max.

  3. More resources

    Bluntly adding RAM or CPUs to the machine doesn't seem to make much sense as long as the NFS bug is still using that up.

  4. Tuning Memory

    See tuning memory, particularly the part about reclaim ratios.

    We can also set /proc/sys/vm/swappiness to e.g. 20 instead of the usual 60. This should lead to the kernel swapping pages out less easily.

    Setting /proc/sys/vm/vfs_cache_pressure to 10000 or so instead of the usual 100 should lead to less persistent inode and dentry cache.

    Setting /proc/sys/vm/min_free_kbytes to 57510 instead of 5751 would probably have saved us from the first OOM killer occurrence. Whether it would've postponed it by more than a couple of minutes is another matter.
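Before changing any of these, it helps to see what the machine is currently running with. The following is a minimal sketch that reads the three tunables mentioned above from /proc/sys/vm and prints them next to the values proposed in this text. The helper names read_tunable and compare are hypothetical, and the proposed values are only the ones from this write-up, not general recommendations.

```python
from pathlib import Path

# Values proposed in the text above (assumptions specific to this server).
PROPOSED = {
    "swappiness": 20,
    "vfs_cache_pressure": 10000,
    "min_free_kbytes": 57510,
}


def read_tunable(name, base="/proc/sys/vm"):
    """Return the integer value of a vm tunable, or None if it cannot be read."""
    try:
        return int(Path(base, name).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def compare(proposed=PROPOSED, base="/proc/sys/vm"):
    """Map each tunable name to a (current value, proposed value) pair."""
    return {name: (read_tunable(name, base), want) for name, want in proposed.items()}


if __name__ == "__main__":
    for name, (current, want) in compare().items():
        print(f"vm.{name}: current={current} proposed={want}")
```

The base argument exists only so the functions can be pointed at a different directory for testing; the script never writes to /proc.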

    Also see vm.txt. There are various other parameters we could fiddle with. But I think it's likely that NFS is just behaving badly and hogging all slab. So it won't help us much.
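To check whether slab really is what eats the memory, one could track how much of RAM it holds over time. This is a rough sketch, assuming the usual "Name: value kB" layout of /proc/meminfo and its MemTotal and Slab fields. The function names parse_meminfo and slab_fraction are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def slab_fraction(info):
    """Return Slab / MemTotal, or None if either field is missing."""
    total = info.get("MemTotal")
    slab = info.get("Slab")
    if not total or slab is None:
        return None
    return slab / total


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        fraction = slab_fraction(parse_meminfo(f.read()))
    if fraction is None:
        print("Slab or MemTotal not found in /proc/meminfo")
    else:
        print(f"slab holds {fraction:.1%} of RAM")
```

Run from cron every minute or so, this would show whether slab keeps growing under NFS load regardless of the tunables above.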