OOM Killer active but memory not full

January 2013


Table of Contents

Hypothesis
What can we do?
Background reading

We 've got an NFS4 server, and it's read-only, as you can see in /etc/exports:

/exports           192.152.0.0/16(ro,async,root_squash,no_subtree_check,fsid=0,crossmnt)
/exports/netapps   192.152.0.0/16(ro,async,root_squash,no_subtree_check) 10.0.0.0/8(ro,async,root_squash,no_subtree_check) 172.16.0.0/12(ro,async,root_squash,no_subtree_check) 192.168.0.0/16(ro,async,root_squash,no_subtree_check)
    

The exported filesystem is ext4 on top of LVM, and the machine itself is an ESX VM with one CPU, 2GB of RAM and 3GB of swap.

One monday morning, the OOM Killer lashes out:

Jan 14 08:15:17 nfssrv kernel: [2151974.493200] nfsd invoked oom-killer: gfp_mask=0x40d0, order=1, oom_adj=0
Jan 14 08:15:18 nfssrv kernel: [2151974.493208] nfsd cpuset=/ mems_allowed=0
Jan 14 08:15:18 nfssrv kernel: [2151974.493214] Pid: 1849, comm: nfsd Not tainted 2.6.32-5-amd64 #1
Jan 14 08:15:18 nfssrv kernel: [2151974.493217] Call Trace:
Jan 14 08:15:18 nfssrv kernel: [2151974.493229]  [<ffffffff810b643c>] ? oom_kill_process+0x7f/0x23f
Jan 14 08:15:18 nfssrv kernel: [2151974.493237]  [<ffffffff8106bb5e>] ? timekeeping_get_ns+0xe/0x2e
Jan 14 08:15:18 nfssrv kernel: [2151974.493242]  [<ffffffff810b6960>] ? __out_of_memory+0x12a/0x141
Jan 14 08:15:18 nfssrv kernel: [2151974.493247]  [<ffffffff810b6ab7>] ? out_of_memory+0x140/0x172
Jan 14 08:15:18 nfssrv kernel: [2151974.493253]  [<ffffffff810ba81c>] ? __alloc_pages_nodemask+0x4ec/0x5fc
Jan 14 08:15:18 nfssrv kernel: [2151974.493264]  [<ffffffff810e6e7e>] ? new_slab+0x5b/0x1ca
Jan 14 08:15:18 nfssrv kernel: [2151974.493271]  [<ffffffff810e71dd>] ? __slab_alloc+0x1f0/0x39b
Jan 14 08:15:18 nfssrv kernel: [2151974.493285]  [<ffffffffa03e6170>] ? xs_setup_xprt+0xd7/0x151 [sunrpc]
Jan 14 08:15:18 nfssrv kernel: [2151974.493290]  [<ffffffff810e84bf>] ? __kmalloc+0xf1/0x141
Jan 14 08:15:18 nfssrv kernel: [2151974.493300]  [<ffffffffa03e6170>] ? xs_setup_xprt+0xd7/0x151 [sunrpc]
Jan 14 08:15:18 nfssrv kernel: [2151974.493309]  [<ffffffffa03e6170>] ? xs_setup_xprt+0xd7/0x151 [sunrpc]
Jan 14 08:15:18 nfssrv kernel: [2151974.493317]  [<ffffffff812432ba>] ? sock_common_recvmsg+0x30/0x45
Jan 14 08:15:18 nfssrv kernel: [2151974.493327]  [<ffffffffa03e65e2>] ? xs_setup_tcp+0x15/0x1f4 [sunrpc]
Jan 14 08:15:18 nfssrv kernel: [2151974.493337]  [<ffffffffa03e42e2>] ? xprt_create_transport+0x32/0x23a [sunrpc]
Jan 14 08:15:18 nfssrv kernel: [2151974.493346]  [<ffffffffa03e3cf3>] ? rpc_create+0xbc/0x4f1 [sunrpc]
Jan 14 08:15:18 nfssrv kernel: [2151974.493352]  [<ffffffff81065046>] ? autoremove_wake_function+0x0/0x2e
Jan 14 08:15:18 nfssrv kernel: [2151974.493383]  [<ffffffffa04cc5c9>] ? setup_callback_client+0x101/0x13d [nfsd]
Jan 14 08:15:18 nfssrv kernel: [2151974.493394]  [<ffffffffa04cc61c>] ? nfsd4_probe_callback+0x17/0x57 [nfsd]
Jan 14 08:15:18 nfssrv kernel: [2151974.493405]  [<ffffffffa04c8eab>] ? nfsd4_setclientid_confirm+0x2b0/0x33e [nfsd]
Jan 14 08:15:18 nfssrv kernel: [2151974.493416]  [<ffffffffa04bc729>] ? nfsd4_proc_compound+0x1fc/0x3b2 [nfsd]
Jan 14 08:15:18 nfssrv kernel: [2151974.493426]  [<ffffffffa04ae329>] ? nfsd_dispatch+0xdd/0x1b9 [nfsd]
Jan 14 08:15:18 nfssrv kernel: [2151974.493439]  [<ffffffffa03eb563>] ? svc_process+0x403/0x627 [sunrpc]
Jan 14 08:15:18 nfssrv kernel: [2151974.493448]  [<ffffffffa04ae772>] ? nfsd+0x0/0x12e [nfsd]
Jan 14 08:15:18 nfssrv kernel: [2151974.493457]  [<ffffffffa04ae857>] ? nfsd+0xe5/0x12e [nfsd]
Jan 14 08:15:18 nfssrv kernel: [2151974.493462]  [<ffffffff81064d79>] ? kthread+0x79/0x81
Jan 14 08:15:18 nfssrv kernel: [2151974.493469]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
Jan 14 08:15:18 nfssrv kernel: [2151974.493473]  [<ffffffff81064d00>] ? kthread+0x0/0x81
Jan 14 08:15:18 nfssrv kernel: [2151974.493478]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20
Jan 14 08:15:18 nfssrv kernel: [2151974.493480] Mem-Info:
Jan 14 08:15:18 nfssrv kernel: [2151974.493483] Node 0 DMA per-cpu:
Jan 14 08:15:18 nfssrv kernel: [2151974.493488] CPU    0: hi:    0, btch:   1 usd:   0
Jan 14 08:15:18 nfssrv kernel: [2151974.493490] Node 0 DMA32 per-cpu:
Jan 14 08:15:18 nfssrv kernel: [2151974.493494] CPU    0: hi:  186, btch:  31 usd:  64
Jan 14 08:15:18 nfssrv kernel: [2151974.493502] active_anon:0 inactive_anon:34 isolated_anon:0
Jan 14 08:15:18 nfssrv kernel: [2151974.493504]  active_file:65 inactive_file:274 isolated_file:0
Jan 14 08:15:18 nfssrv kernel: [2151974.493506]  unevictable:0 dirty:32 writeback:0 unstable:0
Jan 14 08:15:18 nfssrv kernel: [2151974.493507]  free:37141 slab_reclaimable:62795 slab_unreclaimable:409427
Jan 14 08:15:18 nfssrv kernel: [2151974.493509]  mapped:19 shmem:0 pagetables:668 bounce:0
Jan 14 08:15:18 nfssrv kernel: [2151974.493512] Node 0 DMA free:8036kB min:40kB low:48kB high:60kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15304kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:32kB slab_unreclaimable:7804kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jan 14 08:15:18 nfssrv kernel: [2151974.493529] lowmem_reserve[]: 0 2004 2004 2004
Jan 14 08:15:18 nfssrv kernel: [2151974.493535] Node 0 DMA32 free:140528kB min:5704kB low:7128kB high:8556kB active_anon:0kB inactive_anon:136kB active_file:260kB inactive_file:1096kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052256kB mlocked:0kB dirty:128kB writeback:0kB mapped:76kB shmem:0kB slab_reclaimable:251148kB slab_unreclaimable:1629904kB kernel_stack:904kB pagetables:2672kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:98 all_unreclaimable? yes
Jan 14 08:15:18 nfssrv kernel: [2151974.493552] lowmem_reserve[]: 0 0 0 0
Jan 14 08:15:18 nfssrv kernel: [2151974.493557] Node 0 DMA: 3*4kB 1*8kB 1*16kB 2*32kB 2*64kB 1*128kB 2*256kB 2*512kB 2*1024kB 2*2048kB 0*4096kB = 8036kB
Jan 14 08:15:18 nfssrv kernel: [2151974.493570] Node 0 DMA32: 34452*4kB 300*8kB 4*16kB 0*32kB 0*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 140528kB
Jan 14 08:15:18 nfssrv kernel: [2151974.493582] 380 total pagecache pages
Jan 14 08:15:18 nfssrv kernel: [2151974.493584] 25 pages in swap cache
Jan 14 08:15:18 nfssrv kernel: [2151974.493587] Swap cache stats: add 2099460, delete 2099435, find 543315/1176734
Jan 14 08:15:18 nfssrv kernel: [2151974.493590] Free swap  = 3465396kB
Jan 14 08:15:18 nfssrv kernel: [2151974.493592] Total swap = 3489784kB
Jan 14 08:15:18 nfssrv kernel: [2151974.507001] 524272 pages RAM
Jan 14 08:15:18 nfssrv kernel: [2151974.507005] 9054 pages reserved
Jan 14 08:15:18 nfssrv kernel: [2151974.507007] 319 pages shared
Jan 14 08:15:18 nfssrv kernel: [2151974.507009] 477519 pages non-shared
Jan 14 08:15:18 nfssrv kernel: [2151974.507014] Out of memory: kill process 11440 (pickup) score 8346 or a child
Jan 14 08:15:18 nfssrv kernel: [2151974.507225] Killed process 11440 (pickup)

In the half hour that follows, the OOM killer is provoked again and again, until things get so bad I have to restart the server by pushing the (virtual) button.

I did not expect this. /proc/sys/vm/overcommit_memory has a value of 2, and /proc/sys/vm/overcommit_ratio is 80. So I'd think that only when memory usage would be swap+80%RAM, the OOM killer would wake up. As you can see, hardly any swap was in use.

After reboot, ext4_inode_cache in slabtop output soon goes to 700MB, and dentry to 100MB:

nfssrv:/home/apprentice# slabtop -o|egrep 'dentry|ext4_inode'
851781 851775  99%    0.19K  40561       21    162244K dentry                 
707920 707920 100%    0.96K  88490        8    707920K ext4_inode_cache       
nfssrv:/home/apprentice# sync && for (( i = 1 ; i < 4 ; i++ )) ; do echo $i > /proc/sys/vm/drop_caches ; sleep .1 ; done
nfssrv:/home/apprentice# slabtop -o|egrep 'dentry|ext4_inode'
 23352  10041  42%    0.19K   1112       21      4448K dentry                 
  2888   1298  44%    0.96K    361        8      2888K ext4_inode_cache 

That's ok, this is a file server, so it should cache the FS.

Judging from the logs above, an NFS kernel thread has provoked the OOM Killer while trying to get more slab. Prior to the OOM event, the NFS server had more traffic than during the night:

nfssrv:/home/apprentice# sar -n NFSD
Linux 2.6.32-5-amd64 (nfssrv)    01/14/2013      _x86_64_        (1 CPU)

12:00:02 AM   scall/s badcall/s  packet/s     udp/s     tcp/s     hit/s    miss/s   sread/s  swrite/s saccess/s sgetatt/s
<snip>
05:35:02 AM    131.74      0.00    131.74      0.00    131.74      0.00      0.00    129.83      0.00      0.48      1.81
05:45:02 AM    131.71      0.00    131.71      0.00    131.71      0.00      0.00    131.46      0.00      0.03      0.08
05:55:02 AM    134.57      0.00    134.57      0.00    134.57      0.00      0.00    132.71      0.00      0.47      1.78
06:05:02 AM    143.58      0.00    143.59      0.00    143.58      0.00      0.00    141.70      0.00      0.47      1.80
06:15:03 AM    125.96      0.00    125.96      0.00    125.96      0.00      0.00    125.71      0.00      0.03      0.08
06:25:03 AM    133.80      0.00    133.80      0.00    133.80      0.00      0.00    131.94      0.00      0.46      1.78
06:35:02 AM    127.72      0.00    127.72      0.00    127.72      0.00      0.00    125.84      0.00      0.47      1.80
06:45:02 AM    124.29      0.00    124.29      0.00    124.29      0.00      0.00    124.03      0.00      0.03      0.10
06:55:02 AM    119.01      0.00    119.01      0.00    119.01      0.00      0.00    117.13      0.00      0.47      1.79
07:05:02 AM     40.60      0.00     40.60      0.00     40.60      0.00      0.00     38.68      0.00      0.49      1.84
07:15:03 AM      6.95      0.00      6.95      0.00      6.95      0.00      0.00      6.69      0.00      0.03      0.09
07:25:02 AM     41.31      0.00     41.31      0.00     41.31      0.00      0.00     39.40      0.00      0.48      1.83
07:35:05 AM     28.62      0.00     28.62      0.00     28.62      0.00      0.00     26.70      0.00      0.49      1.84
07:45:02 AM     31.13      0.00     31.13      0.00     31.13      0.00      0.00     30.85      0.00      0.04      0.12
07:55:03 AM    600.85      0.00    600.85      0.00    600.85      0.00      0.00     32.10      0.00     82.86    477.65
08:05:03 AM   1330.81      0.00   1330.84      0.00   1330.80      0.00      0.00     24.82      0.00    219.06   1281.26
08:15:03 AM   1764.18      0.00   1764.24      0.00   1764.17      0.00      0.00     24.69      0.00    294.29   1701.70
08:25:07 AM   1675.89      0.00   1675.90      0.00   1675.89      0.00      0.00     97.18      0.00    242.91   1422.20
08:35:04 AM   2139.22      0.00   2139.22      0.00   2139.22      0.00      0.00     23.80      0.00    353.34   2051.31
08:45:13 AM   1517.72      0.00   1517.73      0.00   1517.72      0.00      0.00     22.08      0.00    237.56   1399.89
08:55:19 AM   1754.02      0.00   1754.02      0.00   1754.02      0.00      0.00     12.05      0.00    286.85   1653.07
Average:       293.62      0.00    293.63      0.00    293.62      0.00      0.00     94.70      0.00     32.47    188.32

09:27:24 AM       LINUX RESTART

09:35:01 AM   scall/s badcall/s  packet/s     udp/s     tcp/s     hit/s    miss/s   sread/s  swrite/s saccess/s sgetatt/s
09:45:02 AM    583.65      0.00    583.65      0.00    583.65      0.00      0.00     62.80      0.00     75.15    438.14
09:55:01 AM    147.19      0.00    147.19      0.00    147.19      0.00      0.00     70.28      0.00     12.16     63.67
10:05:01 AM    671.96      0.00    671.96      0.00    671.96      0.00      0.00     95.46      0.00     86.02    483.80
10:15:01 AM    752.61      0.00    752.61      0.00    752.61      0.00      0.00    100.95      0.00     92.39    550.58
10:25:01 AM    208.10      0.00    208.10      0.00    208.10      0.00      0.00    106.95      0.00     15.65     86.23

Later on I see in my rootmail that between 08:15:21 and 08:15:31, fifty NFS clients request the same file. But the rise in traffic had already started as early as 7:55:03.

I/O goes up too:

nfssrv:/home/apprentice# sar -f ~/sa14 -b -s 06:00:00 -e 10:00:00
Linux 2.6.32-5-amd64 (nfssrv)    01/14/2013      _x86_64_        (1 CPU)

06:05:02 AM       tps      rtps      wtps   bread/s   bwrtn/s
06:15:03 AM    115.00    111.68      3.32   8204.66     42.36
06:25:03 AM    129.68    123.30      6.38   9123.76     76.53
06:35:02 AM    266.88    250.55     16.33  10785.91    218.79
06:45:02 AM    117.14    113.92      3.22   8755.04     40.89
06:55:02 AM    131.45    124.63      6.83   8834.95     81.90
07:05:02 AM     63.23     57.17      6.06   3694.20     72.38
07:15:03 AM     13.74     12.29      1.45    683.47     17.95
07:25:02 AM     80.82     73.73      7.09   4092.92     85.30
07:35:05 AM     70.50     64.70      5.80   3065.11     69.16
07:45:02 AM     58.16     54.82      3.35   3138.49     42.71
07:55:03 AM    558.41    550.25      8.16   9987.06     96.02
08:05:03 AM   1074.79   1066.87      7.91  15100.50     88.75
08:15:03 AM   1605.11   1596.17      8.93  21859.32     97.98
08:25:07 AM   1520.93   1510.66     10.27  28636.57    111.14
08:35:04 AM   2037.42   2023.51     13.91  27100.50    152.89
08:45:13 AM   1270.55   1263.13      7.42  17039.51     87.77
08:55:19 AM   1566.63   1544.13     22.50  19718.10    259.13
Average:       629.25    621.07      8.18  11764.74     96.62

09:27:24 AM       LINUX RESTART

09:35:01 AM       tps      rtps      wtps   bread/s   bwrtn/s
09:45:02 AM      1.65      0.35      1.30     28.69     14.05
09:55:01 AM     23.47      8.70     14.77    380.06    197.41
Average:        12.55      4.52      8.03    204.08    105.58

but wait time stays reasonable:

nfssrv:/home/apprentice# sar -f ~/sa14 -s 06:00:00 -e 10:00:00 -d |egrep 'DEV|254-2|RESTART|^$'

06:05:02 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
06:15:03 AM  dev254-2     46.73   3710.40      0.00     79.40      0.10      2.12      1.15      5.39
06:25:03 AM  dev254-2     49.73   4040.60      0.00     81.26      0.13      2.55      1.40      6.95
06:35:02 AM  dev254-2    118.68   4484.30      0.00     37.79      0.49      4.10      0.93     11.05
06:45:02 AM  dev254-2     48.03   3971.01      0.00     82.69      0.12      2.52      1.31      6.30
06:55:02 AM  dev254-2     45.69   3823.96      0.00     83.69      0.13      2.76      1.44      6.60
07:05:02 AM  dev254-2     17.08   1336.11      0.00     78.23      0.06      3.76      2.14      3.66
07:15:03 AM  dev254-2      2.94    227.87      0.00     77.51      0.02      5.67      2.57      0.76
07:25:02 AM  dev254-2     23.61   1449.37      0.00     61.39      0.07      3.00      1.66      3.92
07:35:05 AM  dev254-2     16.33    986.10      0.00     60.38      0.07      4.49      2.60      4.25
07:45:02 AM  dev254-2     18.77   1173.61      0.00     62.52      0.05      2.83      1.49      2.80
07:55:03 AM  dev254-2    380.33   4155.44      0.05     10.93      1.20      3.15      0.45     17.07
08:05:03 AM  dev254-2    743.06   6853.74      0.05      9.22      3.24      4.36      0.55     41.02
08:15:03 AM  dev254-2   1145.24  10039.19      0.05      8.77      4.60      4.01      0.34     38.97
08:25:07 AM  dev254-2   1097.08  13148.82      0.37     11.99      5.53      5.04      0.37     40.51
08:35:04 AM  dev254-2   1466.55  12545.66      1.54      8.56      8.42      5.74      0.41     60.71
08:45:13 AM  dev254-2    910.94   8008.72     14.38      8.81      3.27      3.59      0.39     35.69
08:55:19 AM  dev254-2   1105.53   9200.08     78.08      8.39      5.99      5.42      0.44     48.74
Average:     dev254-2    426.41   5249.78      5.61     12.32      1.97      4.62      0.46     19.69

09:27:24 AM       LINUX RESTART

09:35:01 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
09:45:02 AM  dev254-2      0.12      9.60      0.04     80.33      0.00     17.44      9.56      0.11
09:55:01 AM  dev254-2      0.00      0.13      0.01     44.00      0.00     60.00     60.00      0.02
Average:     dev254-2      0.06      4.87      0.03     79.35      0.00     18.59     10.92      0.07

There is hardly any movement in the paging:

nfssrv:/home/apprentice# sar -f ~/sa14 -s 06:00:00 -e 10:00:00 -R
Linux 2.6.32-5-amd64 (nfssrv)    01/14/2013      _x86_64_        (1 CPU)

06:05:02 AM   frmpg/s   bufpg/s   campg/s
06:15:03 AM      0.84     -0.01     -0.67
06:25:03 AM     -0.45     -0.00     -0.27
06:35:02 AM     -0.20      0.01     -0.01
06:45:02 AM     -0.09      0.00      0.35
06:55:02 AM     -0.17     -0.01     -0.21
07:05:02 AM     -0.79      0.03      0.54
07:15:03 AM      0.72     -0.02     -0.95
07:25:02 AM     -1.58      0.01      1.55
07:35:05 AM     -1.29      0.01      0.82
07:45:02 AM      2.19     -0.01     -2.18
07:55:03 AM      0.44      0.41     -0.84
08:05:03 AM     -1.72     -0.34      1.43
08:15:03 AM      0.99      0.05     -1.36
08:25:07 AM      0.55      0.23     -0.46
08:35:04 AM     -1.08      0.36      0.50
08:45:13 AM      1.03     -0.72     -0.41
08:55:19 AM     -0.98      0.74      0.42
Average:        -0.09      0.04     -0.10

09:27:24 AM       LINUX RESTART

09:35:01 AM   frmpg/s   bufpg/s   campg/s
09:45:02 AM     -2.33      0.23      1.79
09:55:01 AM    -33.36      2.46     25.36
Average:       -17.82      1.34     13.56

CPU usage shows more and more time spent by the kernel from 7:55 onwards:

nfssrv:/home/apprentice# sar -f ~/sa14 -s 06:00:00 -e 10:00:00 -u
Linux 2.6.32-5-amd64 (nfssrv)    01/14/2013      _x86_64_        (1 CPU)

06:05:02 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
06:15:03 AM     all      0.01      0.00      3.54     77.09      0.00     19.36
06:25:03 AM     all      0.01      0.00      3.83     80.60      0.00     15.57
06:35:02 AM     all      0.22      0.00      3.94     80.94      0.00     14.90
06:45:02 AM     all      0.01      0.00      3.48     81.13      0.00     15.39
06:55:02 AM     all      0.05      0.00      3.40     76.04      0.00     20.51
07:05:02 AM     all      0.01      0.00      1.20     36.21      0.00     62.59
07:15:03 AM     all      0.00      0.00      0.20      7.30      0.00     92.50
07:25:02 AM     all      0.01      0.00      1.15     38.09      0.00     60.75
07:35:05 AM     all      0.08      0.00      0.83     30.92      0.00     68.17
07:45:02 AM     all      0.01      0.00      0.88     31.76      0.00     67.35
07:55:03 AM     all      0.01      0.00     10.08     57.73      0.00     32.17
08:05:03 AM     all      0.02      0.00     19.03     68.92      0.00     12.04
08:15:03 AM     all      0.01      0.00     23.93     60.30      0.00     15.75
08:25:07 AM     all      0.09      0.00     23.15     54.33      0.00     22.43
08:35:04 AM     all      0.05      0.00     28.71     71.20      0.00      0.04
08:45:13 AM     all      0.08      0.00     21.22     66.21      0.00     12.49
08:55:19 AM     all      0.14      0.00     23.54     70.15      0.00      6.17
Average:        all      0.05      0.00     10.14     58.18      0.00     31.64

09:27:24 AM       LINUX RESTART

09:35:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:45:02 AM     all      0.09      0.00      6.86      0.20      0.00     92.85
09:55:01 AM     all      0.68      0.03      7.30      2.16      0.00     89.84
Average:        all      0.38      0.01      7.08      1.18      0.00     91.35

Context switching goes rather crazy:

nfssrv:/home/apprentice# sar -f ~/sa14 -s 06:00:00 -e 10:00:00 -w
Linux 2.6.32-5-amd64 (nfssrv)    01/14/2013      _x86_64_        (1 CPU)

06:05:02 AM    proc/s   cswch/s
06:15:03 AM      0.02    369.07
06:25:03 AM      0.03    401.97
06:35:02 AM      0.24    486.13
06:45:02 AM      0.02    366.99
06:55:02 AM      0.02    376.08
07:05:02 AM      0.02    172.42
07:15:03 AM      0.01     50.93
07:25:02 AM      0.02    216.06
07:35:05 AM      0.02    171.07
07:45:02 AM      0.02    164.47
07:55:03 AM      0.02   2690.00
08:05:03 AM      0.02   5684.55
08:15:03 AM      0.02   6854.36
08:25:07 AM      0.03   6157.04
08:35:04 AM      0.02   8164.11
08:45:13 AM      0.03   6409.18
08:55:19 AM      0.02   7112.54
Average:         0.03   2702.19

09:27:24 AM       LINUX RESTART

09:35:01 AM    proc/s   cswch/s
09:45:02 AM      0.03   2101.04
09:55:01 AM      0.60    433.35
Average:         0.31   1268.57

... with a slight increase in things that need to be swapped back in:

nfssrv:/home/apprentice# sar -f ~/sa14 -s 06:00:00 -e 10:00:00 -W
Linux 2.6.32-5-amd64 (nfssrv)    01/14/2013      _x86_64_        (1 CPU)

06:05:02 AM  pswpin/s pswpout/s
06:15:03 AM      5.76      2.16
06:25:03 AM      7.05      2.67
06:35:02 AM      8.06      3.45
06:45:02 AM      5.58      2.06
06:55:02 AM      7.80      3.00
07:05:02 AM      7.39      2.74
07:15:03 AM      2.19      0.70
07:25:02 AM      8.23      3.17
07:35:05 AM      7.15      2.51
07:45:02 AM      5.63      2.21
07:55:03 AM      9.60      3.76
08:05:03 AM      8.28      3.13
08:15:03 AM     10.11      3.77
08:25:07 AM      9.90      3.18
08:35:04 AM     10.22      4.49
08:45:13 AM      6.79      2.68
08:55:19 AM      4.53      2.18
Average:         7.31      2.81

09:27:24 AM       LINUX RESTART

09:35:01 AM  pswpin/s pswpout/s
09:45:02 AM      0.00      0.00
09:55:01 AM      0.00      0.00
Average:         0.00      0.00

And the number of files/inodes in use doesn't change much:

nfssrv:/home/apprentice# sar -f ~/sa14 -s 06:00:00 -e 10:00:00 -v
Linux 2.6.32-5-amd64 (nfssrv)    01/14/2013      _x86_64_        (1 CPU)

06:05:02 AM dentunusd   file-nr  inode-nr    pty-nr
06:15:03 AM       105      3360    230579       276
06:25:03 AM       121      3392    230688       276
06:35:02 AM       128      3360    230824       276
06:45:02 AM       106      3360    230759       276
06:55:02 AM       123      3360    230870       276
07:05:02 AM       124      3360    231000       276
07:15:03 AM       105      3360    230939       276
07:25:02 AM       118      3360    231041       276
07:35:05 AM       129      3360    231191       276
07:45:02 AM       105      3360    231132       276
07:55:03 AM       268      3360    231470       276
08:05:03 AM       136      3360    231428       276
08:15:03 AM        96      3360    231449       276
08:25:07 AM       103      3200    231576       276
08:35:04 AM       235      3072    231926       276
08:45:13 AM       111      3072    231709       276
08:55:19 AM       451      2848    232090       276
Average:          151      3288    231216       276

09:27:24 AM       LINUX RESTART

09:35:01 AM dentunusd   file-nr  inode-nr    pty-nr
09:45:02 AM    784621      2784    714086       242
09:55:01 AM    789455      2880    718061       245
Average:       787038      2832    716074       244

RAM usage deserves closer inspection:

nfssrv:/home/apprentice# sar -f ~/sa14 -s 06:00:00 -e 10:00:00 -r
Linux 2.6.32-5-amd64 (nfssrv)    01/14/2013      _x86_64_        (1 CPU)

06:05:02 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
06:15:03 AM    150348   1910780     92.71       268      4956     82072      1.48
06:25:03 AM    149260   1911868     92.76       260      4316     83556      1.51
06:35:02 AM    148772   1912356     92.78       276      4300     82952      1.49
06:45:02 AM    148556   1912572     92.79       276      5136     81420      1.47
06:55:02 AM    148152   1912976     92.81       252      4640     82952      1.49
07:05:02 AM    146252   1914876     92.90       316      5936     82072      1.48
07:15:03 AM    147972   1913156     92.82       264      3640     82072      1.48
07:25:02 AM    144184   1916944     93.00       276      7356     82072      1.48
07:35:05 AM    141080   1920048     93.16       296      9328     82204      1.48
07:45:02 AM    146304   1914824     92.90       272      4124     83084      1.50
07:55:03 AM    147356   1913772     92.85      1248      2104     82204      1.48
08:05:03 AM    143232   1917896     93.05       444      5524     82208      1.48
08:15:03 AM    145616   1915512     92.94       552      2260     82204      1.48
08:25:07 AM    146936   1914192     92.87      1100      1156     53528      0.96
08:35:04 AM    144368   1916760     93.00      1956      2360     51280      0.92
08:45:13 AM    146868   1914260     92.87       208      1368     50400      0.91
08:55:19 AM    144492   1916636     92.99      2004      2388      6680      0.12
Average:       146456   1914672     92.89       604      4170     72527      1.31

09:27:24 AM       LINUX RESTART

09:35:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
09:45:02 AM    691976   1369152     66.43    388012     49216     63264      1.14
09:55:01 AM    612160   1448968     70.30    393908    109892     77664      1.40
Average:       652068   1409060     68.36    390960     79554     70464      1.27

That is to say: memory usage is nearly 2GB, but there is still 140MB free. That free space however, is in 4kb slab blocks (as can be read from the OOM killer log above). Only about 80MB is used as cache.

The fact that the OOM killer is triggered without memory being full is easily explained: the NFS kernel server runs in kernel threads. Judging from the logs, it asks for slab, which is never paged out (see Chapter 10 of these excellent Kernel Memory docs: slab is not part of any LRU). Apparently the NFS server needs something larger that 4kb blocks, and to get it, the kernel sends the OOM killer.

It isn't completely clear to me what is using all the memory. The log says that about 1.6GB is unreclaimable slab. Let's split that usage of zone DMA32 out:

free                     140528kB
min                        5704kB
low                        7128kB
high                       8556kB
active_anon                   0kB
inactive_anon               136kB
active_file                 260kB
inactive_file              1096kB
unevictable                   0kB
isolated(anon)                0kB
isolated(file)                0kB
present                 2052256kB
mlocked                       0kB
dirty                       128kB
writeback                     0kB
mapped                       76kB
shmem                         0kB
slab_reclaimable         251148kB
slab_unreclaimable      1629904kB
kernel_stack                904kB
pagetables                 2672kB
unstable                      0kB
bounce                        0kB
writeback_tmp                 0kB
pages_scanned   98

Without fully understanding, I'm guessing that slab fragmentation is what kills us here. NFS needs chunks larger than 4kb (see /proc/slabinfo). The system can only reclaim pages of slab if all the slab in the page is unused. If slab-in-use is scattered over the pages, this may easily not be the case.

This is bad enough if cause by the system caching inodes and dentries alone. To make things worse, NFS also is too old, so we are likely also suffering from bug 1047566.