Best practices for KVM on NUMA servers

Tuneup

The Danger of a Strict Policy

The memory mode='strict' XML attribute tells the hypervisor to provide memory from the nodes in the specified nodeset and only from there. What happens if the nodeset cannot provide all of the requested memory? In other words, what happens in a memory over-commitment situation? The hypervisor starts to swap the guest out to disk to free up memory.
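For reference, the strict binding discussed here corresponds to a numatune block like the following in the guest's XML (a minimal sketch; the nodeset value is only an example):

<numatune>
  <memory mode='strict' nodeset='0'/>
</numatune>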

If the swap space is not big enough (which is frequently the case), the Out Of Memory killer (OOM killer) mechanism does its job and kills processes to reclaim memory. In the best case, only the guest that asked for the memory is killed, because the kernel expects to reclaim a lot of memory from it. In the worst case, the hypervisor itself is left badly damaged because other key processes have been killed.
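If you are curious which victim the kernel would pick, you can inspect the OOM score of a running qemu-kvm process on the hypervisor; the higher the score, the more likely that process is to be killed (a sketch; replace <PID> with the process ID returned by pgrep):

# pgrep -f qemu-kvm
# cat /proc/<PID>/oom_score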

The KVM documentation mentions this mechanism and asks system administrators to avoid memory over-commitment situations. It also references other documents that explain how to size or increase the hypervisor's swap space with a specific algorithm.

Real problems arise when you fall into this situation without knowing it. Consider the case where the vm1 guest is configured with 12GB of memory and pinned to node 0 with the default strict mode added silently by virsh during the editing operation. Everything should run fine, because node 0 can provide up to 15GB of memory (see the numactl -H output). However, as soon as an application or a second guest scheduled on node 0 consumes 2 or 3GB of memory, the entire hypervisor is at risk of failure.
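Before pinning a guest, it is therefore worth checking how much memory each node can still provide and how much the running guests already consume per node (a quick sketch using the standard numactl and numastat tools):

# numactl -H | grep free
# numastat -c qemu-kvm

The second command shows how much memory each running qemu-kvm process already takes from each NUMA node.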

You can test (at your own risk) this predictable behavior with the memtester [5] program. Launch it in a guest configured with a strict memory policy and ask memtester to test more memory than the nodeset can provide. Carefully watch the swap space and the kernel ring buffer (dmesg); you may see a message like: Out of memory: Kill process 29937 (qemu-kvm) score 865 or sacrifice child, followed by: Killed process 29937, UID 107, (qemu-kvm).
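A minimal sketch of such a test follows; 20G is only an example size and must exceed what the nodeset can provide while staying within the guest's RAM. Inside the guest:

# memtester 20G 1

On the hypervisor, in two other terminals:

# watch -n 1 free -m
# dmesg -w

The first terminal shows the swap space filling up; the second shows the OOM killer messages when they arrive.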

The default strict memory mode policy can therefore put both the guests and the hypervisor itself at risk, without any warning to the system administrator before the failure occurs.

Prefer the Preferred Memory Mode

If you want to bind KVM guests manually to specific NUMA nodes (CPU and memory), my advice is to use the preferred memory mode policy instead of the default strict mode. Preferred mode allows the hypervisor to provide memory from nodes other than those in the nodeset when needed. Your guests may not run at maximum performance once the hypervisor starts allocating memory from remote nodes, but at least you don't risk losing them to the OOM killer.
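If you prefer not to edit the XML by hand, the same policy can also be applied with virsh numatune (a sketch based on the vm1 example; --config writes the change to the persistent configuration, so it takes effect at the next guest start):

# virsh numatune vm1 --mode preferred --nodeset 0 --config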

The configuration file of the vm1 guest with 35GB of memory, bound to NUMA node 0 in preferred mode, looks like this:

# virsh edit vm1
...
  <currentMemory unit='KiB'>36700160</currentMemory>
  <vcpu placement='static' cpuset='0-9'>4</vcpu>
  <numatune>
    <memory mode='preferred' nodeset='0'/>
  </numatune>
...

This guest is in a memory over-commitment state but will not be punished when the memtester program asks for more memory than node 0 can provide.

In Figure 4, the small terminal at top left shows the preferred numa_mode as well as the vCPU bindings. The other small terminal displays the free memory per NUMA node. Observe that the memory comes from both nodes 0 and 1. Note that the same test in strict mode leads to the guest being punished.

Figure 4: Preferred policy mode avoids memory over-commitment.
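On a live system, you can query the same information shown in the two small terminals of Figure 4 from the command line (a sketch; vm1 is the example guest):

# virsh numatune vm1
# virsh vcpupin vm1
# numastat -c qemu-kvm

The first command reports the numa_mode and numa_nodeset in use, the second lists the vCPU-to-physical-CPU bindings, and the third shows the per-node memory consumption of the running qemu-kvm processes.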

Interleave Memory

When a memory nodeset contains several NUMA nodes under the strict or preferred policy, KVM uses a sequential algorithm to provide memory to the guests: when the first NUMA node in the nodeset runs short of memory, KVM allocates pages from the second node in the list, and so on. The problem with this approach is that the host memory rapidly becomes fragmented, and other guests or processes may suffer from this fragmentation.
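You can observe this filling pattern on the hypervisor while the guest allocates memory (a sketch; with a nodeset such as 0-2 you would see node 0 drain first, then node 1, and so on):

# watch -n 2 'numastat -m | grep -E "MemFree|MemUsed"'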

The interleave memory mode forces the hypervisor to provide equal chunks of memory from all the nodes of the nodeset at the same time. This approach ensures a much better use of the host memory and leads to better overall system performance. Besides this smooth memory use, interleave mode does not punish any process in the case of memory over-commitment; instead, it behaves like preferred mode and provides memory from other nodes when the nodeset runs out of memory.

In Figure 5, the memtester program requests 35GB from a nodeset composed of nodes 0, 1, and 2 under a strict policy. The amount of memory provided by those three nodes is approximately 13, 15, and 7GB, respectively. In Figure 6, the same request under an interleave policy shows an even memory consumption of roughly 11.5GB per node. The XML configuration for the interleave memory policy looks like this:

<numatune>
  <memory mode='interleave' nodeset='0-2'/>
</numatune>

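To confirm the policy the guest's qemu-kvm process is actually running with, you can also inspect its NUMA memory map in /proc (a sketch; the pgrep pattern assumes the guest name appears on the qemu-kvm command line, and <PID> stands for the returned process ID):

# pgrep -f vm1
# grep interleave /proc/<PID>/numa_maps

Memory regions governed by the interleave policy are tagged interleave:0-2 in that file.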

Figure 5: Strict memory mode leads to fragmentation.
Figure 6: Smooth use of memory with the interleave memory mode.
