Best practices for KVM on NUMA servers
Tuneup
The Danger of a Strict Policy
The memory mode='strict' XML attribute tells the hypervisor to provide memory from the specified nodeset list and only from there. What happens if the nodeset cannot provide all of the requested memory? In other words, what happens in a memory over-commitment situation? Well, the hypervisor starts to swap the guest to disk to free up some memory.
If the swap space is not big enough (which is frequently the case), the Out Of Memory killer (OOM killer) mechanism does its job and kills processes to reclaim memory. In the best case, only the guest asking for memory is killed, because the kernel thinks it can reclaim a lot of memory from it. In the worst case, the entire hypervisor is badly damaged because other key processes have been killed.
KVM documentation mentions this mechanism and asks system managers to avoid memory over-commitment situations. It also references other documents explaining how to size or increase the swap space of the hypervisor using a specific algorithm.
Real problems arise when you fall into this situation without knowing it. Consider the case in which the vm1 guest is configured with 12GB of memory and pinned to node 0, with the default strict mode added silently by virsh during the editing operation. Everything should run fine, because node 0 is able to provide up to 15GB of memory (see the numactl -H output). However, when an application or a second guest is scheduled on node 0 and consumes 2 or 3GB of memory, the entire hypervisor is at risk of failure.
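For reference, the relevant part of such a vm1 definition would look something like the following sketch (the vCPU pinning is illustrative; 12582912KiB corresponds to the 12GB of the example):

  <currentMemory unit='KiB'>12582912</currentMemory>
  <vcpu placement='static' cpuset='0-9'>4</vcpu>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>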
You can test (at your own risk) this predictable behavior with the memtester [5] program. Launch it in a guest configured with a strict memory policy and ask memtester to test more memory than is available in the nodeset. Watch carefully the evolution of the swap space and the kernel ring buffer (dmesg); you may see messages like:

  Out of memory: Kill process 29937 (qemu-kvm) score 865 or sacrifice child
  Killed process 29937, UID 107, (qemu-kvm)
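As a rough sketch of such a test with the 12GB guest above, you could ask memtester inside the guest for most of its memory, which is more than node 0 can still supply once other processes consume 2 or 3GB (the 11G value is an assumption; recent memtester versions accept a size suffix, and one iteration is enough):

  # memtester 11G 1

On the hypervisor, follow the swap usage and the kernel ring buffer in two other terminals:

  # watch -n1 free -m
  # dmesg -w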
The default strict memory mode policy can put both guests and the hypervisor itself at risk, without any notification or warning provided to the system administrator before a failure.
Prefer the Preferred Memory Mode
If you want to bind KVM guests manually to specific NUMA nodes (CPU and memory), my advice is to use the preferred memory mode policy instead of the default strict mode. This mode allows the hypervisor to provide memory from nodes other than the ones in the nodeset, as needed. Your guests may not run with maximum performance when the hypervisor starts to allocate memory from remote nodes, but at least you won't risk having them fail because of the OOM killer.
The configuration file of the vm1 guest with 35GB of memory, bound to NUMA node 0 in preferred mode, looks like this:
  # virsh edit vm1
  ...
  <currentMemory unit='KiB'>36700160</currentMemory>
  <vcpu placement='static' cpuset='0-9'>4</vcpu>
  <numatune>
    <memory mode='preferred' nodeset='0'/>
  </numatune>
  ...
This guest is in a memory over-commitment state but will not be punished when the memtester program asks for more memory than node 0 can provide.
In Figure 4, the top left small terminal shows the preferred numa_mode, as well as the vCPU bindings. The other small terminal displays the free memory per NUMA node. Observe that the memory comes from nodes 0 and 1. Note that the same test in strict mode leads to punishing the guest.
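If you want to check these settings from the command line rather than a graphical dashboard, the standard virsh and numactl tools show roughly the same information (vm1 is the guest used throughout this example):

  # virsh numatune vm1
  # virsh vcpuinfo vm1
  # numactl -H

The first command prints the numa_mode and numa_nodeset of the running guest, the second shows the CPU affinity of each vCPU, and the third displays the free memory per NUMA node on the host.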
Interleave Memory
When a memory nodeset contains several NUMA nodes under the strict or preferred policy, KVM uses a sequential algorithm to provide the memory to guests. When the first NUMA node in the nodeset is short of memory, KVM allocates pages from the second node in the list, and so on. The problem with this approach is that the host memory rapidly becomes fragmented, and other guests or processes may suffer from this fragmentation.
The interleave XML attribute forces the hypervisor to provide equal chunks of memory from all the nodes of the nodeset at the same time. This approach ensures a much better use of the host memory, leading to better overall system performance. In addition to this smooth memory use, interleave mode does not punish any process in the case of memory over-commitment. Instead, it behaves like preferred mode and provides memory from other nodes when the nodeset runs out of memory.
In Figure 5, the memtester program needs 35GB from a nodeset composed of nodes 0, 1, and 2 with a strict policy. The amount of memory provided by those three nodes is approximately 13, 15, and 7GB, respectively. In Figure 6, the same request with an interleave policy shows an equal memory consumption of roughly 11.5GB per node. The XML configuration for the interleave memory policy is:

  <numatune>
    <memory mode='interleave' nodeset='0-2'/>
  </numatune>
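To see how a guest's memory is actually spread across the nodes on your own system, the numastat tool (from the numactl package) gives a per-node breakdown; a minimal check might look like this:

  # numastat -c qemu-kvm
  # numactl -H

The first command shows, per NUMA node, the memory used by processes matching qemu-kvm; the second shows the free memory remaining on each node.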