Manual CPU Pinning
So far, optimization efforts have been related to guest memory and not to vCPU tuning. The cpuset XML attribute states that vCPUs will run on a pCPU belonging to a set of pCPUs, but it doesn't say which one. To specify a one-to-one relationship between vCPUs and pCPUs, you have to use the vcpupin element, which is a child of the <cputune> element. To pin the four vCPUs of vm1 to pCPUs 0, 1, 2, and 3 and the vm2 vCPUs to pCPUs 4, 5, 6, and 7, configure the guests as shown in Listing 3 (Figure 7).
Listing 3
Guest Configuration
# grep cpu /etc/libvirt/qemu/vm[12].xml
vm1.xml:  <vcpu placement='static'>4</vcpu>
vm1.xml:  <cputune>
vm1.xml:    <vcpupin vcpu='0' cpuset='0'/>
vm1.xml:    <vcpupin vcpu='1' cpuset='1'/>
vm1.xml:    <vcpupin vcpu='2' cpuset='2'/>
vm1.xml:    <vcpupin vcpu='3' cpuset='3'/>
vm1.xml:  </cputune>
vm2.xml:  <vcpu placement='static'>4</vcpu>
vm2.xml:  <cputune>
vm2.xml:    <vcpupin vcpu='0' cpuset='4'/>
vm2.xml:    <vcpupin vcpu='1' cpuset='5'/>
vm2.xml:    <vcpupin vcpu='2' cpuset='6'/>
vm2.xml:    <vcpupin vcpu='3' cpuset='7'/>
vm2.xml:  </cputune>
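After restarting the guests, you can verify that the pinning took effect. The standard virsh client reports the vCPU affinity and the physical CPU each vCPU is currently running on (the exact output format depends on your libvirt version):

# virsh vcpupin vm1
# virsh vcpuinfo vm1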
Expose NUMA to Guests
Libvirt also lets you expose a virtual NUMA hardware infrastructure to guests. You can use this feature to optimize the scale-up of multiple applications (e.g., databases) that are isolated from each other inside the guests by cgroups or other container mechanisms.
Listing 4 shows four NUMA nodes (called cells in libvirt) in guest vm1, each with two vCPUs and 17.5GB of memory, and a global interleave memory mode across physical NUMA nodes 0-3.
Listing 4
NUMA Node Setup
# grep -E 'cpu|numa|memory' /etc/libvirt/qemu/vm1.xml
  <memory unit='KiB'>73400320</memory>
  <vcpu placement='static'>8</vcpu>
  <numatune>
    <memory mode='interleave' nodeset='0-3'/>
  </numatune>
  <cpu>
    <numa>
      <cell cpus='0,1' memory='18350080'/>
      <cell cpus='2,3' memory='18350080'/>
      <cell cpus='4,5' memory='18350080'/>
      <cell cpus='6,7' memory='18350080'/>
    </numa>
  </cpu>
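The numbers are consistent: each cell is assigned 18350080 KiB (the 17.5GB per node mentioned above), and 4 x 18350080 KiB = 73400320 KiB, which matches the total declared in the <memory> element.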
When the guest has booted, the numactl -H command run in the guest returns the information shown in Listing 5.
Listing 5
Output of numactl -H
# ssh vm1 numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1
node 0 size: 17919 MB
node 0 free: 3163 MB
node 1 cpus: 2 3
node 1 size: 17920 MB
node 1 free: 14362 MB
node 2 cpus: 4 5
node 2 size: 17920 MB
node 2 free: 42 MB
node 3 cpus: 6 7
node 3 size: 17920 MB
node 3 free: 16428 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10
Note that the SLIT (System Locality Information Table) distances in Listing 5 contain only dummy values populated arbitrarily by the hypervisor.
Automated Solution with numad
Manual tuning such as I've shown so far is very efficient and substantially increases guest performance, but managing several virtual machines this way on a single hypervisor can rapidly become overly complex and time consuming. Red Hat recognized that complexity and developed the user-mode numad(8) service to automate the best placement of guests on NUMA servers.
You can take advantage of this service by simply installing and starting it on the hypervisor. Then, configure the guests with placement='auto' and start them. Depending on the load and the memory consumption of the guests, numad sends placement advice to the hypervisor upon request.
To view this interesting mechanism, you can manually start numad in debug mode and look for the keyword "Advising" in the numad.log file by entering these three commands:
# yum install -y numad
# numad -d
# tail -f /var/log/numad.log | grep Advising
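For normal operation, you would not run numad by hand in debug mode but as a regular service. Assuming the numad package ships its usual service unit (as it does on RHEL and Fedora), that amounts to:

# systemctl enable numad
# systemctl start numad

(or service numad start and chkconfig numad on on older SysV-init-based hosts).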
Next, configure the guest with automatic vCPU and memory placement. In this configuration, the memory mode must be strict. Theoretically, you run no risk of the guest being punished because, with automatic placement, the host can provide memory from all the nodes in the server on numad's advice. In practice, if a process requests too much memory too rapidly, numad does not have time to tell KVM to extend its memory pool, and you bump into the OOM killer situation mentioned earlier. This issue is well known and is being addressed by developers.
The following XML lines show how to configure the automatic placement:
# grep placement /etc/libvirt/qemu/vm1.xml
  <vcpu placement='auto'>4</vcpu>
  <memory mode='strict' placement='auto'/>
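Once such a guest is running, you can check the memory placement that was actually applied; virsh numatune reports the current memory mode and node set for the domain (output details vary with the libvirt version):

# virsh numatune vm1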
As soon as you start the guest, two numad advisory items appear in the log file. The first one provides initialization advice, and the second says that the guest should be pinned to node 3:
# virsh start vm1
# tail -f /var/log/numad.log | grep Advising
...
Advising pid -1 (unknown) move from nodes () to nodes (1-3)
Advising pid 13801 (qemu-kvm) move from nodes (1-3) to nodes (3)
With memtester, you can test this behavior by consuming some memory in the guest and again watching the reaction of numad:
# ssh vm1 /usr/kits/memtester/memtester 28G 1
...
Advising pid 13801 (qemu-kvm) move from nodes (3) to nodes (2-3)
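To see where the guest's memory actually ends up, you can also look at the per-node allocation of the qemu-kvm process on the hypervisor. The numastat tool from the numactl package accepts a process-name pattern (the process name may differ on your distribution):

# numastat -c qemu-kvm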
In other words, numad advises extending the memory source for this guest by one additional NUMA node.