Best practices for KVM on NUMA servers

Tuneup

Manual CPU Pinning

So far, optimization efforts have focused on guest memory rather than on vCPU tuning. The cpuset XML attribute restricts the vCPUs to a set of pCPUs, but it does not specify which pCPU each vCPU runs on.

To specify a one-to-one relationship between vCPUs and pCPUs, use the <vcpupin> element, which is part of the <cputune> element. To pin the four vCPUs of vm1 to pCPUs 0, 1, 2, and 3 and the vm2 vCPUs to pCPUs 4, 5, 6, and 7, configure the guests as shown in Listing 3 (Figure 7).

Figure 7: pCPU isolation of vm1 and vm2.

Listing 3

Guest Configuration

# grep cpu /etc/libvirt/qemu/vm[12].xml
vm1.xml:  <vcpu placement='static'>4</vcpu>
vm1.xml:  <cputune>
vm1.xml:    <vcpupin vcpu='0' cpuset='0'/>
vm1.xml:    <vcpupin vcpu='1' cpuset='1'/>
vm1.xml:    <vcpupin vcpu='2' cpuset='2'/>
vm1.xml:    <vcpupin vcpu='3' cpuset='3'/>
vm1.xml:  </cputune>
vm2.xml:  <vcpu placement='static'>4</vcpu>
vm2.xml:  <cputune>
vm2.xml:    <vcpupin vcpu='0' cpuset='4'/>
vm2.xml:    <vcpupin vcpu='1' cpuset='5'/>
vm2.xml:    <vcpupin vcpu='2' cpuset='6'/>
vm2.xml:    <vcpupin vcpu='3' cpuset='7'/>
vm2.xml:  </cputune>
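
The same pinning can also be applied with virsh instead of editing the XML by hand. The following is a minimal sketch that only prints the virsh vcpupin commands for a contiguous block of pCPUs, so they can be reviewed before anything is changed. The helper name print_pin_cmds and its three arguments are illustrative assumptions, not part of libvirt.

```shell
#!/bin/sh
# Print (do not execute) the virsh commands that pin vCPU i of a guest
# to pCPU first+i. print_pin_cmds is a hypothetical helper name.
print_pin_cmds() {
    dom=$1      # guest name, e.g. vm1
    first=$2    # first pCPU of the contiguous block
    count=$3    # number of vCPUs to pin
    i=0
    while [ "$i" -lt "$count" ]; do
        # --config writes the pinning to the persistent guest definition
        echo "virsh vcpupin $dom $i $((first + i)) --config"
        i=$((i + 1))
    done
}

# Same layout as Listing 3: vm1 on pCPUs 0-3, vm2 on pCPUs 4-7.
print_pin_cmds vm1 0 4
print_pin_cmds vm2 4 4
```

Once the printed commands look right, they can be piped to sh on the hypervisor. Running virsh vcpupin vm1 without a vCPU argument lists the current pinning of all vCPUs of that guest.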

Expose NUMA to Guests

Libvirt can expose a virtual NUMA hardware topology to guests. You can use this feature to optimize the scale-up of multiple applications (e.g., databases) that are isolated from each other in the guests by cgroups or other container mechanisms.

Listing 4 defines four NUMA nodes (called cells in libvirt) in guest vm1, each with two vCPUs and 17.5GB of memory, and a global interleave memory mode across physical NUMA nodes 0-3.

Listing 4

NUMA Node Setup

# grep -E 'cpu|numa|memory' /etc/libvirt/qemu/vm1.xml
  <memory unit='KiB'>73400320</memory>
  <vcpu placement='static'>8</vcpu>
  <numatune>
    <memory mode='interleave' nodeset='0-3'/>
  </numatune>
  <cpu>
    <numa>
      <cell cpus='0,1' memory='18350080'/>
      <cell cpus='2,3' memory='18350080'/>
      <cell cpus='4,5' memory='18350080'/>
      <cell cpus='6,7' memory='18350080'/>
    </numa>
  </cpu>
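
As a quick consistency check on Listing 4, the four cell sizes should add up to the guest's total memory, and each cell should work out to 17.5GB. The short sketch below repeats that arithmetic with the values from the listing; the variable names are illustrative only.

```shell
#!/bin/sh
# Values taken from Listing 4 (all in KiB).
cell_kib=18350080
cells=4
total_kib=73400320

# The sum of the cells should match the <memory> element.
sum_kib=$((cell_kib * cells))
echo "sum of cells: $sum_kib KiB"

# Per-cell size in MiB (1 MiB = 1024 KiB).
cell_mib=$((cell_kib / 1024))
echo "per cell: $cell_mib MiB"

if [ "$sum_kib" -eq "$total_kib" ]; then
    echo "cells match total memory"
else
    echo "mismatch: total is $total_kib KiB" >&2
fi
```

Each cell comes to 17920 MiB, that is, 17.5 x 1024 MiB.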

Once the guest has booted, running numactl -H in it returns the information shown in Listing 5.

Listing 5

Output of numactl -H

# ssh vm1 numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1
node 0 size: 17919 MB
node 0 free: 3163 MB
node 1 cpus: 2 3
node 1 size: 17920 MB
node 1 free: 14362 MB
node 2 cpus: 4 5
node 2 size: 17920 MB
node 2 free: 42 MB
node 3 cpus: 6 7
node 3 size: 17920 MB
node 3 free: 16428 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

Note that the node distances (the SLIT) in Listing 5 contain only dummy values populated arbitrarily by the hypervisor.
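
One rough way to spot such placeholder values is to check whether every remote distance in the table is identical. The sketch below reads numactl -H output on standard input and does just that; slit_is_uniform is a hypothetical helper name, and a uniform table is only a hint, because real hardware with fully symmetric links can report uniform distances too.

```shell
#!/bin/sh
# Exit 0 if all remote (off-diagonal) distances are equal, 1 if they
# differ, 2 if no distance table was found. Hypothetical helper.
slit_is_uniform() {
    awk '
        /^node distances:/ { in_table = 1; next }
        in_table && $1 == "node" { next }        # header row of the table
        in_table && $1 ~ /^[0-9]+:$/ {
            row = $1; sub(/:$/, "", row)
            for (i = 2; i <= NF; i++) {
                if (i - 2 == row + 0) continue   # skip the local distance
                if (remote == "") remote = $i
                else if ($i != remote) differs = 1
            }
        }
        END {
            if (remote == "") exit 2
            exit (differs ? 1 : 0)
        }
    '
}

# Distance table copied from Listing 5.
slit_is_uniform <<'EOF' && echo "all remote distances are equal"
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10
EOF
```

Against a running guest, the same function can be fed with ssh vm1 numactl -H | slit_is_uniform.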

Automated Solution with numad

Manual tuning such as I've shown so far is very effective and substantially increases guest performance, but managing several virtual machines on a single hypervisor can rapidly become overly complex and time consuming. Red Hat recognized that complexity and developed the user-mode numad(8) service to automate guest placement on NUMA servers.

You can take advantage of this service by simply installing and starting it in the hypervisor. Then, configure the guests with placement='auto' and start them. Depending on the load and the memory consumption of the guests, numad sends placement advice to the hypervisor upon request.

To watch this mechanism at work, you can manually start numad in debug mode and look for the keyword "Advising" in the numad.log file by entering these three commands:

# yum install -y numad
# numad -d
# tail -f /var/log/numad.log | grep Advising

Next, configure the guest with automatic vCPU and memory placement. In this configuration, the memory mode must be strict. In theory, the guest is not at risk of being punished, because with automatic placement the host can provide memory from all the nodes in the server on numad's advice. In practice, if a process requests too much memory too quickly, numad does not have time to tell KVM to extend its memory pool, and you run into the OOM killer situation mentioned earlier. This issue is well known and is being addressed by developers.

The following XML lines show how to configure the automatic placement (as in Listing 4, the <memory> element sits inside <numatune>):

# grep placement /etc/libvirt/qemu/vm1.xml
<vcpu placement='auto'>4</vcpu>
  <memory mode='strict' placement='auto'/>

As soon as you start the guest, two numad advisory items appear in the log file. The first one provides initialization advice, and the second says that the guest should be pinned on node 3:

# virsh start vm1
# tail -f /var/log/numad.log | grep Advising
...
Advising pid -1 (unknown) move from nodes () to nodes (1-3)
Advising pid 13801 (qemu-kvm) move from nodes (1-3) to nodes (3)
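
When several guests are running, it can help to reduce these lines to just the process and the nodes numad recommends. The sketch below does that with sed; advice_summary is a hypothetical helper name, and the pattern is modeled only on the two sample lines above, so it may need adjusting for other numad versions or log prefixes.

```shell
#!/bin/sh
# Reduce numad "Advising" lines to: <pid> <process> <target nodes>.
# Lines that do not match the pattern are dropped. Hypothetical helper.
advice_summary() {
    sed -n 's/^.*Advising pid \(-\{0,1\}[0-9][0-9]*\) (\([^)]*\)) move from nodes ([^)]*) to nodes (\([^)]*\)).*$/\1 \2 \3/p'
}

# The two sample lines from the log excerpt.
advice_summary <<'EOF'
Advising pid -1 (unknown) move from nodes () to nodes (1-3)
Advising pid 13801 (qemu-kvm) move from nodes (1-3) to nodes (3)
EOF
```

For the two sample lines, this prints "-1 unknown 1-3" and "13801 qemu-kvm 3". On the hypervisor, the function can be fed with grep Advising /var/log/numad.log | advice_summary.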

To test this, use memtester to consume some memory in the guest and watch numad's reaction again:

# ssh vm1 /usr/kits/memtester/memtester 28G 1
...
Advising pid 13801 (qemu-kvm) move from nodes (3) to nodes (2-3)

Here, numad advises adding one NUMA node to the memory source for this guest.
