Live migration of virtual machines
Go Live
The cloud is said to be efficient and flexible, but neither virtualization nor billing by resource use is new; in fact, both concepts date back to IT prehistory. Because cloud solutions are now considered mainstream, planners and consultants in IT companies assume that admins will virtualize any new server platform, saving hardware and ongoing operating costs and offering numerous other benefits, more or less incidentally.
One of these benefits is live migration of running systems, which would be impossible on real hardware but which virtualized systems manage quite easily. Freezing a VM on one host, moving it to another virtualization host, and resuming operation there with no noticeable hiccup in service from the user's point of view is the fine art of virtualization; it is what separates the wheat from the chaff.
Games Without Frontiers
A few years ago, an archetypal demo setup, in which players of the 3D first-person shooter Quake 3 didn't even notice that the VM and its server had moved from one host to another during the game, caused a sensation [1]. The effective downtime for the client was a few seconds, which could be mistaken for a hiccup on the network.
Admins enjoy live migration for other reasons, however. For example, it allows them to carry out maintenance work on virtualization hosts without extended downtime. Before work starts on a computing node, administrators migrate all the VMs to other servers, allowing them to work on the update candidates to their heart's content. Unfortunately, this process does not work between different products, architectures, or CPU variants, as related in the "Not Without Downtime" box.
Not Without Downtime: Changing the Hypervisor
The most important component in a live migration is the hypervisor, but once you choose a hypervisor, you have to live with your choice, because live migration from one hypervisor to another is currently impossible on Linux. When you build a virtualization environment, you should therefore keep in mind that homogeneous setups are generally easier to maintain than those in which different hypervisors are used.
The principle of homogeneity can be applied equally well to the topic of architecture. Even migrating a VM running on a 32-bit host to a 64-bit hypervisor can be a challenge because 64-bit VMs cannot run properly on 32-bit CPUs. Anyone planning to combine architectures in virtualization setups has to accept the lowest common denominator, and this is commonly 32-bit systems.
Live Migration to Other Hypervisors?
Companies typically ask for the ability to switch from one hypervisor to another, combined with the typical desire for as little downtime as possible. Anybody facing this requirement needs to be familiar with the capabilities of the future solution. KVM and QEMU include a plethora of tools for converting images that lend themselves to intuitive use without poring over the man pages.
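For example, qemu-img can convert a disk image from a foreign format into QEMU's native qcow2 format in a single step. A minimal sketch, assuming a VMware disk image named guest-disk.vmdk (the file names here are placeholders):

```shell
# Convert a VMware VMDK disk image to qcow2 for use with QEMU/KVM.
# -p shows a progress bar; -O sets the output format.
qemu-img convert -p -O qcow2 guest-disk.vmdk guest-disk.qcow2

# Inspect the result; "info" reports format, virtual size, and allocation.
qemu-img info guest-disk.qcow2
```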
However, the idea of migrating a QEMU/KVM VM without downtime to a hypervisor host on Windows is doomed to failure, because live migration with KVM most often happens directly in QEMU, and QEMU cannot communicate with, for example, Microsoft's Hyper-V. Direct cross-hypervisor live migration would be a technically elegant solution, but unfortunately, you can file it under "doesn't work."
Even a migration from Xen to KVM can hardly be handled in a meaningful way without interruption – albeit a brief one. If you choose commercial virtualization software, such as VMware, you could be in luck. VMware offers a feature called Virtual to Virtual Migration, which remodels any virtual machine as a new VMware VM. Of course, this doesn't work without imposing any downtime at all, but the outage is still much shorter than it would be for a manual conversion.
Technical Background
Today, prebuilt solutions exist for live migration; they have this capability out of the box, and in many cases even offer a GUI option for point-and-click live migration. The technical background of most of these solutions is the same. The virtualization solution copies the memory content of a virtual machine running on host A (i.e., the RAM data that needs to be migrated on the fly) to the target host and launches the virtualization process on the target while the system on host A continues to run.
When the RAM content has been received in full by the target host, the virtualizer stops the VM on the source host, copies the remaining delta to the target, and finally terminates the emulator on the source host. In the ideal case, this brief stop-and-copy phase is the only downtime that occurs.
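In libvirt-based KVM setups, this pre-copy procedure can be triggered with a single command. A sketch of such a call, assuming a domain named web01 and a target host kvmhost2 that can already reach the VM's storage (both names are examples):

```shell
# Live-migrate the running domain "web01" to the host "kvmhost2".
# --live keeps the guest running during the RAM copy;
# --verbose prints the migration progress.
virsh migrate --live --verbose web01 qemu+ssh://kvmhost2/system
```

Behind the scenes, libvirt instructs QEMU to perform exactly the RAM pre-copy and final stop-and-copy handover described above.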
For this principle to work, a few conditions must be met. The main issues relate to storage. To run the virtual machine on two hosts at the same time, its storage must be available to both computers for read and write operations at the moment of handover. How this works in reality depends on the shared storage technology and the architecture of the virtualization solution.
Storage: NFS, iSCSI, DRBD, Ceph
A storage area network (SAN) is often a component of a VM setup. SANs serve up their data mostly via NFS or iSCSI. With NFS, shared access to the data is managed easily; after all, this is just what NFS was designed for. iSCSI SANs are trickier and require additional support from the software that manages the virtual machines. The iSCSI standard does not actually envisage concurrent access to the same logical unit number (LUN) by two servers, so workarounds using LVM or cluster filesystems like OCFS2 and GFS2 are necessary.
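As a sketch of the NFS variant, the following commands export an image store from the SAN and mount it at the same path on every virtualization host, so that libvirt finds the disk images in an identical location before and after a migration (host names, network, and paths are examples):

```shell
# On the NFS server: export the image store to the virtualization hosts.
# no_root_squash is typically needed so libvirt can manage the image files.
echo '/srv/vm-images 192.168.0.0/24(rw,sync,no_root_squash)' >> /etc/exports
exportfs -ra

# On each virtualization host: mount the export at the same path,
# so the VM's disk definition remains valid on both source and target.
mount -t nfs san.example.com:/srv/vm-images /var/lib/libvirt/images
```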
VM live migration is easier to set up on a cluster based on the distributed replicated block device (DRBD). DRBD provides the option of running an existing resource in dual-primary mode [2], in which write access is possible on both sides of the cluster. Together with a modern cluster manager such as Pacemaker and the libvirt management framework, this configuration can be used to provide genuine, enterprise-grade live migration capabilities. The drawback of this solution is that DRBD restricts the setup to two nodes, at least until DRBD 9 is mature enough for use in a production environment.
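Dual-primary mode is enabled in the DRBD resource configuration. A fragment along these lines should give you the idea; the resource name is an example, and the syntax shown is that of DRBD 8.4:

```
resource r0 {
  net {
    # Permit write access on both cluster nodes at once. Live migration
    # needs this, because source and target host briefly open the
    # backing device at the same time during the handover.
    allow-two-primaries yes;
  }
  startup {
    become-primary-on both;
  }
}
```

Note that dual-primary operation places the burden of avoiding concurrent writes to the same blocks on the layer above DRBD; during live migration, the hypervisor guarantees that only one instance of the VM writes at any given moment.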
Administrators who are looking for reliable virtualization with live migration, as well as scale-out capabilities, should consider Ceph [3] when planning their new setups. The components required for this kind of setup are now viewed as stable and suitable for production use.
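In a libvirt setup, a guest disk stored in Ceph's RADOS Block Device (RBD) layer can be referenced directly in the domain XML, so a migrated VM finds its storage on any host with access to the Ceph cluster. A sketch, with pool, image, and monitor names as placeholders:

```
<!-- Disk definition for a libvirt guest backed by a Ceph RBD image.
     Pool ("libvirt-pool"), image ("web01-disk"), and monitor host
     are examples for this sketch. -->
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='libvirt-pool/web01-disk'>
    <host name='mon1.example.com' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
```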