A self-healing VM system
Server, Heal Thyself
A common definition of a self-healing system is a set of servers that can detect malfunctions within their own operations and then repair the errors without outside intervention. Here, "repair" specifically means replacing the problematic node entirely. In the example discussed in this article, I used Monit [1] to monitor the state of each virtual machine and Ansible to execute the replacement of faulty nodes. A DHCP server was also configured to assign new network addresses and reclaim addresses that are no longer in use.
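To make the flow concrete, the following Python sketch mimics what the Monit-plus-Ansible pair does in this setup: probe each peer node and, on failure, hand the node name to an Ansible playbook for replacement. In the actual environment, Monit performs the detection and invokes Ansible from an exec action; the peer list, playbook path, and target variable here are placeholders.

# Illustrative detect-and-repair loop; Monit does the real monitoring in
# this setup. Peer addresses and the playbook path are placeholders.
import subprocess

PEERS = {"node-web1": "10.14.2.11", "node-web2": "10.14.2.12"}
PLAYBOOK = "/etc/selfheal/rebuild.yml"

def peer_alive(ip):
    # Three ICMP probes with a short timeout; a real monitor retries
    # longer before declaring a node dead.
    return subprocess.run(["ping", "-c", "3", "-W", "2", ip],
                          capture_output=True).returncode == 0

for name, ip in PEERS.items():
    if not peer_alive(ip):
        # Hand the faulty node to Ansible, which deletes and recreates it
        subprocess.run(["ansible-playbook", PLAYBOOK,
                        "--extra-vars", f"target={name}"])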
This article does not provide an in-depth tutorial on the technologies used in the examples; consult their documentation if you need more detail. Figure 1 shows an overview of the setup.
Clouds and Hypervisors
The term "hypervisor" in this article means any platform or program that manages virtual machines to share the underlying hardware resources of a cluster of host servers. Under this definition, Amazon Web Services and Azure count as hypervisors, although traditional examples, such as Red Hat KVM, VMware ESXi, and Xen, are more suitable for this application.
Regardless of which platform is chosen, it needs a robust set of APIs for specific operations. The most critical are instance state management (create, delete, power on/off), registration of new instances, and VM attribute queries.
Ideally, these APIs use a language and protocol that the configuration manager you choose can consume easily. VM attribute queries are needed so that the other components of the system can detect the network address acquired by the instance's operating system. Because new nodes are created dynamically as needed, the network address can change (or expire when using DHCP). Queries are also useful for reading flags or notes embedded within the instance, so a hypervisor that includes some kind of notation field with each instance is highly recommended.
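As a concrete illustration, the following Python sketch queries vCenter for an instance's IP address and notes field through the pyVmomi SDK. The host name, credentials, and VM name are placeholders, and the unverified SSL context is acceptable only in a lab:

# Minimal pyVmomi sketch; host, credentials, and VM name are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def find_vm(content, name):
    # Walk the inventory and return the VM with the given name, or None
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    try:
        return next((v for v in view.view if v.name == name), None)
    finally:
        view.Destroy()

ctx = ssl._create_unverified_context()  # lab only; verify certs in production
si = SmartConnect(host="vcenter.giantco.cxm", user="admin",
                  pwd="secret", sslContext=ctx)
content = si.RetrieveContent()

vm = find_vm(content, "node-web1")
if vm is not None:
    # guest.ipAddress is populated by the guest tools once DHCP assigns a lease
    print("IP address:", vm.guest.ipAddress)
    # config.annotation is the free-form notes field that can carry flags
    print("Notes:", vm.config.annotation)

Disconnect(si)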
In normal operation of this system, multiple nodes attempt instance deletion and creation, sometimes at nearly the same time. The job of the hypervisor is to function as a "mutex" for build requests from multiple nodes; that is, it must allow only one node at a time to rebuild an instance of a specific name and prevent duplicate instance names. If the rebuild process is interrupted and cannot continue, the hypervisor should, after a suitable timeout period, completely remove the unfinished instance from its inventory.
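Client-side code can at least avoid obviously duplicate requests before leaving the final decision to the hypervisor. Here is a small sketch reusing the hypothetical find_vm() helper from the previous example; clone_instance is likewise a placeholder for whatever clone or build call you use:

# The hypervisor remains the real mutex; this check only avoids obvious
# duplicate rebuild requests from this node.
def safe_rebuild(content, name, clone_instance):
    if find_vm(content, name) is not None:
        print(name, "already exists; another node is handling the rebuild")
        return False
    clone_instance(name)  # the hypervisor rejects any duplicate that slips past
    return True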
The final requirement of a hypervisor is the ability to create instance images. Although storing these images within the hypervisor itself is acceptable, it is preferable that they can be exported to and imported from a file that each node can save and transfer.
In the test environment, VMware vCenter Server [2], which lets admins control all their VMware vSphere environments in one place (Figure 2), was chosen to manage the nodes.
Operating System
Not much needs to be said about the operating system (OS): it must support DHCP, along with the node monitor and configuration management program you prefer, and it must be supported by the chosen hypervisor. Most modern operating systems fit these requirements. For even better system protection, you can set the node monitor to "respawn" if it ever shuts down for any reason (a notification email is recommended).
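Under systemd or a comparable init system, the respawn is usually a one-line restart policy, but the idea can be sketched in Python as a wrapper that restarts the monitor and mails a notification. The SMTP host and addresses are placeholders; Monit's -I flag keeps it in the foreground so the wrapper notices an exit:

# Illustrative respawn wrapper; an init system restart policy is the usual
# mechanism. SMTP host and addresses are placeholders.
import smtplib
import subprocess
import time
from email.message import EmailMessage

while True:
    proc = subprocess.run(["monit", "-I"])  # -I keeps Monit in the foreground
    msg = EmailMessage()
    msg["Subject"] = "Node monitor exited; respawning"
    msg["From"] = "monitor@giantco.cxm"
    msg["To"] = "admin@giantco.cxm"
    msg.set_content("monit exited with status %d" % proc.returncode)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)
    time.sleep(5)  # brief backoff before the restart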
The OS for the test environment is Red Hat Enterprise Linux 7.3, which supports easy-to-use protocols such as SSH, DHCP, and Rsync.
DHCP Service
A DHCP server avoids the need to keep a static IP list on each node; it must supply an IP address, subnet mask, and default route to all nodes. Because addresses can change, DHCP forces the node monitor to request the current IP of a node from the hypervisor, and a separate program (such as VMware Tools) might be necessary for the hypervisor to read network information from the operating system.
The lease period for IP addresses should also be adjusted to minimize requests sent over the network and reduce the chances of node monitor failure (while waiting for a new IP). The ideal lease period is infinite but might not be practical for your environment. Listing 1 shows an example dhcpd.conf file from the test environment.
Listing 1
Example DHCP Configuration File
# DHCP Server Configuration file.
#   see /usr/share/doc/dhcp*/dhcpd.conf.example
#   see dhcpd.conf(5) man page
#
subnet 10.14.2.0 netmask 255.255.255.0 {
  option routers 10.14.2.1;
  option subnet-mask 255.255.255.0;
  option domain-name-servers 10.14.2.210;
  option domain-search "giantco.cxm";
  range 10.14.2.10 10.14.2.100;
}
Take extreme care to ensure the DHCP server does not exhaust its pool of available IPs. If a rebuilt node cannot acquire a new network address, the configuration manager might time out while waiting and terminate, leaving an "orphan" instance that will never be removed or rebuilt by the system. Multiple DHCP servers could be used to prevent the DHCP service from becoming a single point of failure.
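A simple watchdog can warn before the pool runs dry. The following sketch assumes the ISC dhcpd leases file in its default Red Hat location and counts active leases against the 91-address pool defined in Listing 1:

# Rough pool-health check; the leases file path is the Red Hat default
# and may differ on your system.
POOL_SIZE = 91  # 10.14.2.10 through 10.14.2.100, per Listing 1
LEASES_FILE = "/var/lib/dhcpd/dhcpd.leases"

def active_leases(path=LEASES_FILE):
    active = set()
    current = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("lease "):
                current = line.split()[1]
            elif current and line.startswith("binding state"):
                # Later entries in the file override earlier ones
                if "active" in line:
                    active.add(current)
                else:
                    active.discard(current)
    return len(active)

used = active_leases()
if used > POOL_SIZE * 0.8:
    print("Warning: %d of %d leases active; pool nearly exhausted"
          % (used, POOL_SIZE))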