Lead Image © THISSATAN KOTIRAT , 123RF.com

A self-healing VM system

Server, Heal Thyself

Article from ADMIN 51/2019
By
The right combination of mostly free automation and monitoring tools can create a self-healing system, in which your servers fix themselves.

A common definition of a self-healing system is a set of servers that can detect a malfunction within its own operations and then repair any errors without outside intervention. In this article, "repair" specifically means replacing the problematic node entirely. In the example discussed here, I used Monit [1] to monitor the state of each virtual machine and Ansible to execute the replacement of faulty nodes. A DHCP server was also configured to assign new network addresses and reclaim addresses that are no longer in use.

This article does not provide an in-depth tutorial on the technologies used in the examples; readers should consult the relevant documentation as needed. Figure 1 shows an overview of the setup.

Figure 1: The big picture. DHCP services may exist on an outside system, if desired.

Clouds and Hypervisors

The term "hypervisor" in this article means any platform or program that manages virtual machines so that they share the underlying hardware resources of a cluster of host servers. Under this definition, Amazon Web Services and Azure count as hypervisors, although traditional examples, such as Red Hat KVM, VMware ESXi, and Xen, are more suitable for this application.

Regardless of which platform is chosen, it needs a robust set of APIs for specific operations. The most critical are instance state management (create, delete, power on/off), registration of new instances, and VM attribute queries.

Ideally these APIs should be in a language and protocol that is easily understood by the configuration manager you choose. VM attribute queries are needed so that the other components of the system can detect the network address acquired by the instance's operating system. Because new nodes are created dynamically as needed, the network address can change (or expire when using DHCP). Queries are also useful for reading flags or notes embedded within the instance. A hypervisor that includes some kind of notation field with each instance is highly recommended.
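The attribute query described above can be sketched as a simple polling loop. This is a minimal illustration, not the setup used in the test environment: the function name wait_for_ip and the get_ip callable are hypothetical stand-ins for whatever query your hypervisor's API actually provides.

```python
import time
from typing import Callable, Optional


def wait_for_ip(
    get_ip: Callable[[str], Optional[str]],
    name: str,
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the named instance reports a network address.

    get_ip is a hypothetical wrapper around the hypervisor's attribute
    query; it should return None (or an empty string) while the guest
    has not yet acquired an address.
    """
    deadline = clock() + timeout
    while True:
        ip = get_ip(name)
        if ip:
            return ip
        if clock() >= deadline:
            raise TimeoutError(
                f"{name} did not report an address within {timeout}s"
            )
        sleep(interval)
```

Passing the query in as a callable keeps the loop independent of any particular hypervisor, and raising TimeoutError gives the caller a chance to clean up the instance instead of waiting indefinitely.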

In normal operation, multiple nodes can attempt to delete and create instances, and some of these requests arrive at nearly the same time. The job of the hypervisor is to function as a "mutex" for build requests from multiple nodes; that is, it must allow only one node at a time to rebuild an instance of a given name and must prevent duplicate instance names. If the rebuild process is interrupted and cannot continue, the hypervisor should, after a suitable timeout period, completely remove the unfinished instance from its inventory.
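The mutex behavior described above can be modeled in a few lines. The following is a toy, in-memory sketch meant only to make the rules concrete; a real hypervisor enforces them internally, and the class name BuildRegistry and its methods are hypothetical.

```python
import threading
import time


class BuildRegistry:
    """Toy model of the hypervisor acting as a mutex for rebuild requests."""

    def __init__(self, build_timeout=600.0, clock=time.monotonic):
        self._lock = threading.Lock()
        self._building = {}  # instance name -> (owner node, start time)
        self._timeout = build_timeout
        self._clock = clock

    def try_claim(self, name, node):
        """Return True if node may rebuild name, False if a build is active."""
        with self._lock:
            now = self._clock()
            claim = self._building.get(name)
            if claim is not None and now - claim[1] < self._timeout:
                return False  # another node is already rebuilding this name
            # No claim exists, or the previous one timed out: replace it.
            self._building[name] = (node, now)
            return True

    def finish(self, name, node):
        """Release the claim, but only if node still owns it."""
        with self._lock:
            claim = self._building.get(name)
            if claim is not None and claim[0] == node:
                del self._building[name]
```

Here, two nodes that call try_claim("web01", ...) at nearly the same time get different answers: the first receives True and the second False. Once build_timeout seconds pass without a call to finish, the stale claim is discarded and another node may start over, which mirrors the cleanup of an unfinished instance.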

The final requirement of a hypervisor is the ability to create instance images. Although it is acceptable to store these images within the hypervisor itself, preferably these images can be exported and imported to and from a file that can be saved and transferred by each node.

In the test environment, VMware vCenter Server [2], which lets admins control all their VMware vSphere environments in one place (Figure 2), was chosen to manage the nodes.

Figure 2: A hypervisor control panel.

Operating System

Not much needs to be said about the operating system (OS), which just needs to support DHCP, along with the node monitor and configuration management program you prefer. It must also be supported by the chosen hypervisor. Most modern operating systems fit these requirements. For even better system protection, you can set the node monitor to "respawn" if it ever shuts down for any reason (notification email is recommended).

The OS for the test environment is Red Hat Enterprise Linux 7.3, which supports easy-to-use protocols such as SSH, DHCP, and Rsync.

DHCP Service

A DHCP server avoids the need to keep a static IP list on each node. It must supply an IP address, subnet mask, and default route to all nodes. Because addresses are assigned dynamically, the node monitor has to request the current IP of a node from the hypervisor. A separate program might be necessary for the hypervisor to access network information from the operating system.

The lease period for IP addresses should also be lengthened to minimize renewal requests sent over the network and to reduce the chance of the node monitor failing while it waits for a new IP. The ideal lease period is infinite but might not be practical for your environment. Listing 1 shows an example dhcpd.conf file from the test environment.

Listing 1

Example DHCP Configuration File

# DHCP Server Configuration file.
#   see /usr/share/doc/dhcp*/dhcpd.conf.example
#   see dhcpd.conf(5) man page
#
subnet 10.14.2.0 netmask 255.255.255.0 {
    option routers                  10.14.2.1;
    option subnet-mask              255.255.255.0;
    option domain-name-servers       10.14.2.210;
    option domain-search             "giantco.cxm";
    range 10.14.2.10 10.14.2.100;
}
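The effect of the lease period on network traffic can be estimated with simple arithmetic. The sketch below assumes that each client first tries to renew halfway through its lease (the default renewal timer T1 in RFC 2131) and ignores retries and rebinding; the function name and the node count in the usage note are illustrative, not taken from the test environment.

```python
SECONDS_PER_DAY = 86_400


def renewals_per_day(lease_seconds: int, nodes: int = 1) -> float:
    """Estimate daily DHCP renewal requests for a group of nodes.

    Assumes every client renews at T1 = lease / 2 and that each renewal
    succeeds on the first attempt.
    """
    if lease_seconds <= 0:
        raise ValueError("lease_seconds must be positive")
    t1 = lease_seconds / 2
    return nodes * SECONDS_PER_DAY / t1
```

Under these assumptions, a 10-minute lease produces 288 renewals per node per day, whereas a one-day lease produces 2, so lengthening the lease reduces both the traffic and the number of moments at which a renewal can fail.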

Take extreme care to ensure the DHCP server does not exhaust its pool of available IPs. If a rebuilt node cannot acquire a new network address, the configuration manager might time out while waiting and terminate, leaving an "orphan" instance that will never be removed or rebuilt by the system. Multiple DHCP servers could be used to prevent the DHCP service from becoming a single point of failure.
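A quick check of pool headroom can help catch exhaustion before it happens. The following sketch computes the size of the range from Listing 1 and compares it with a count of active leases; how that count is obtained is left to the reader, and the function names and the reserve of five addresses are assumptions chosen for illustration.

```python
import ipaddress


def pool_size(first: str, last: str) -> int:
    """Number of addresses in an inclusive DHCP range."""
    start = ipaddress.IPv4Address(first)
    end = ipaddress.IPv4Address(last)
    if end < start:
        raise ValueError("range end precedes range start")
    return int(end) - int(start) + 1


def pool_has_headroom(first: str, last: str,
                      active_leases: int, reserve: int = 5) -> bool:
    """True if at least `reserve` addresses remain free in the range."""
    return pool_size(first, last) - active_leases >= reserve


# The range from Listing 1 holds 91 addresses.
print(pool_size("10.14.2.10", "10.14.2.100"))
```

A monitoring job could call pool_has_headroom before each rebuild and postpone the rebuild, or send a notification, when it returns False.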
