Detect failures and ensure high availability
On the Safe Side
Many clustering and high-availability frameworks have been developed for Linux, but in this article, I focus on the mainstream and widely used Corosync and Pacemaker services, along with DRBD. If you follow along, you'll learn how to configure an active-passive two-node cluster that replicates local storage – the storage written to and read from by vital applications or services – to its neighboring node. In this way, the cluster can continue to host and serve data requests as long as at least one of its nodes remains online.
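To give you a sense of where the setup begins, the following is a minimal sketch of the initial cluster bring-up. It assumes a RHEL/CentOS-style distribution with the pcs shell (version 0.10 or later); the cluster name mycluster and the node names node1 and node2 are hypothetical, and package names vary by distribution:

```
# Install the stack and start the pcs daemon (run on both nodes);
# drbd-utils may come from a third-party repository on some distros:
dnf install -y pcs pacemaker corosync drbd-utils
systemctl enable --now pcsd

# Authenticate the nodes to each other, then create and start the cluster:
pcs host auth node1 node2
pcs cluster setup mycluster node1 node2
pcs cluster start --all
pcs cluster enable --all
```

On older pcs 0.9 releases, the equivalents are pcs cluster auth and pcs cluster setup --name mycluster node1 node2.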
High Availability
As we depend more and more on technology and the services it provides, availability becomes increasingly important. In the recent past, hardware vendors and solution providers charged a lot of money for proprietary products that ensured a high level of tolerance for hardware and I/O path failures. Those days have come and gone. Today, the data center has evolved, and with that evolution comes the adoption of more commodity hardware (i.e., hardware not built with the level of resilience or sophistication typically seen in a mainframe, an IBM POWER system, or other equivalent machines).
Those non-commodity machines, by contrast, were designed from the ground up to withstand all sorts of internal hardware and path failures, with redundant memory, CPUs, network interfaces, power supplies, and more. That functionality came at a much higher cost, however, and those same proprietary systems introduced a new level of complexity: Diagnosing, replacing, and repairing faulty or problematic components required both deep pockets and a well-trained technician. These factors were, and continue to be, the primary reasons for opting for commodity technology (containing only a subset of that hardware redundancy) and relying instead on software to handle the various failure scenarios.
More affordable off-the-shelf server solutions gave the data center the flexibility to build an ecosystem in whatever way administrators preferred; the only limitation was the extent of their imaginations. What pieced all of these moving parts together was software – the very same software that filled the void in fault tolerance.
Therefore, it shouldn't be surprising that many open source software solutions are doing exactly that. Production-grade open source projects have evolved and matured enough to compete with the proprietary solutions of yesterday, offering the same rich feature sets, if not more.
Zero Downtime
Failures will happen, in both hardware and software. Even if that off-the-shelf commodity server does provide redundant power supplies, and even if its local storage is protected by RAID, other things can and will go wrong. Processors can fail. Memory can go bad and corrupt vital data, and error correction won't always detect and correct every faulty bit. To address these pain points, systems need to be made redundant by enabling multiple components to perform the same set of tasks – for example, by deploying two or more similar machines and configuring them as a highly available pair that accomplishes the same set of functions.
The idea behind high availability (HA) is simple: Eliminate any and all single points of failure so that, if a server node or a communication path to the underlying storage or service goes down, data requests can still be served. The ultimate goal of a high-availability ecosystem is continuous, uninterrupted service for business-critical applications, masking both planned and unplanned outages – whether caused by system crashes, network failures, storage issues, or something else.
Downtime can cost a company time, resources, and, potentially, business. It's necessary to identify any and all single points of failure and eliminate them by configuring redundant instances – sometimes even balancing I/O across redundant paths to the same storage, a technique typically referred to as multipathing. High-availability technologies are designed to detect failures automatically and recover from them immediately.
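As a concrete example of path redundancy, the following minimal sketch enables Device Mapper multipath – one assumption here is a RHEL-style system; the tooling differs on other distributions:

```
# Install and enable Device Mapper multipath, which merges redundant
# I/O paths to the same storage into a single multipath device:
dnf install -y device-mapper-multipath
mpathconf --enable --with_multipathd y

# List the resulting multipath devices and the state of each path:
multipath -ll
```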
High availability doesn't guarantee zero downtime, but you can get pretty close to it: An appropriately configured cluster can aim for 99.999 percent ("five nines") uptime.
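To put that number in perspective, a quick back-of-the-envelope calculation: Five nines leaves (1 - 0.99999) x 365.25 days x 24 hours x 60 minutes, or about 5.26 minutes, of total downtime per year, whereas 99.99 percent uptime already allows roughly 52.6 minutes.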
High Availability Concepts
In the software realm, fault tolerance takes on a different form. Before going further, I want to cover a few of the configuration concepts used to ensure a level of redundancy:
- Active-Active: In an active-active configuration, services and resources are accessible from any and all nodes within the cluster simultaneously. If one node fails, the availability of those services or resources on the remaining nodes is unaffected.
- Active-Passive: In an active-passive configuration, services and resources are available from only a single node at a time (Figure 1). The rest of the nodes remain passive for that particular service or resource. If the node hosting the resource fails, one of the passive nodes takes over and resumes availability of that resource (see the sketch after this list).
- Failover (and Failback): Failover occurs when a service or the availability of a resource fails over from one node in the cluster to another. When the failed node comes back online in a healthy state, that same service can – if so configured – fail back to its original node.
Note that the passive nodes don't need to stay idle. If configured accordingly, resources and services can be balanced across all nodes within the cluster. For instance, Node 1 can host Resource 1 while Node 2 hosts Resource 2; if Node 1 fails, Node 2 then hosts both Resource 1 and Resource 2.
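To make these concepts concrete, here is a minimal pcs sketch of an active-passive resource and a manual failover. It again assumes pcs 0.10 or later; the resource name cluster_ip, the address 192.168.1.100, and the node name node1 are hypothetical:

```
# Create a floating IP address as a simple active-passive resource;
# Pacemaker starts it on one node and monitors it every 30 seconds:
pcs resource create cluster_ip ocf:heartbeat:IPaddr2 \
    ip=192.168.1.100 cidr_netmask=24 op monitor interval=30s

# Simulate a failure: put the hosting node in standby, watch the
# resource fail over to the other node, then bring the node back:
pcs node standby node1
pcs status resources
pcs node unstandby node1

# A positive resource stickiness discourages automatic failback
# (older pcs releases: pcs resource defaults resource-stickiness=100):
pcs resource defaults update resource-stickiness=100
```

Whether the resource fails back once node1 returns depends on that stickiness value weighed against any location constraints you define.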