Monitoring and service discovery with Consul
Staying on Top
The cloud is rightly considered one of the most significant developments in IT in recent years: It clearly divides the industry into two groups – service providers and users – each of which has specific requirements.
One requirement concerns monitoring: Conventional monitoring in a cloud makes neither the providers of the large platforms nor their users happy, because what makes the cloud special is that it serves up resources dynamically. If users need a large amount of computing power at a given moment, they book a corresponding number of virtual machines (VMs). If they only need a fraction of these resources later on, they return the surplus capacity to the cloud provider's pool.
The pool, however, must be monitored very carefully by the provider. The provider needs to know at all times how many resources can still be distributed to users – and when it's time to scale up the platform by adding more hardware.
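The provider's question of when to add hardware can be reduced to a headroom check. The following is a minimal sketch under stated assumptions: the `Hypervisor` record, the use of vCPUs as the only resource, and the 20 percent threshold are all hypothetical and are not taken from any particular cloud platform.

```python
from dataclasses import dataclass


@dataclass
class Hypervisor:
    """Hypothetical record for one compute node in the provider's pool."""
    name: str
    total_vcpus: int
    used_vcpus: int


def free_vcpus(pool):
    """Number of vCPUs that can still be handed out to users."""
    return sum(h.total_vcpus - h.used_vcpus for h in pool)


def needs_expansion(pool, min_free_ratio=0.2):
    """True if the free share of the pool drops below min_free_ratio.

    The 20 percent default is an assumed value for illustration only.
    """
    total = sum(h.total_vcpus for h in pool)
    if total == 0:
        return True  # an empty pool cannot serve anyone
    return free_vcpus(pool) / total < min_free_ratio


pool = [Hypervisor("node-a", 64, 60), Hypervisor("node-b", 64, 50)]
print(free_vcpus(pool), needs_expansion(pool))
```

A real platform would track memory, storage, and network capacity in the same way and would look at the trend over time rather than at a single snapshot.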
A Different Kind of Monitoring
From the user's and the provider's point of view, traditional monitoring approaches are of limited suitability for monitoring cloud platforms. Their view of the world is usually binary: Either a system or a service works as required, so that the corresponding entry on the monitoring system is green, or not, in which case the entry is red, and an escalation spiral is set in motion (Figure 1). If necessary, admins are dragged from their beds.
This principle is not enough in clouds. Monitoring also means that the provider regularly receives information about the utilization of the platform from the monitoring system, so they can expand the platform when necessary. Expansion itself has to be easy, which is not the case with many classic monitoring systems: If you install 200 nodes, you do not want to add the new hosts manually to your monitoring system in a process taking hours. Instead, some kind of automatic detection is required.
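Automatic detection boils down to comparing a node catalog with the hosts the monitoring system already knows. The sketch below assumes a catalog in the shape that Consul's `GET /v1/catalog/nodes` endpoint returns (entries with `Node` and `Address` fields); the sample data, the `discover_targets` helper, and the exporter port 9100 are hypothetical and only serve as illustration.

```python
import json

# Hypothetical sample, shaped like a response from Consul's
# GET /v1/catalog/nodes endpoint (only the fields used here).
SAMPLE_CATALOG = json.loads("""
[
  {"Node": "compute-001", "Address": "10.0.0.11"},
  {"Node": "compute-002", "Address": "10.0.0.12"}
]
""")


def discover_targets(catalog, known, port=9100):
    """Return monitoring targets for catalog nodes that are not yet known.

    The port is an assumption (a metrics exporter listening on 9100).
    """
    new_targets = {}
    for entry in catalog:
        name = entry["Node"]
        if name not in known:
            new_targets[name] = f'{entry["Address"]}:{port}'
    return new_targets


known_nodes = {"compute-001"}
new = discover_targets(SAMPLE_CATALOG, known_nodes)
print(new)
```

In a live setup, the catalog would come from the HTTP API instead of a static sample, and the function would run periodically, so that 200 new nodes show up in monitoring without manual work.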
Monitoring requirements are changing even more radically from the user's perspective: Users not only want to know whether their service is still available in principle, but also how high the current load is. Once the point is reached where the existing web server VMs are fighting a losing battle, the cloud platform ideally starts additional VMs and integrates them seamlessly into the existing virtual environment. If the load decreases again, the cloud ideally shuts down VMs that are no longer needed, so the customer does not incur unnecessary costs.
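The scaling decision described here can be sketched as a simple sizing rule. This is an assumed, simplified rule rather than the algorithm of any specific cloud: `desired_instances`, the 60 percent target load, and the instance limits are hypothetical. Loads are integer percentages to keep the arithmetic exact.

```python
def desired_instances(current, load_pct, target_pct=60,
                      min_instances=2, max_instances=20):
    """Size a group so that the average load per VM approaches target_pct.

    current:  number of VMs running now
    load_pct: average load per VM in percent (integer)
    All defaults are assumed values for illustration.
    """
    if current < 1:
        return min_instances
    # Integer ceiling of current * load_pct / target_pct
    wanted = (current * load_pct + target_pct - 1) // target_pct
    # Never go below the redundancy floor or above the cost ceiling
    return max(min_instances, min(max_instances, wanted))


print(desired_instances(4, 90))  # overloaded web servers: grow
print(desired_instances(6, 20))  # mostly idle: shrink
```

The lower limit keeps enough VMs for redundancy when the load is low, and the upper limit caps the cost if the load measurement goes wrong.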
Additionally, the failure of a single component within a virtual cloud environment should not affect its functionality. If the setup is built correctly, it survives the failure without problems. Monitoring in the cloud should only sound an alert if a real problem has occurred that restricts the functionality of the environment. The ability to adapt appropriately to change is key in these cases.
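One way to express this idea is to evaluate a group of instances as a whole instead of flagging each VM red or green. The following sketch is an illustration under assumptions: the `evaluate` function, the three severity names, and the `min_healthy` threshold are hypothetical.

```python
def evaluate(healthy, total, min_healthy):
    """Map the health of a whole group to a severity level.

    healthy:     instances currently passing their checks
    total:       instances that are supposed to run
    min_healthy: smallest number that still delivers the service
    """
    if healthy < min_healthy:
        return "critical"  # functionality is restricted: alert someone
    if healthy < total:
        return "warning"   # redundancy absorbed the failure: record it only
    return "ok"


print(evaluate(healthy=2, total=3, min_healthy=2))
```

With three web servers and a minimum of two, the loss of one VM only produces a warning; an alert is raised once fewer than two remain.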
Against this background, it is interesting that monitoring in clouds also differs in other respects: If you only want to monitor systems, you build the monitoring instance as an external component, independent of the setup itself. In the cloud, however, the results of monitoring operations immediately change the setup (e.g., with the addition or removal of VMs). The classical view of conventional monitoring obviously no longer works in such environments; in fact, clouds sometimes radically redefine monitoring.
How does the provider design its cloud monitoring to be flexible and dynamic? How do customers monitor their setups so that they scale horizontally automatically, or at least semi-automatically? These are the questions I investigate in this article, with the focus on Consul [1], which itself provides monitoring functions but can also be coupled well with solutions such as Prometheus [2].
What Works and What Doesn't
To begin, you should realize what can be achieved in terms of automatic scalability. Cloud providers have the easier job, because their platforms usually only grow horizontally and never shrink. Once the admin has added a node to the setup, they can usually assume that the node will remain in the setup permanently. From the provider's point of view, the situation is therefore simple: If the monitoring system sounds an alarm because a system has failed, the alarm is justified, because it is almost always an error when entire servers disappear.
From the user's perspective, however, it can be an absolutely legitimate scenario for VMs that were previously part of the setup to disappear – for example, when the cloud shrinks the setup because the current load is low. If monitoring raises an alert in such a case, it is a false alarm. What applies to the provider when nodes are added is, of course, also true for the cloud user: If the cloud adds new VMs to the existing setup, these should ideally be included automatically in the existing monitoring.
Good Luck, Bad Luck
From the cloud providers' point of view, the good news is that a whole range of different monitoring solutions now exist that enable the automatic detection of new nodes through auto-discovery. It looks much less rosy from the users' point of view, because monitoring that drives scaling within clouds is almost always tied to a specific platform and is usually an integral part of it.
Take Amazon, for example. Here, a separate service ensures that virtual setups expand and contract as needed; the provider markets the corresponding functionality under the name Auto Scaling [3]. However, it is practically impossible to use it outside of Amazon Web Services (AWS). The monitoring behind it is specifically adapted to AWS and is an implicit part of Auto Scaling.
Similarly, OpenStack [4] has a separate service called Senlin [5] for automatic scaling, which, however, has not seen widespread adoption to date. If you build your setup for OpenStack and Senlin, you can hardly put it to meaningful use in a non-OpenStack environment.
There is a good reason why automatic scaling is tied so closely to a specific platform: If a cloud scales a setup, an existing monitoring system must communicate intensively with the APIs of that cloud to obtain the necessary information about the desired state of the components. This is the only way to find out whether the actual state matches the target state or whether VMs that should be active are missing in the setup. This task can be handled much better if the components involved are part of the cloud itself.
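The comparison of target state and actual state can be sketched as a set difference. This is a minimal illustration, not the mechanism of any particular cloud: the `reconcile` function and the VM names are hypothetical, and in practice the desired state would come from the cloud's API.

```python
def reconcile(desired, actual):
    """Compare the VMs that should exist with the VMs that do exist."""
    desired, actual = set(desired), set(actual)
    return {
        "missing": sorted(desired - actual),     # should run but do not
        "unexpected": sorted(actual - desired),  # run but are not wanted
    }


# The cloud has scaled down from three to two web servers on purpose:
after_scale_down = reconcile(desired={"web-1", "web-2"},
                             actual={"web-1", "web-2"})

# A VM that should be active has disappeared:
after_failure = reconcile(desired={"web-1", "web-2"},
                          actual={"web-1"})

print(after_scale_down)
print(after_failure)
```

Because the deliberately removed VM no longer appears in the desired state, its absence does not count as missing; only the second case is a reason to alert.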