Efficiently planning and expanding the capacities of a cloud
Cloud Nine
Planning used to be a tedious business for data center operators: A long time passed between a customer's initial idea and the point at which the servers were in the rack and productive. Today, thanks to the cloud, this time span is drastically shorter.
The cloud provider must control the infrastructure in the data center to the extent that the platform can expand easily and quickly (i.e., it is hyperscalable), without major upheavals and their associated costs. Even before installing the first server, you would do well to think about various scalability factors, taking into account the way in which customers will want to use the cloud resources and your ability to enable capacity expansion effectively and prudently.
Although the question of how the available capacity in the data center can be used as efficiently as possible is important, another critical question is how cloud admins collect and interpret metric data from their platform to identify the need for additional resources at an early stage.
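One simple way to identify the need for additional resources early is to extrapolate a usage trend from the collected metrics. The following is a minimal sketch, assuming one usage sample per day and roughly linear growth; the function name days_until_threshold, the 80 percent warning threshold, and the sample values are illustrative assumptions and do not come from any particular monitoring tool.

```python
from typing import Optional, Sequence


def days_until_threshold(daily_usage: Sequence[float], capacity: float,
                         threshold: float = 0.8) -> Optional[float]:
    """Estimate the days left until usage reaches threshold * capacity.

    Fits a least-squares line through the samples (one per day) and
    extrapolates it. Returns None if usage is flat or shrinking and
    0.0 if the fitted line is already at or above the limit.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples")

    # Ordinary least squares with x = 0, 1, ..., n-1 (day index).
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(daily_usage))
    slope = sxy / sxx
    if slope <= 0:
        return None  # no growth, so no crossing to predict

    intercept = mean_y - slope * mean_x
    limit = threshold * capacity
    current = intercept + slope * (n - 1)  # fitted value for the latest day
    return max(0.0, (limit - current) / slope)
```

With daily readings of 100, 110, 120, 130, and 140 used cores on a platform with 1,000 cores, the sketch estimates 66 days until the 80 percent mark is reached. Real usage is rarely this linear, so a result like this is a prompt to look closer, not a delivery date for new hardware.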
In this article, I examine the ingredients for efficient capacity planning in the cloud and explain how they can best be implemented in day-to-day operations. The appropriate hardware makes horizontal scalability easy, and the correct software allows metering and automation.
Farsighted Action
Cloud environments reduce administrative overhead to such an extent that the operator can and must provide within seconds what formerly involved weeks of lead time. The cloud provider must promise its customers that they will have access to virtually unlimited resources, which in turn means a huge amount of effort, both operationally and financially. This scenario only works if the provider keeps a certain buffer of reserve hardware at the data center. However, providers cannot know which customers will want to use which resources in their setups, or when. The question of how cloud capacities can be managed and planned is therefore one of the most important for cloud providers.
New Networking
Networking plays a central role. In the classic setup, each environment is planned carefully, and its scalability limits are fixed right from the start. Anyone planning a web server setup, for example, expects to have to replace the entire platform after five years anyway, because the warranty on the installed servers will expire. The admins plan these installations with a fixed maximum size and design the necessary infrastructure, such as the network, to match.
Cloud admins don't have this luxury. A cloud must be able to grow massively within a day or a week, even beyond the limits of the original planning. Besides, a cloud has no expiration date: Because cloud software can usually scale horizontally without interruption, old servers can be replaced continuously by new ones.
If you look at existing networks in conventional setups, you usually come across a classic tree or hub-and-spoke structure, in which one switch (or two, for reliability) is connected by appropriately fast links to further switches, to which the nodes of the setup are then attached. This network design is hardly suitable for scale-out installations, because the farther the admin extends the tree structure downward, the less bandwidth arrives at the individual nodes.
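The shrinking bandwidth in a tree can be put into numbers. The following sketch computes the worst-case share per node when every node sends traffic toward the root at the same time; the function name worst_case_share_gbps and all port speeds and fan-outs are hypothetical values chosen for illustration, not measurements.

```python
from typing import Sequence


def worst_case_share_gbps(uplinks_gbps: Sequence[float],
                          fanouts: Sequence[int],
                          nic_gbps: float) -> float:
    """Worst-case bandwidth per node when all nodes send toward the root.

    uplinks_gbps[i] is the uplink capacity of one switch at tier i
    (tier 0 is the switch the nodes plug into). fanouts[i] is the
    number of children per switch at tier i: nodes for tier 0,
    lower-tier switches for every tier above.
    """
    share = nic_gbps      # a node can never exceed its own NIC speed
    nodes_below = 1
    for uplink, fanout in zip(uplinks_gbps, fanouts):
        nodes_below *= fanout  # nodes sharing this tier's uplink
        share = min(share, uplink / nodes_below)
    return share


# One switch: 40 nodes with 10Gbps NICs behind a 40Gbps uplink.
one_tier = worst_case_share_gbps([40], [40], nic_gbps=10)

# Tree extended downward: four such switches behind another 40Gbps uplink.
two_tiers = worst_case_share_gbps([40, 40], [40, 4], nic_gbps=10)
```

Under these assumed numbers, each node is left with 1Gbps in the worst case with a single tier and only 0.25Gbps once the second tier is added, although every node still has a 10Gbps network card.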
Better Scaling
If you are planning a cloud, you will want to rely on a Layer 3 leaf-spine architecture from the start because of its scalability. This layout differs from the classic approach primarily in that the switches are dumb packet forwarders without their own management functions.
Physically, the network is divided into two layers: The leaf switches are installed in each rack and connect the hosts, and the spine switches interconnect the leaf switches, with each leaf connected to the spine layer via an arbitrary number of paths. The connection to the outside world is established through routers.
In this scenario, packets no longer find their way through the network according to Layer 2 protocols; instead, routes are exchanged between hosts using the Border Gateway Protocol (BGP). Each switch functions as a BGP router, and each host also speaks BGP.
If horizontal scaling is required, all levels of the setup can easily be extended by adding new switches, even during operation. New nodes simply use BGP to inform others of the route via which they can be reached. Because the switches route the traffic, even hosts that are not on the same network can communicate with each other without problems. Therefore, if you want to roll out 30 new racks in your cloud ad hoc, you can do so. In contrast, classical network designs quickly reach their limits.
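How a new node becomes reachable by announcing its route can be illustrated with a toy model. The sketch below is not BGP: It has no AS numbers, policies, timers, or withdrawals and only mimics the path-vector idea that an announcement is passed from neighbor to neighbor, is rejected if the receiver is already on the path, and is kept if it is among the shortest known. All names (build_fabric, propagate, s1, l1, h1, and so on) are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Mapping, Set, Tuple

Path = Tuple[str, ...]


def build_fabric(spines: Iterable[str], leaves: Iterable[str],
                 hosts_by_leaf: Mapping[str, Iterable[str]]
                 ) -> Dict[str, Set[str]]:
    """Connect every leaf to every spine and each host to its leaf."""
    links: Dict[str, Set[str]] = {}

    def connect(a: str, b: str) -> None:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)

    spines = list(spines)
    for leaf in leaves:
        for spine in spines:
            connect(leaf, spine)
    for leaf, hosts in hosts_by_leaf.items():
        for host in hosts:
            connect(leaf, host)
    return links


def propagate(links: Mapping[str, Set[str]],
              origin: str) -> Dict[str, List[Path]]:
    """Return, for every device, its shortest path(s) toward origin."""
    best: Dict[str, List[Path]] = {origin: [(origin,)]}
    queue = deque([(origin, (origin,))])
    while queue:
        device, path = queue.popleft()
        for neighbor in sorted(links[device]):
            if neighbor in path:
                continue  # loop prevention: already on this path
            candidate = (neighbor,) + path
            known = best.get(neighbor)
            if known is None or len(candidate) < len(known[0]):
                best[neighbor] = [candidate]
            elif len(candidate) == len(known[0]) and candidate not in known:
                known.append(candidate)  # equal-cost alternative
            else:
                continue  # longer or duplicate: do not pass it on
            queue.append((neighbor, candidate))
    return best


# Two spines, two racks, then a third rack added during operation.
fabric = build_fabric(["s1", "s2"], ["l1", "l2", "l3"],
                      {"l1": ["h1"], "l2": ["h2"], "l3": ["h3"]})
routes_to_h3 = propagate(fabric, "h3")
# routes_to_h3["h1"] holds one path via each spine.
```

In this toy fabric, the host h1 ends up with two equal-length paths to the newly added h3, one through each spine, without any change to the existing devices. That is the property that lets a leaf-spine network grow rack by rack.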
However, BGP can currently only run on switches that offer the feature as an expensive part of their own firmware – or that support Cumulus Linux. Mellanox sets a good example by offering future-proof 100Gb switches with Cumulus support. (See a previous ADMIN article [1] about the Layer 3 principle for more information.)