OpenStack Trove for users, sys admins, and database admins
Semiautomatic
In the wake of cloud computing and under the leadership of Amazon, a number of as-a-service resources have for several years been cornering the market previously owned by traditional IT setups. The idea is quite simple: Many infrastructure components, such as databases, VPNs, and load balancers, are only a means to an end for the enterprise.
If your web application needs a place to store its metadata, a database is usually used. However, the company that runs the application has no interest in dealing with a database. A separate server, or at least a separate virtual machine (VM) together with an operating system, would need to be set up and configured for the database. Issues such as high availability increase the complexity. A database with a known login and address that the application can connect to would work just as well, which is where Database as a Service (DBaaS) comes in.
The advantage of DBaaS is that it radically simplifies the deployment and maintenance of the relevant infrastructure. The customer simply clicks the button for a new database in the web interface, and the database is configured and available shortly thereafter. The supplier ensures that redundancy and monitoring are included, as well.
A DBaaS component for OpenStack named Trove [1] has existed for about three years. Although you can integrate it into an existing OpenStack platform, Trove alone is unlikely to make you happy. If you look into the topic in any depth, you will notice that vendors, users, and database administrators have to work hand in hand to create a useful service in OpenStack on the basis of Trove.
In this article, I tackle the biggest challenges that operating Trove can cause for all stakeholders. OpenStack vendors can discover more about the major obstacles in working with Trove, and cloud users can look forward to tips for handling Trove correctly in everyday life.
Performance
Database performance, in particular, causes headaches for cloud providers for obvious reasons: Whereas databases in conventional setups are regularly hosted on their own hardware, in the cloud, they share the same hardware with many other VMs.
Storage presents an even greater challenge. A database, such as MySQL, running on real metal can connect to its local storage – usually a hard disk or fast SSD on the same computer – without a performance hit. However, VMs that run in clouds usually do not have local storage; instead, they use volumes that access network storage in the background.
A typical example is Ceph used as a storage back end for OpenStack. Each write operation on a VM results in multiple network reads and writes: The Ceph client on the virtualization server receives the write action and passes it to the primary storage device in the Ceph cluster – that is, in Ceph-speak, its primary Object Storage Device (OSD). This primary OSD then sends the same data in a second step to as many other OSDs as defined by its replication policy (Figure 1).
Only when sufficient replicas are created in the Ceph cluster does the VM's Ceph client send confirmation that the write access was successful. The database client, which originally only wanted to change a single entry in MySQL, thus waits through several network round trips for the operation to complete successfully.
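The write path described above can be turned into a rough back-of-the-envelope model. The following Python sketch is only an illustration, not a measurement of Ceph: the function names and all latency figures are hypothetical, and it assumes that the primary OSD forwards the data to all secondary OSDs in parallel while committing its own copy.

```python
def replicated_write_latency_us(client_to_primary_us, primary_to_replica_us,
                                disk_commit_us, replicas=3):
    """Estimate the latency of one acknowledged write, in microseconds.

    Simplified model (assumption): the client sends the write to the
    primary, the primary forwards it to all other replicas in parallel
    while committing locally, and acknowledges once every copy is stored.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")

    # Client -> primary for the data, primary -> client for the confirmation.
    client_round_trip = 2 * client_to_primary_us

    # Primary -> replica, commit on the replica, replica -> primary.
    if replicas > 1:
        replica_phase = 2 * primary_to_replica_us + disk_commit_us
    else:
        replica_phase = 0

    # The primary's own commit overlaps with the replica phase,
    # so the slower of the two determines the waiting time.
    return client_round_trip + max(disk_commit_us, replica_phase)


def local_write_latency_us(disk_commit_us):
    """A write to local storage only pays for the commit itself."""
    return disk_commit_us


# Hypothetical figures: 100 us per network hop, 50 us per SSD commit.
networked = replicated_write_latency_us(100, 100, 50, replicas=3)
local = local_write_latency_us(50)
print(networked, local)
```

With these made-up numbers, the replicated write takes several times as long as the local one, although the storage devices themselves are equally fast. The difference comes entirely from the network hops.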
This problem is by no means specific to Ceph: Virtually all solutions for distributed storage in clouds have similar problems. Ceph stands out as a particularly bad example, because the Controlled Replication Under Scalable Hashing (CRUSH) algorithm, which calculates the primary OSD and the secondary OSDs, is particularly prone to latency.
From the provider's point of view, the problem is difficult to manage because a lower limit is clearly defined. Ethernet has an inherent latency that can only be reduced using latency-optimized transport technologies (e.g., InfiniBand), which means the provider chooses a different network technology that has its own challenges.
Paths and Dead Ends
Which approaches are open to a provider to get latency under control for DBaaS? The obvious approach is not to store VMs for databases from Trove on network storage, but to run them with local storage. In the OpenStack context, this means that the VM and its hard disk do not reside in Ceph or on any other network storage medium, but directly on the local storage medium of the compute node. In such a scenario, it is advisable to start the VM on a node with SSDs, because SSDs offer noticeable performance gains with regard to throughput and latency.
The provider would have to configure their OpenStack to do this: Typically, they would set a separate availability zone with fast local storage and then give customers the opportunity to accommodate Trove databases there.
However, what looks like a good idea at first glance turns out to be a horror scenario on closer inspection. A VM that has been started in this way has no redundancy at all. If the hypervisor node with the VM fails, the VM is simply not accessible. If the disk on which the VM and the database are located fails, the data is lost, and the user or provider can only resort to a backup (which, one hopes, they have created).
Even if you do not assume the horror scenario of a hardware failure, this type of setup harbors more dangers than benefits for the vendor: If a VM only exists locally, it cannot be moved to another host without downtime. Yet moving VMs between hosts is precisely what everyday life in a cloud with hundreds of nodes requires, because, otherwise, the individual servers are virtually impossible to maintain. No matter how you look at it, VMs located on local storage of the individual hypervisor nodes are definitely not a good idea.
Evaluation Is Everything
Despite all the disadvantages of local storage, it is also clear that the latency of local storage can never be achieved with network-based storage, especially in the case of sequential writing. Anyone used to using MySQL on Fusion ioMemory (Figure 2) will almost always experience an unpleasant surprise when switching to a DBaaS database in the cloud.
An area of conflict in which cloud providers are practically always entangled is the question of what the setup needs to cover. Until a robust answer is given, it is virtually impossible to find a suitable storage solution for databases in the cloud.
Many setups for cloud customers, especially small ones, impose minimal requirements on the database, so network-based storage would be perfectly fine. However, anyone who wants to run large setups with thousands of simultaneous database requests is in trouble. In the first step, the supplier therefore has to analyze the customer's needs to provide the basis for further planning.
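A first, very rough estimate for such an analysis could look like the following Python sketch. It assumes that every database connection commits its writes strictly one after another and that the storage back end scales ideally with the number of connections, which is optimistic. The function names and the latency figures (450 and 50 microseconds per commit) are hypothetical and serve only to show the arithmetic.

```python
def max_serial_commits_per_second(commit_latency_us):
    """Upper bound for one connection that waits for each commit to finish."""
    if commit_latency_us <= 0:
        raise ValueError("commit latency must be positive")
    return 1_000_000 // commit_latency_us


def connections_needed(required_commits_per_second, commit_latency_us):
    """Parallel connections needed to reach the target (ideal scaling assumed)."""
    per_connection = max_serial_commits_per_second(commit_latency_us)
    if per_connection == 0:
        raise ValueError("commit latency exceeds one second")
    # Ceiling division without floating-point arithmetic.
    return -(-required_commits_per_second // per_connection)


# Hypothetical target of 5,000 commits per second.
for label, latency_us in (("network storage", 450), ("local SSD", 50)):
    print(label,
          max_serial_commits_per_second(latency_us),
          connections_needed(5000, latency_us))
```

If the customer's application cannot spread its writes across enough parallel connections to reach the required rate, the per-commit latency of the storage back end becomes the limiting factor.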