Sharding and scale-out for databases
Shards
Scalable databases have mushroomed in recent years and, in some cases, rely on completely new architectural approaches. For example, YugabyteDB is a key-value store, but it offers a MySQL compatibility layer. Vitess, on the other hand, uses native MySQL databases in the background but inserts an abstraction layer between the user and the database management system (DBMS) that takes care of sharding and, in turn, horizontal scalability. The Apache project takes a similar approach with its ShardingSphere [1] variants but promises easy handling, seamless extensibility with plugins, and support for most common databases.
Redundancy and Scalability
Databases are a central component of most complex setups. True to the motto "a special tool for every task," a long-established practice is to let databases take care of data management because they solve the task best and most efficiently. The changes that have occurred since the advent of cloud computing – containers and the cloud-ready principle – also lead to tougher requirements for databases. In a scalable environment, the database also needs to be able to scale; single instances of MySQL or PostgreSQL are no longer sufficient.
Out-of-the-box DBMSs come with quite a few limitations that make them unsuitable for operation in cloud environments. Two problems are immediately apparent: First, modern environments assume that the services running in them are implicitly redundant. Typically – and this is especially true in microarchitecture services – any number of instances of a service are available for each task, monitoring each other and seamlessly taking over the tasks of a failed instance in an emergency. Anyone who has ever faced the challenge of achieving high availability for MySQL will be aware that, out of the box, MySQL does not come with any functions for this task. The same is true for PostgreSQL.
The second problem arises from the need for scalability. Monolithic databases of the past, such as MySQL or PostgreSQL, are not designed for seamless horizontal scaling. Instead, the principle is that exactly one central instance of the service is present, where the capacity of the single server can be expanded, at most. In the cloud context, this arrangement is a problem primarily because it is diametrically opposed to the requirements described for cloud-ready applications.
Sharding
Before discovering how the Apache project addresses these problems with ShardingSphere, a review of terminology is needed, especially with regard to the concept of "sharding." You might be familiar with sharding from the early days of IT, although it referred to a completely different technical issue back then.
The journey goes way back into the past, to a time when the capacity of hard disks was still measured in gigabytes, not terabytes. Even then, the operators of large mail servers regularly had the problem of running out of local space for messages. At that time, the term "sharding" primarily meant the distribution of user mailboxes to different machines, which continued to appear to the outside world as one logical mail server. In principle, sharding in ShardingSphere is no different, except it involves databases, not email.
In the database context, sharding means breaking down a database namespace into its logical elements and distributing them to different databases in the background. DBMS sharding is therefore ultimately just an abstraction layer that provides a uniform view to the outside world and, when access occurs, knows to which of the available instances it needs to forward a client's request. Sharding supports additional features in this way, such as replicating individual parts of the namespace (or shards) between different database instances in the background. Deduplication, parallel read-only or read-write access, and encryption during data transfer (on the fly) and when storing data (at rest) can also be implemented in the abstraction layer.
Architecture
Saying that ShardingSphere is a solution is not completely correct. It comprises two components with different feature sets that can be extended with plugins. The reason is historical: ShardingSphere was initially developed as a Java Database Connectivity (JDBC) module (Figure 1).
JDBC describes a driver environment for accessing databases with Java and offers the huge advantages of modularity and the ability to stack and combine modules within the JDBC environment by connecting them in series. The motivation for ShardingSphere was to create a flexible intermediate layer for the dominant databases on the market at the time (e.g., MySQL, PostgreSQL) for sharding, encryption, redundancy, and availability without having to make major modifications to the database itself. Instead, it was enough to define a pool of database back ends and let ShardingSphere do the rest. This scenario is probably where ShardingSphere is used most frequently.
Shortly after its inception in 2016, ShardingSphere-JDBC caused quite a stir. Soon people were looking for a way to use its functionality outside of JDBC, which saw the birth of the second ShardingSphere variant, simply known today as ShardingSphere-Proxy (Figure 2). It does essentially the same as the JDBC variant but comes as a standalone service. In the background, ShardingSphere-Proxy manages a pool of connections to database back ends, much like JDBC, while maintaining protocol compatibility with MySQL and PostgreSQL. Client-side databases connect to the Proxy instead of directly to the database.
The advantage of ShardingSphere is that by managing and controlling the connection between client and server, it can implement all kinds of practical features without requiring a special configuration client-side or on the server. One of the main architectural principles of ShardingSphere is that database clients must always be able to talk to a DBMS in its SQL dialect through the service without errors and without any special customization. Therefore, ShardingSphere always remains transparent from both the client's and the server's point of view.
The functional range of ShardingSphere is quite impressive. The linchpin of all features is, as described, the ability to break down a database into small logical segments – or shards, if you like. The developers emphasize that they can scale both the storage of data and any compute tasks horizontally: The problem with a database often is not that the individual instance does not have enough local space, but rather that it collapses under the load of incoming requests because resources such as CPU and RAM are finite.
Sharding, as implemented by ShardingSphere, avoids this problem, because the back ends that are currently holding the respective shards are still responsible for processing the queries relating to the individual database shards. If two large queries access different data, they are handled by different back ends and do not affect the entire database.
Buy this article as PDF
(incl. VAT)