Getting started with the Apache Cassandra database
Believable
Cluster Playground
As a final check that everything is working as desired, you can use Portainer to open a Bash shell in one of the node containers and enter:
nodetool status
Nodetool is a comprehensive tool for managing, monitoring, and repairing the Cassandra cluster. The output of the status subcommand should look like Figure 5. All three nodes must appear, the status should be Up (U) and Normal (N), and each node should have an equal share of the data.
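With three healthy nodes, the output is structured roughly as follows; the addresses of the second and third nodes, the load figures, ownership percentages, and host IDs are placeholders and will differ on your system:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load   Tokens  Owns   Host ID   Rack
UN  172.21.0.2  ...    256     ...    ...       rack1
UN  172.21.0.3  ...    256     ...    ...       rack1
UN  172.21.0.4  ...    256     ...    ...       rack1

The UN in the first column is the combination you are looking for: Up and Normal.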
Finally, you can start playing around with the cluster. From the Docker host, log into the DC1N1 node created in Listing 7 by specifying its IP address and the port reserved for cqlsh in the configuration – by default, this is port 9042:
Listing 7
docker-compose.yml
version: '3'
services:
  # Configuration for the seed node DC1N1
  # The name could stand for datacenter 1, node 1
  DC1N1:
    image: cassandra:3.10
    command: bash -c 'if [ -z "$$(ls -A /var/lib/cassandra/)" ] ; then sleep 0; fi && /docker-entrypoint.sh cassandra -f'
    # Network for communication between nodes
    networks:
      - dc1ring
    # Map the volume to a local directory.
    volumes:
      - ./n1data:/var/lib/cassandra
    # Environment variables for the Cassandra configuration.
    # CASSANDRA_CLUSTER_NAME must be identical on all nodes.
    environment:
      - CASSANDRA_CLUSTER_NAME=Test Cluster
      - CASSANDRA_SEEDS=DC1N1
    # Expose ports for cluster communication
    expose:
      # Intra-node communication
      - 7000
      # TLS intra-node communication
      - 7001
      # JMX
      - 7199
      # CQL
      - 9042
      # Thrift service
      - 9160
    # Recommended Cassandra ulimit settings
    ulimits:
      memlock: -1
      nproc: 32768
      nofile: 100000

  DC1N2:
    image: cassandra:3.10
    command: bash -c 'if [ -z "$$(ls -A /var/lib/cassandra/)" ] ; then sleep 60; fi && /docker-entrypoint.sh cassandra -f'
    networks:
      - dc1ring
    volumes:
      - ./n2data:/var/lib/cassandra
    environment:
      - CASSANDRA_CLUSTER_NAME=Test Cluster
      - CASSANDRA_SEEDS=DC1N1
    depends_on:
      - DC1N1
    expose:
      - 7000
      - 7001
      - 7199
      - 9042
      - 9160
    ulimits:
      memlock: -1
      nproc: 32768
      nofile: 100000

  DC1N3:
    image: cassandra:3.10
    command: bash -c 'if [ -z "$$(ls -A /var/lib/cassandra/)" ] ; then sleep 120; fi && /docker-entrypoint.sh cassandra -f'
    networks:
      - dc1ring
    volumes:
      - ./n3data:/var/lib/cassandra
    environment:
      - CASSANDRA_CLUSTER_NAME=Test Cluster
      - CASSANDRA_SEEDS=DC1N1
    depends_on:
      - DC1N1
    expose:
      - 7000
      - 7001
      - 7199
      - 9042
      - 9160
    ulimits:
      memlock: -1
      nproc: 32768
      nofile: 100000

  # A web-based GUI for managing containers.
  portainer:
    image: portainer/portainer
    networks:
      - dc1ring
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./portainer-data:/data
    # Access to the web interface from the host via
    # http://localhost:10001
    ports:
      - "10001:9000"

networks:
  dc1ring:
$ cqlsh -uadmin 172.21.0.2 9042
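The IP address of the seed node depends on your setup; one way to look it up from the Docker host is docker inspect. The container name below is an assumption and follows the <project>_<service>_1 pattern that Docker Compose uses by default:

$ docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' cassandra_DC1N1_1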
Running cqlsh on the Docker host in this way requires a local Cassandra installation in addition to the cluster, because the CQL shell ships with Cassandra. Alternatively, you can use Portainer to open a Linux shell on one of the nodes and then launch cqlsh there. Next, entering
CREATE KEYSPACE pingtest WITH replication = {'class':'SimpleStrategy','replication_factor':3};
creates the keyspace for the ping run-time example on the three-node cluster with three replicas.
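To double-check that the keyspace really was created with three replicas, you can ask the cluster itself; both of the following work in cqlsh on Cassandra 3.x:

DESCRIBE KEYSPACE pingtest;
SELECT replication FROM system_schema.keyspaces WHERE keyspace_name = 'pingtest';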
High Availability, No Hard Work
Without having to do anything else, the database now stores copies of every row in the pingtest keyspace on every node in the cluster. For experimental purposes, the consistency level can be set to different values, either interactively for each session or by using the query in pingtest.pl.
A value of one (Listing 6, line 17) means, for example, that only one node needs to confirm the read or write operation for the transaction to be confirmed. More nodes are required for settings of two or three, and all of them if you select all. A quorum setting means that a majority of the replicas across all the data centers over which the cluster is distributed must confirm the operation, whereas local_quorum means that a majority of the replicas in the data center that hosts the coordinator node for this row is sufficient.
In this way, Cassandra achieves what is known as tunable consistency, which means that users can specify for each individual query what is more important to them. In an Internet of Things application, simple confirmation of a node might be sufficient, the upside being database performance boosts because the database does not have to wait for more nodes to confirm the results. With a financial application, on the other hand, you might prefer to play it safe and have the result of the operation confirmed by a majority of the nodes. In this case, you also have to accept that this will take a few milliseconds longer.
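In cqlsh, for example, you can switch the consistency level for the current session and rerun a query against the pingtime table from Listing 6 to compare behavior at both ends of the spectrum (the LIMIT query is just an illustration):

CONSISTENCY ONE;
SELECT * FROM pingtest.pingtime LIMIT 5;
CONSISTENCY ALL;
SELECT * FROM pingtest.pingtime LIMIT 5;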
Once you have set up the pingtime table (Listing 6, line 15) in the distributed pingtest keyspace and have enabled the cron job described earlier so that data is received, you should be able to retrieve the same data from the table on all nodes. If this is the case, then replication is working as intended.
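A quick way to compare the nodes, assuming you are in the directory containing the docker-compose.yml from Listing 7, is to run the same query through each service (a count is good enough as a rough check, even though counting is expensive on large tables):

$ docker-compose exec DC1N1 cqlsh -e "SELECT count(*) FROM pingtest.pingtime;"
$ docker-compose exec DC1N2 cqlsh -e "SELECT count(*) FROM pingtest.pingtime;"
$ docker-compose exec DC1N3 cqlsh -e "SELECT count(*) FROM pingtest.pingtime;"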
You could now use Portainer to shut down individual nodes. Depending on the consistency setting, the selects will provoke an error message if you address a surviving node that has the data available but cannot find a sufficient number of peers to confirm the result.
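If you prefer the command line to Portainer, stopping and later restarting a node is a one-liner each (again assuming the Compose project from Listing 7):

$ docker-compose stop DC1N3
$ docker-compose start DC1N3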
If you then reactivate a node some time later, no further steps are required in the ideal case. Cassandra takes care of the resynchronization on its own. The prerequisite is that the hinted_handoff_enabled variable in the central cassandra.yaml configuration file is set to true. Cassandra stores, in the form of hints, the write operations the node missed because of its temporary failure and automatically retrofits them as soon as the node becomes available once again. For more complicated cases, such as the node having been down for so long that you run out of space for hints, nodetool has a repair command.
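Both can be handled from a shell on the affected node; the configuration path shown here is the default in the official Cassandra Docker image and may differ in other installations:

$ grep hinted_handoff_enabled /etc/cassandra/cassandra.yaml
$ nodetool repair pingtest

The keyspace argument restricts the repair to pingtest; called without arguments, nodetool repair processes all keyspaces on the node.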
If you browse the Cassandra documentation [5], you will find many interesting commands and procedures that you can try out on a small test cluster; getting a feel for the database in this way will certainly pay dividends when you go live.
Conclusions
Cassandra is a powerful distributed NoSQL database especially suited for environments that need to handle large volumes of data and grow rapidly. Several advanced features, such as built-in failover, automatic replication, and self-healing after node failure, make it interesting for a wide range of application scenarios. However, migration from a traditional relational database management system (RDBMS) to Cassandra is unlikely to be possible without some overhead because of the need to redesign the data structure and queries.
Infos
- Lakshman, A., and P. Malik. Cassandra – A decentralized structured storage system. ACM SIGOPS Operating Systems Review, 2010; 44(2):35-40, http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
- Apache Cassandra: http://cassandra.apache.org
- DataStax Constellation: https://constellation.datastax.com
- Cassandra image: https://hub.docker.com/_/cassandra
- Cassandra documentation: https://docs.datastax.com/en/cassandra/3.0/index.html