Getting started with the Apache Cassandra database

More Data and a Cluster

A second example illustrates a more typical NoSQL use case. This time, measured values and timestamps need to be saved. Also, instead of a single Cassandra instance, a small cluster will be used, which can be set up with the help of Docker for demo purposes. The measurement data here are the round-trip times of pings against the Linux-Magazin website.

Cron executes a simple shell script every minute, which essentially comprises only one line:

/bin/ping -c1 -D -W5 www.linux-magazin.de | /home/jcb/Cassandra/pingtest.pl
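A matching crontab entry might look like the following; the wrapper script's name and path are assumptions, so adjust them to wherever you saved the one-liner:

* * * * * /home/jcb/Cassandra/pingtest.sh >/dev/null 2>&1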

The script might also need to set some environment variables for Perl. The pingtest.pl Perl script (Listing 6) takes the output from ping, extracts the values with a regular expression, and stores them in the Cassandra database. In production use, error handling would be necessary, but at this point I want to keep things as simple as possible.

Listing 6

pingtest.pl

#!/usr/bin/perl
use strict;
use warnings;
use Cassandra::Client;

# The contact points are the internal IP addresses of the Cassandra nodes
my $client = Cassandra::Client->new(
    contact_points => ['172.21.0.2', '172.21.0.4', '172.21.0.5'],
    username       => 'admin',
    password       => '********',
    keyspace       => 'pingtest'
);

$client->connect;

# Extract the Unix timestamp (from ping -D) and the round-trip time
foreach my $line ( <STDIN> ) {
    if ( $line =~ /\[(\d+\.\d+)\](.*)time=(\d+\.\d+)/ ) {
        $client->execute(
            "INSERT INTO pingtime (tstamp, tvalue) VALUES(?, ?)",
            [ $1 * 1000, $3 ],    # Cassandra timestamps are in milliseconds
            { consistency => "one" }
        );
    }
}

$client->shutdown;

The timestamp, which ping prints thanks to the -D option, represents Unix time (i.e., the number of seconds since January 1, 1970, 0:00 hours). It is multiplied here by 1,000 to obtain the value in milliseconds needed by Cassandra. Also bear in mind that this time is in UTC, which explains the time difference of two hours from central European summer time (CEST=UTC+2) if you live in Germany, as I do.

If you later stop and restart the cluster several times, the internal IP addresses of the node computers can change. If this happens, the addresses in the Perl script need to be adapted; otherwise, the database server will not be found. The script uses the Cassandra::Client module, which you can best install from the Comprehensive Perl Archive Network (CPAN). All dependencies – and there are many in this case – are then automatically resolved by the installation routine.
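If you need to look up the current addresses, one way is to query Docker directly; the container name here is an assumption – check docker ps for the actual name on your system:

docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' cassandra_cassandra-1_1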

The easiest way to do this is to start an interactive CPAN shell and install the module:

perl -MCPAN -e shell
install Cassandra::Client

A C compiler is also required.
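On Ubuntu, for instance, the build-essential package provides one:

sudo apt install build-essential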

The column family that stores the timestamp and the ping round-trip time (RTT) is simple:

CREATE TABLE pingtime (
  tstamp timestamp,
  tvalue float,
  PRIMARY KEY (tvalue, tstamp)
) WITH CLUSTERING ORDER BY (tstamp DESC)
  AND default_time_to_live = 259200;

It will take this table a long time to accumulate terabytes of data, but at one ping per minute it still collects roughly 1,440 entries per day. If you want to limit the number, you can – as in the example above – specify a time to live (TTL) value (in seconds) for the entire table. Cassandra then automatically deletes the data after this period – three days in this example. Instead of a definition for the whole table, the retention time can also be defined per INSERT. In this case, you would append USING TTL 259200 to the corresponding statement, as shown below.
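For example (the timestamp and RTT values are purely illustrative):

INSERT INTO pingtime (tstamp, tvalue)
VALUES (1656799200000, 14.7)
USING TTL 259200;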

This useful feature might tempt you to think about using Cassandra even if you are not dealing with a huge volume of data.

Thus far, I have only scratched the surface of one genuine Cassandra highlight: Clustering and replication – and thus high-performance, fail-safe databases – do not require massive overhead on the part of the admin. The database handles most of these tasks automatically.
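For instance, the pingtest keyspace used above could be created with its data replicated to all three nodes of the demo cluster – a minimal sketch; the choice of SimpleStrategy is an assumption that only suits single-datacenter test setups:

CREATE KEYSPACE pingtest
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};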

In Practice

You might want to avoid following the steps in this section on a low-powered laptop, because the load might make it difficult to use – at least temporarily.

To set up a small Cassandra cluster based on Docker, you need to install Docker Compose, a tool for managing multicontainer applications, in addition to Docker. If the cluster is to be operated across different physical hosts, you would have to choose Docker Swarm or Kubernetes; however, this example is deliberately restricted to one host, which keeps configuration overhead low and the installation simple.
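On Ubuntu, for example, both can be installed from the standard repositories; the package names below are an assumption and vary by release:

sudo apt install docker.io docker-compose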

Afterward, you need to enable Docker control over the TCP interface – instead of just through the local socket file – by editing /lib/systemd/system/docker.service. Add a hash mark in front of the existing ExecStart line, as shown, and add the second line in its place:

#ExecStart=/usr/bin/dockerd -H fd://
ExecStart=/usr/bin/dockerd -H fd:// -H tcp://0.0.0.0:2375

Next, type:

systemctl daemon-reload
service docker restart

Log out as root and test your setup with a non-privileged user ID by typing:

curl http://localhost:2375/version

Working as this user, create a cassandra subdirectory for the cluster configuration somewhere below your home directory and a docker-compose.yml file with the content from Listing 7. This file provides all the information Docker needs to find the right images; download, unpack, and install them on the desired number of nodes; establish a network connection between the nodes; and create storage volumes.

The Compose file defines four services as components of the application. Each service uses exactly one Docker image. Three of the services are Cassandra nodes; the fourth, Portainer, provides a simple web GUI for controlling the service containers.

The environment section sets some environment variables, which later override settings of the same name in the central Cassandra configuration file, /etc/cassandra/cassandra.yaml, on the respective node. In this way, you can manage with just one image for all Cassandra nodes and still configure each node individually. In addition to the variables shown here, a number of other variables are available if needed. Details can be found in the documentation for the Cassandra image [4].

In the Compose file, the image name is followed by a fairly cryptic shell command that, on the very first launch – when the Cassandra data directory is still empty – starts the nodes one after another at intervals of one to two minutes. Attempting to start everything at the same time on first launch simply does not work, as the sketch below illustrates.
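Listing 7 itself is not reproduced here, but a minimal Compose file for a setup like this might look as follows. This is a sketch only: the image tags, volume names, Portainer port mapping, and the sleep-based startup delays are assumptions, not the article's exact listing.

version: "2"

services:
  cassandra-1:
    image: cassandra:3.11
    environment:
      - CASSANDRA_CLUSTER_NAME=pingtest
      - CASSANDRA_SEEDS=cassandra-1
    volumes:
      - data1:/var/lib/cassandra

  cassandra-2:
    image: cassandra:3.11
    environment:
      - CASSANDRA_CLUSTER_NAME=pingtest
      - CASSANDRA_SEEDS=cassandra-1
    # Crude stand-in for the startup-delay command described above:
    # wait before joining so the seed node comes up first.
    command: bash -c 'sleep 60 && exec /docker-entrypoint.sh cassandra -f'
    volumes:
      - data2:/var/lib/cassandra

  cassandra-3:
    image: cassandra:3.11
    environment:
      - CASSANDRA_CLUSTER_NAME=pingtest
      - CASSANDRA_SEEDS=cassandra-1
    command: bash -c 'sleep 120 && exec /docker-entrypoint.sh cassandra -f'
    volumes:
      - data3:/var/lib/cassandra

  portainer:
    image: portainer/portainer
    command: -H unix:///var/run/docker.sock
    ports:
      - "10001:9000"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

volumes:
  data1:
  data2:
  data3: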

Now the $DOCKER_HOST environment variable has to be set on the host side:

export DOCKER_HOST=localhost:2375

The installation process, which can take a little while, is started from the cassandra directory with the following command:

docker-compose up -d
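Once the containers are running, you can verify that the nodes have found one another with Cassandra's nodetool; the container name is an assumption – check docker ps for the actual name:

docker exec -it cassandra_cassandra-1_1 nodetool status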

Portainer

As soon as everything is finished, the Portainer GUI can be called up in the browser at localhost:10001 (Figure 2) for an overview of the running containers, their states, the images they are based on, the networks, and the volumes.

Figure 2: The GUI start page provides an overview of the components.

From here, you can click your way to more specific views. For example, from an overview page of the containers (Figure 3), you can start, stop, pause, resume, remove, or add containers. The overview also reveals the IP addresses of the individual containers on the internal network. From here, you can check the container log (Figure 4), display some performance statistics for each container, inspect the container configurations, or log into a shell on a container.

Figure 3: Container overview page: The last button under the Quick actions column opens a shell on the container.
Figure 4: Viewing the log of a Cassandra container with the Portainer web GUI.

If you dig down even deeper into the details of a single container, you will see that you can create a new image from the running container, which means you can persist changes to the database configuration. The access rights to the container can also be set here.
