Manage status messages in CouchDB with MapReduce
On the Couch
Whether the Internet of Things (IoT) or a server landscape, microservices or cron jobs, applications produce all kinds of status messages that you need to collect and evaluate. Whether it's an alert from the robot lawnmower, the abrupt termination of a long-running task, or simply a weather warning, storing the various messages centrally and evaluating them independent of their structure is always going to be a challenge (Listing 1).
Listing 1
Example Status Messages
{ "timestamp":"202405141201", "source":"gardenrobot-1",
  "message": { "type":"alert", "value":"animal" } }
{ "timestamp":"202405141220", "source":"gardenrobot-1",
  "message": { "type":"warning", "value":"low power" } }
{ "timestamp":"202405150200", "source":"server",
  "message": { "type":"task done", "value":"night backup", "result":"done", "errors":[] } }
{ "timestamp":"202405160800", "source":"18739949083333", "rss":"weatherchannel", "region":"Berlin",
  "message": { "type":"warning", "value":"rain", "category":"heavy", "chance":"80%" } }
CouchDB can help with centralized acquisition and subsequent filtering of status messages (e.g., by number, hour, or source). Its main advantage, and one that it shares with other NoSQL databases, lies in the schema-free nature of the data. Each CouchDB dataset can have its own structure, as long as it can be mapped in JSON format. This freedom means that many different status messages can be managed and queried in a single database without the need for tables adapted to the structure of the messages.
If the format of a status message changes, you just need to reflect this in the queries; no changes need to be made to the database that stores the messages. A document-based database of this type makes sense for dissimilar data structures such as status messages from different sources. If the database – like CouchDB – also supports simple clustering, replication, and query options, it is definitely worth a second look.
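To make the idea concrete, the following minimal sketch evaluates the four messages from Listing 1 in plain Python, without CouchDB. The function names are hypothetical, and the code assumes that the timestamps follow the pattern YYYYMMDDhhmm, which the examples suggest but the article does not state. The evaluation touches only the fields it needs and ignores everything else in a document.

```python
from collections import Counter

# The four status messages from Listing 1 as Python dictionaries.
MESSAGES = [
    {"timestamp": "202405141201", "source": "gardenrobot-1",
     "message": {"type": "alert", "value": "animal"}},
    {"timestamp": "202405141220", "source": "gardenrobot-1",
     "message": {"type": "warning", "value": "low power"}},
    {"timestamp": "202405150200", "source": "server",
     "message": {"type": "task done", "value": "night backup",
                 "result": "done", "errors": []}},
    {"timestamp": "202405160800", "source": "18739949083333",
     "rss": "weatherchannel", "region": "Berlin",
     "message": {"type": "warning", "value": "rain",
                 "category": "heavy", "chance": "80%"}},
]


def message_type(doc):
    """Return the message type, or 'unknown' if the field is missing."""
    return doc.get("message", {}).get("type", "unknown")


def count_by_type(docs):
    """Count documents per message type, ignoring all other fields."""
    return Counter(message_type(doc) for doc in docs)


def count_by_day(docs):
    """Count documents per day (assumed: first eight timestamp digits)."""
    return Counter(doc["timestamp"][:8] for doc in docs)
```

If a source later adds or drops fields, only message_type() or count_by_day() would need to change; the stored documents stay as they are.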
CouchDB is one of the original NoSQL databases. Damien Katz developed the software back in 2005. The basic idea came from his previous job as a senior developer on Lotus Notes, a distributed collaboration platform. He combined the schema-less, document-oriented approach of Lotus Notes with the – at the time – relatively new MapReduce technology, which can query large amounts of data on distributed systems. CouchDB has been an Apache project since 2008. Version 1.0 was released in 2010, and the software has since evolved to version 3.3.
"Couch" was originally an acronym for "cluster of unreliable commodity hardware," which reflected the fact that the system also works well without powerful high-availability servers. However, a second meaning (represented by the logo with the couch) could refer to the simplicity with which databases can be set up without a fixed schema.
The capabilities of CouchDB go beyond a pure key-value store. The database system shines with a multimaster replication model, ACID-compliant (atomicity, consistency, isolation, and durability) document storage, indexing functions based on MapReduce technology in JavaScript, and the Mango query language.
The needed binaries and information for installing CouchDB can be found on the project's website [1]. For Windows and macOS, just download the installers directly; for CentOS/Debian and Ubuntu, a little typing at the command line is all it takes to install the packages directly from the repositories, where the source files also reside.
CouchDB is written in Erlang/OTP (Open Telecom Platform), a functional language from the world of telecommunications. The strengths of this language type are its simple parallel processing, fault tolerance, and robustness. From the outset, CouchDB development focused on distributed databases on the network. Erlang was the tool of choice for this application to ensure specifically the security and consistency of data in a cluster with a high load.
During installation, you need to distinguish between a standalone install and a variant as part of a cluster; the standalone installation is fine if you just want to gather an initial impression.
Fauxton
CouchDB does not use its own transport format for communication but relies entirely on its HTTP REST API. After completing the installation with the default values, the CouchDB instance listens on port 5984. To test the installation, open the base address (localhost, an IP address, or a domain name, depending on the installation) in your browser or send a GET request to it; CouchDB responds with a short status message:
{
  "couchdb":"Welcome",
  "version":"3.3.3",
  "git_sha":"40afbcfc7",
  "uuid":"3b74c04721ee61dbe9db74ac3c69e8f8",
  "features":["access-ready", "partitioned", "pluggable-storage-engines", "reshard", "scheduler"],
  "vendor":{"name":"The Apache Software Foundation"}
}
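The same test can be scripted. The following sketch uses only Python's standard library (urllib.request); the base address, the timeout, and both function names are assumptions for a local default installation.

```python
import json
import urllib.request

COUCHDB_URL = "http://localhost:5984"  # assumption: local default install


def fetch_welcome(base_url=COUCHDB_URL):
    """Send a GET request to the base address and return the parsed JSON."""
    with urllib.request.urlopen(base_url, timeout=5) as response:
        return json.loads(response.read())


def describe_instance(welcome):
    """Build a one-line summary from the welcome document."""
    vendor = welcome.get("vendor", {}).get("name", "unknown vendor")
    return f"CouchDB {welcome.get('version', '?')} ({vendor})"
```

With a running instance, print(describe_instance(fetch_welcome())) outputs the version and vendor taken from the response.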
The vendor name can be changed easily later on, as can the port and other settings that have not been addressed here. The built-in front end, Fauxton, is fine for getting to know the basics of CouchDB. I'll assume you have a local installation and can reach Fauxton on http://localhost:5984/_utils. Initially, you will see a database overview (Figure 1).
The two internal databases _replicator and _users are already in place. As shown in Figure 1, the Databases choice in the sidebar shows all existing databases and their details. You can also adjust the security settings here for each database and delete databases. Once a database has been created, it cannot be renamed or automatically emptied. The installation type can be found in Setup, with the choice of Configure Single Node or Configure Cluster.
The Active Tasks item takes you to all active tasks in CouchDB. Configuration is where you can adjust some of the settings and add extra, non-standard entries. Replication takes you to a list of current and past replications, where you can also create a new replication. Selecting News lets you integrate news from a blog, Documentation contains links to various documentation sources, Verify lets you check the installation, and Your Account is where you manage the current admin account or set up new admin accounts.
Storing Messages
The question now is how the status messages get into CouchDB, for which no special CouchDB format or protocol exists. All actions rely on the HTTP REST standard. Whether browser, Python, curl, Postman, or Lisp – anything that speaks HTTP REST can be used. Of course, many languages have helpers and wrappers that translate more complex tasks into the HTTP calls, but – at the core – all actions are GET, PUT, POST, or DELETE calls on port 5984 by default. After the install, you can change the port if needed.
The hierarchy of a CouchDB installation is relatively flat, starting with databases, and internally each database stores JSON documents as logical and physical units. A document can contain additional data and binary non-JSON formats as attachments.
The names of everything that is important for the CouchDB system itself start with an underscore – be it the system databases _users and _replicator, documents (e.g., _design), or document fields (e.g., _id and _rev). Users cannot create databases, documents, or fields with a leading underscore unless they belong to the CouchDB system.
A username and admin password were already entered during the installation. For the sake of simplicity, this account is also used in Fauxton and for creating databases with curl or Python. If you have a username/password combination of admin/admin, the header used in the calls is
Authorization: Basic YWRtaW46YWRtaW4=
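The value after Basic is simply the Base64-encoded string username:password. A short sketch shows how to compute it for other credentials; the function name is hypothetical.

```python
import base64


def basic_auth_header(username, password):
    """Return the value of the Authorization header for HTTP Basic auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")
```

Calling basic_auth_header("admin", "admin") reproduces the value shown above. Keep in mind that Base64 is an encoding, not encryption, so the header should only travel over trusted or TLS-protected connections.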
Of course, you will want to create different users and roles in production operation. Just for the sake of completeness, it should be mentioned that you can automatically check (validate_doc_update function) or change (update function) the data when saving documents.
Databases
Starting with the Database view in Fauxton, create a database named messages by clicking the Create Database button at top. The Non-partitioned setting is fine here. Very large databases can be partitioned if you create them such that queries are only ever made against a specific subset of the data. After creating a database in Fauxton, you are taken directly to the Database view. The command-line alternatives to the actions described for Fauxton are shown in Listing 2. The response from CouchDB is a short {"ok":true} if successful or an error message if not:
{"error":"file_exists","reason":"The database could not be created, the file already exists."}
Listing 2
Command-Line Communication
### Create a database
$ curl -X PUT localhost:5984/messages -H "Authorization: Basic YWRtaW46YWRtaW4="
### Delete a database
$ curl -X DELETE localhost:5984/messages -H "Authorization: Basic YWRtaW46YWRtaW4="
### Show details of a database
$ curl -X GET localhost:5984/messages
### New document with PUT
$ curl -X PUT http://localhost:5984/messages/first_message -d '{"message":"Hello"}'
### New document with POST
$ curl -X POST http://localhost:5984/messages -d '{"_id":"second_message","message":"World"}' -H "Content-type: application/json"
### Retrieve document with GET
$ curl -X GET http://localhost:5984/messages/first_message
### Retrieve all documents in a database
$ curl -X GET http://localhost:5984/messages/_all_docs
### Retrieve selected documents including content (URL quoted because of the "?")
$ curl -X POST "http://localhost:5984/messages/_all_docs?include_docs=true" -d '{"keys":["first_message","second_message"]}' -H "Content-type: application/json"
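The two ways of storing a document in Listing 2 can be sketched in Python as follows. The example uses only the standard library; the base address, the database name, and the function names are assumptions. Like the curl calls without an Authorization header, it presumes a database that accepts unauthenticated requests; otherwise, the header has to be added to the headers dictionary.

```python
import json
import urllib.parse
import urllib.request

COUCHDB_URL = "http://localhost:5984"  # assumption: local default install
DATABASE = "messages"


def build_store_request(doc, doc_id=None):
    """Build a PUT request (ID in the URL) or a POST request (no ID in the URL)."""
    body = json.dumps(doc).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if doc_id is None:
        # POST to the database; CouchDB uses "_id" from the body or assigns one
        url, method = f"{COUCHDB_URL}/{DATABASE}", "POST"
    else:
        # PUT to the document URL; the ID is escaped for use in a path
        safe_id = urllib.parse.quote(doc_id, safe="")
        url, method = f"{COUCHDB_URL}/{DATABASE}/{safe_id}", "PUT"
    return urllib.request.Request(url, data=body, headers=headers, method=method)


def send(request):
    """Send a prepared request and return CouchDB's JSON answer."""
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.loads(response.read())
```

With a running instance, send(build_store_request({"message": "Hello"}, "first_message")) corresponds to the PUT call in Listing 2. Building the request separately from sending it makes it easy to inspect the URL, method, and body before anything goes over the wire.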
Communication with CouchDB is most likely going to be through an application. In the basic structure, a program must send GET, PUT, POST, and DELETE requests and populate the URL, the authentication headers, and the data to be transferred. Everything else is controlled by the content of the requests. Listing 3 shows a simple example in Python: a basic program that creates a database and outputs the result of this operation. The database is created on the first run; a second run would provoke an error message.
Listing 3
Creating a Database in Python
import json
import urllib3

http = urllib3.PoolManager()
COUCHDB_URL = "http://localhost:5984"
AUTH = 'Basic YWRtaW46YWRtaW4='  # admin/admin
HEADERS = {'Content-type': 'application/json', 'Authorization': AUTH}

def create_database(database_name):
    # PUT /<database_name> creates the database; CouchDB answers in JSON
    url = f'{COUCHDB_URL}/{database_name}'
    result = http.request('PUT', url, headers=HEADERS)
    return json.loads(result.data)

print(create_database("messages"))
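If a second run should not count as a failure, the program can treat the file_exists answer as success. The following variant is a sketch under a few assumptions: It uses the standard library's urllib.request instead of urllib3, the admin/admin credentials from the example above, and hypothetical function names. Because urllib.request raises an exception for HTTP error status codes, the error body is read from the exception.

```python
import json
import urllib.error
import urllib.request

COUCHDB_URL = "http://localhost:5984"  # assumption: local default install
HEADERS = {"Authorization": "Basic YWRtaW46YWRtaW4="}  # admin/admin


def already_exists(answer):
    """True if a CouchDB answer reports that the database already exists."""
    return answer.get("error") == "file_exists"


def ensure_database(database_name):
    """Create the database; an existing database also counts as success."""
    request = urllib.request.Request(
        f"{COUCHDB_URL}/{database_name}", headers=HEADERS, method="PUT")
    try:
        with urllib.request.urlopen(request, timeout=5) as response:
            answer = json.loads(response.read())
    except urllib.error.HTTPError as err:
        # CouchDB sends its JSON error document in the body of the response
        answer = json.loads(err.read())
    return bool(answer.get("ok")) or already_exists(answer)
```

Any other error (e.g., missing permissions) still makes ensure_database() return False, so the caller can react to it.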
The CouchDB security strategy stipulates that access to a database must be governed by means of users or roles. In the Fauxton database overview, you can set the access rights for each database in the Actions column (Figure 2). Creating a database automatically defines the logged-in user (admin in this example) as the admin. If you remove all users and roles under the Permissions item of a database, you end up with a public database anyone can read and write without authentication, which is fine for an initial test in a local environment. Authentication headers are no longer required for the requests, which certainly saves some typing when you are trying out curl at the command line.
A database can be deleted just as easily as it was created by sending a DELETE instead of a PUT request (Listing 2), but be careful: The delete action happens immediately, and all data in the database is lost. In Fauxton, you can delete a database in the Actions column of the database overview. Fauxton prompts you before deleting, just to be on the safe side.
A database in a CouchDB installation encapsulates data and queries. With one exception, CouchDB has no real joins, so you do not need to consider whether documents in messages can be linked to documents in another database. Within a database, each document has a unique ID, but the same ID can also occur in other databases. Once created, some database information and statistics can be retrieved by GET requests. Please remember, authentication is not necessary if no users and no roles are defined for this database.
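As a brief sketch of such a query, the following Python code retrieves the database information and condenses it into one line. It assumes a local default installation, a database without users and roles, and the db_name, doc_count, and doc_del_count fields in the response; the function names are hypothetical.

```python
import json
import urllib.request

COUCHDB_URL = "http://localhost:5984"  # assumption: local default install


def fetch_db_info(database_name):
    """GET /<database_name> returns information and statistics as JSON."""
    url = f"{COUCHDB_URL}/{database_name}"
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.loads(response.read())


def summarize_db_info(info):
    """Condense the database information into a single line."""
    name = info.get("db_name", "?")
    docs = info.get("doc_count", 0)
    deleted = info.get("doc_del_count", 0)
    return f"{name}: {docs} documents, {deleted} deleted"
```

With a running instance, print(summarize_db_info(fetch_db_info("messages"))) shows how many documents the messages database currently holds.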