Manage status messages in CouchDB with MapReduce
On the Couch
Design Documents
Queries and other functions are stored in documents in the same database as the data documents. The document ID always starts with _design/<name>, which is why it is known as a design document. Not all queries in a database have to be saved in a single design document; on the contrary, it makes sense to distribute them across different design documents for performance reasons. You can think of a design document as a container for a number of CouchDB functions (view, update, filter, validate). Right now, the MapReduce functions stored as views in the design document are of interest. To retrieve a list of all timestamps for the status messages, you need to save a design document with the views field and the map function in the messages database (Listing 4).
Listing 4
Design Document with Views
{
  "_id": "_design/queries",
  "_rev": "6-856a5c52b1a9f33e136b7f044b14a8e6",
  "language": "javascript",
  "views": {
    "by-timestamp": {
      "map": "function (doc) {\n if (doc.timestamp) {\n emit(doc.timestamp, null);\n }\n}"
    },
    "by-hours": {
      "map": "function (doc) {\n if (doc.timestamp) {\n emit(doc.timestamp.substr(8,2), null);\n }\n}"
    }
  }
}

# Result of a query /messages/_design/queries/_view/by-timestamp
{"total_rows":3,"offset":0,"rows":[
{"id":"386270ebaf851375fc95465741019679","key":"202405141201","value":1},
{"id":"386270ebaf851375fc9546574101a657","key":"202405150200","value":1},
{"id":"386270ebaf851375fc95465741020597","key":"202405160800","value":1}
]}
Because design documents are completely normal data documents, apart from the special ID, they are saved in the same way. Whether you choose PUT/POST or Fauxton, you just add the design document to the database as usual. Note that the functions are stored as JSON strings, so you must replace the line breaks in the function source with the escape sequence \n when you save the design document.
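One way to avoid escaping the line breaks by hand is to let a JSON serializer do it. The following sketch assumes Node.js; the variable names are illustrative and not part of the article's setup.

```javascript
// Map function written as ordinary, readable JavaScript.
// It is never called here; CouchDB supplies emit() when it runs the view.
const byTimestamp = function (doc) {
  if (doc.timestamp) {
    emit(doc.timestamp, null);
  }
};

// Design document as a plain object; toString() yields the function source,
// including its real line breaks.
const designDoc = {
  _id: "_design/queries",
  language: "javascript",
  views: {
    "by-timestamp": { map: byTimestamp.toString() },
  },
};

// JSON.stringify() turns every line break inside the string into \n.
const body = JSON.stringify(designDoc);
console.log(body);
```

The resulting string could then be sent as the body of the PUT request. When you update an existing design document, the current _rev has to be added as well.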
After you save or modify a design document with views, CouchDB starts indexing all documents stored in this database for all of the design document's views. Depending on the number of documents, this process can take some time. You can monitor the progress in Fauxton under Active Tasks or by sending a GET request (with authentication) to http://localhost:5984/_active_tasks. If you add or modify a data document later on, the map function is automatically called for this one document, and the call updates the existing index. The result of a query can be viewed directly in Fauxton (Figure 4).
The results are a little more compact if you use a curl query or simply point the browser at http://localhost:5984/messages/_design/queries/_view/by-timestamp. In the query results, the desired index (timestamp) is now the key, sorted in ascending order. If the index contains documents, each row of the view provides the ID of the document and the key and value from the emit statement.
Of course, you do not want to receive the entire list as a result. Because the keys are stored in an internal B+ tree, CouchDB can quickly serve the desired subset queries. The startkey="20240515" parameter shows all status messages from May 15, 2024, onward when you call the view, whereas the endkey="20240516" parameter limits the view range to messages before May 16 of that year.
A time range can be limited by combining startkey and endkey in one call. Bear in mind that the end key is part of the result for a standard query, but only as an exact match: Keys are compared as complete strings, so a full timestamp such as 202405160800 sorts after 20240516. Accordingly, a query of startkey="20240516"&endkey="20240516" returns no results, because the search starts at 20240516 and only returns keys up to exactly 20240516.
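The range behavior can be tried out without a server. The sketch below only imitates the comparison with plain JavaScript string ordering, which matches CouchDB's collation for these digit-only keys; the range helper is hypothetical. The last call uses a high Unicode character as a suffix, a common convention for matching every key with a given prefix.

```javascript
// Keys as they appear in the by-timestamp view (taken from Listing 4).
const keys = ["202405141201", "202405150200", "202405160800"];

// Hypothetical helper: keeps the keys between startkey and endkey,
// with the end key included, as in a standard query.
function range(sortedKeys, startkey, endkey) {
  return sortedKeys.filter(
    (key) =>
      (startkey === undefined || key >= startkey) &&
      (endkey === undefined || key <= endkey)
  );
}

const fromMay15 = range(keys, "20240515", undefined);
const beforeMay16 = range(keys, undefined, "20240516");
const sameStartAndEnd = range(keys, "20240516", "20240516");
const wholeMay16 = range(keys, "20240516", "20240516\ufff0");

console.log(fromMay15, beforeMay16, sameStartAndEnd, wholeMay16);
```

Only the last variant returns the message from May 16, because its end key sorts after every timestamp that begins with 20240516.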
The standard query returns a list of IDs and keys, but not the contents of the documents. Either the user program can download the documents individually from the list of IDs, or you need to specify the include_docs=true parameter in the query. As with _all_docs, the data document can then be found in the doc field of each row in the output.
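Because key parameters have to be passed as JSON values (hence the double quotes around the timestamps), it can help to build the query string programmatically. The following is a minimal sketch with a hypothetical viewUrl helper; it JSON-encodes every value, which is suitable for the parameters used in this article but not necessarily for every CouchDB option.

```javascript
// Hypothetical helper that builds a view URL. Every parameter value is
// JSON-encoded, so strings get the double quotes that CouchDB expects.
function viewUrl(base, db, designDoc, view, params = {}) {
  const query = new URLSearchParams();
  for (const [name, value] of Object.entries(params)) {
    query.set(name, JSON.stringify(value));
  }
  const path = `${base}/${db}/_design/${designDoc}/_view/${view}`;
  const qs = query.toString();
  return qs ? `${path}?${qs}` : path;
}

const url = viewUrl("http://localhost:5984", "messages", "queries", "by-timestamp", {
  startkey: "20240515",
  endkey: "20240516",
  include_docs: true,
});
console.log(url);
```

URLSearchParams also takes care of percent-encoding the quotes, so the URL can be passed to curl or a browser unchanged.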
The example also reveals the limits of queries with mapping. This query does not reveal which messages occurred in the morning or at night. Subset searches only work from front to back. To search for the time, you need a second view that only outputs the hour (Listing 5).
Listing 5
Views by Timestamp and Hour
{
  "_id": "_design/queries",
  "_rev": "6-856a5c52b1a9f33e136b7f044b14a8e6",
  "language": "javascript",
  "views": {
    "by-timestamp": {
      "map": "function (doc) { if (doc.timestamp) { emit(doc.timestamp, null); } }"
    },
    "by-hour": {
      "map": "function (doc) { if (doc.timestamp) { emit(doc.timestamp.substr(8,2), null); } }"
    },
    "by-hour-count": {
      "map": "function (doc) { if (doc.timestamp) { emit(doc.timestamp.substr(8,2), 1); } }",
      "reduce": "_count"
    },
    "by-source-type-count": {
      "reduce": "_count",
      "map": "function (doc) { if (doc.source && doc.message.type && doc.message.value) { emit([doc.source, doc.message.type,doc.message.value], 1); } }"
    }
  }
}
Reduce Function
Because the topic of reduce functions is relatively extensive [2], I will only look at CouchDB's own _count function here. Once a reduce function is available in a view, the mapping results are no longer output as a list, but totaled by the reduce function.
To evaluate the number of status messages per hour, you need to modify the by-hour view slightly (see the by-hour-count view in Listing 5). A 1 is written to the index as the value for later counting and summarizing. A normal call of the view first totals all rows and outputs the sum as the result. In this case, the key field is null (Listing 6), but this result isn't exactly the one expected. The secret lies in the group=true parameter in the query, which specifies that the totals be grouped by key, and returns the desired results (Listing 7).
Listing 6
Reduce Without Grouping
$ curl -X GET localhost:5984/messages/_design/queries/_view/by-hour-count
{"rows":[
{"key":null,"value":4}
]}
Listing 7
View Grouped by Key
$ curl -X GET 'localhost:5984/messages/_design/queries/_view/by-hour-count?group=true'
{"rows":[
{"key":"02","value":1},
{"key":"08","value":1},
{"key":"12","value":2}
]}
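To see how mapping, counting, and grouping interact, the whole pipeline can be imitated in a few lines of JavaScript. This is only a sketch of the principle and not CouchDB's implementation; the count function is a stand-in for _count, and the fourth timestamp is made up so that the totals match Listings 6 and 7.

```javascript
// Sample documents; the fourth timestamp is hypothetical.
const docs = [
  { timestamp: "202405141201" },
  { timestamp: "202405150200" },
  { timestamp: "202405160800" },
  { timestamp: "202405171230" },
];

// Same logic as the by-hour-count map function, with emit() passed in.
function map(doc, emit) {
  if (doc.timestamp) {
    emit(doc.timestamp.substr(8, 2), 1);
  }
}

// Build the index: one row per emit() call, sorted by key.
const rows = [];
for (const doc of docs) {
  map(doc, (key, value) => rows.push({ key, value }));
}
rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));

// Stand-in for _count: one total without grouping, one total per key with it.
function count(sortedRows, group) {
  if (!group) {
    return [{ key: null, value: sortedRows.length }];
  }
  const result = [];
  for (const row of sortedRows) {
    const last = result[result.length - 1];
    if (last && last.key === row.key) {
      last.value += 1;
    } else {
      result.push({ key: row.key, value: 1 });
    }
  }
  return result;
}

const ungrouped = count(rows, false);
const grouped = count(rows, true);
console.log(JSON.stringify(ungrouped), JSON.stringify(grouped));
```

Grouping relies on the rows being sorted by key, which is why equal keys always sit next to each other and can be counted in a single pass.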
Keys are not limited to simple values; arrays can also be used, so that a single view can group and count the results on several levels. To do this, you need a map function that outputs an array as the key and another value (1 in this case) as the value (the by-source-type-count view in Listing 5) and a query with the appropriate grouping parameter group_level=<x> (Listing 8). A group_level of 0 does not group at all, and the reduce function in the example only returns a 4. If you set group_level=1, the first entry in the array is used for grouping and counting, whereas group_level=2 tells CouchDB to combine the first two entries and count them in groups.
Listing 8
Reduce and Group
$ curl -X GET 'localhost:5984/messages/_design/queries/_view/by-source-type-count?group_level=1'
{"rows":[
{"key":["18739949083333"],"value":1},
{"key":["gardenrobot-1"],"value":2},
{"key":["server"],"value":1}
]}
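The effect of group_level on array keys can be sketched in the same way. In the following example, the sources are taken from Listing 8, whereas the message types and values are invented for illustration; countByLevel is a hypothetical stand-in for _count combined with group_level.

```javascript
// Rows as the by-source-type-count map function might emit them, already
// in key order. Types and values in the keys are made up.
const rows = [
  { key: ["18739949083333", "sms", "received"], value: 1 },
  { key: ["gardenrobot-1", "battery", "low"], value: 1 },
  { key: ["gardenrobot-1", "status", "mowing"], value: 1 },
  { key: ["server", "disk", "full"], value: 1 },
];

// Counts rows per key prefix; the prefix length is the group level.
function countByLevel(sortedRows, groupLevel) {
  if (groupLevel === 0) {
    return [{ key: null, value: sortedRows.length }];
  }
  const result = [];
  for (const row of sortedRows) {
    const prefix = row.key.slice(0, groupLevel);
    const last = result[result.length - 1];
    if (last && JSON.stringify(last.key) === JSON.stringify(prefix)) {
      last.value += 1;
    } else {
      result.push({ key: prefix, value: 1 });
    }
  }
  return result;
}

const level0 = countByLevel(rows, 0);
const level1 = countByLevel(rows, 1);
const level2 = countByLevel(rows, 2);
console.log(JSON.stringify(level1));
```

With these sample rows, level 1 yields the same three groups as Listing 8, and level 2 splits gardenrobot-1 into one group per message type.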
Of course, this can be combined with the familiar startkey and endkey (e.g., to evaluate only the gardenrobot-1 source). Again note the logic of the end key: An end key of ["gardenrobot-1"] sorts before the longer array keys that begin with gardenrobot-1, so their rows are not included in the output. You need an end key that is greater than every key beginning with gardenrobot-1. Because you do not know what the next largest key is, the end key must be either ["gardenrobot-1x"] or ["gardenrobot-1",{}], because an empty object ranks higher than any string.
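Why an empty object works as an upper bound becomes clearer with a simplified version of the sort order. The sketch below ranks types as null, false, true, numbers, strings, arrays, and objects, and compares arrays element by element. It is an approximation only: Strings are compared by code unit here, and objects are not compared with each other. The sample keys are hypothetical.

```javascript
// Rank of each JSON type in the (simplified) collation order.
function typeRank(v) {
  if (v === null) return 0;
  if (v === false) return 1;
  if (v === true) return 2;
  if (typeof v === "number") return 3;
  if (typeof v === "string") return 4;
  if (Array.isArray(v)) return 5;
  return 6; // objects rank highest
}

// Returns a negative number, zero, or a positive number like a sort comparator.
function collate(a, b) {
  const ra = typeRank(a);
  const rb = typeRank(b);
  if (ra !== rb) return ra - rb;
  if (ra === 3) return a - b;
  if (ra === 4) return a < b ? -1 : a > b ? 1 : 0;
  if (ra === 5) {
    for (let i = 0; i < Math.min(a.length, b.length); i++) {
      const c = collate(a[i], b[i]);
      if (c !== 0) return c;
    }
    return a.length - b.length; // shorter array sorts first
  }
  return 0; // simplification: same-type null, booleans, and objects are equal
}

// Keeps the keys between startkey and endkey, end key included.
function range(keys, startkey, endkey) {
  return keys.filter((k) => collate(k, startkey) >= 0 && collate(k, endkey) <= 0);
}

const keys = [
  ["18739949083333", "sms", "received"],
  ["gardenrobot-1", "battery", "low"],
  ["gardenrobot-1", "status", "mowing"],
  ["server", "disk", "full"],
];

const tooShort = range(keys, ["gardenrobot-1"], ["gardenrobot-1"]);
const withObject = range(keys, ["gardenrobot-1"], ["gardenrobot-1", {}]);
console.log(tooShort.length, withObject.length);
```

The first range is empty because every stored key is longer than the end key, whereas the second range contains both gardenrobot-1 rows and nothing else.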
Replication
The messages database is the central point where all status messages are received. However, for performance or storage reasons, it makes sense to store the sorted status messages in different databases: one database for IoT, one for weather warnings, and another for server messages. Thanks to CouchDB's replication capabilities, a solution is easily found. To begin, create three additional databases as described at the beginning of the article: iot_messages, weather_messages, and server_messages.
Each database in a CouchDB installation can act both as a replication server and as a replication client. Two databases can also replicate each other. During replication, it does not matter whether or not the databases are in the same CouchDB instance. You can even set up a replication of a database from instance 2 to instance 3 in CouchDB instance 1.
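As a rough illustration, a replication is described by little more than a source and a target. The sketch below builds such a request body for CouchDB's _replicate endpoint; the replicationBody helper is hypothetical, both databases are assumed to live on the local instance from the earlier examples, and the request itself is only shown as a comment because it needs a running server and credentials.

```javascript
// Hypothetical helper: builds the JSON body for a replication request
// between two databases, which may live on different instances.
function replicationBody(sourceUrl, targetUrl, continuous = false) {
  return { source: sourceUrl, target: targetUrl, continuous };
}

const base = "http://localhost:5984"; // same instance for both databases
const body = replicationBody(`${base}/messages`, `${base}/iot_messages`, true);
console.log(JSON.stringify(body));

// Sending the request (not executed here):
// await fetch(`${base}/_replicate`, {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(body),
// });
```

As written, this body would copy every document from messages to iot_messages; it does not yet sort the messages by type.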
Replication takes place from the changes feed of a CouchDB database, which is where the document IDs of changed (or newly created) documents are stored. Creating, updating, or deleting a document appends the ID and revision number of the document to the changes feed. Previous entries for a document disappear from the feed so that each document ID only appears once.
For replications from Source to Target, the revision number of a changed source document must be greater than the revision number of a target document for the replication to be executed. A small example will illustrate this important point: Start a one-time replication from the database Source to the database Target (but not back). The Source database receives the new document,
{"_id":"d1","_rev":"1-1k9xyc","name":"Kurowski"}
which is now replicated in the Target database, where the identical document is created. A change then occurs in the Target database, which results in a revision number increase in Target:
{"_id":"d1","_rev":"2-7ks112l","name":"Oliver Kurowski"}
If the document is also modified in Source, no replication to Target follows, because the revision number is not higher, but the same. In this case, the data is not consistent: Two different documents have the same ID and revision number. However, when the document in Source is changed a second time, the revision number increases to 3-…, and the document can be replicated to Target.
Deletions of documents are not actually deletions, either. Instead, the database tags the document as _deleted:true and increments the revision number; the document can then no longer be retrieved. If the document is deleted on Target in the previous case, it is given a higher revision number, like a change, and replicating it again would be unsuccessful. You would need to modify the document in Source again for the revision number to continue to increment so that the document can again be replicated.