Manage status messages in CouchDB with MapReduce

On the Couch

Documents

To save a document, you can use either a PUT request with the ID as part of the URL or a POST request to the desired database. A POST request with a new document can contain the _id field. If the field is missing, for example because the ID of the document is irrelevant, CouchDB assigns a universally unique ID (UUID) to the new document. POST requests to CouchDB must carry a Content-Type: application/json header. In the event of an error, the database outputs a message to that effect. After successfully storing a document, CouchDB returns the ID and the initial revision number of the new document:

{"ok":true,"id":"second_message","rev":"1-183d19dc77574e297dc791f3723caf41"}
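As a minimal sketch of the two save options, the following JavaScript helper only builds a description of the request (PUT with the ID in the URL, or POST without an ID); it does not send anything. The function name buildSaveRequest, the address http://localhost:5984, and the sample document fields are assumptions made for illustration.

```javascript
// Hypothetical helper: describes the HTTP request for saving a document.
// With an _id, the document is PUT to /<db>/<id>; without one, it is
// POSTed to /<db>, and CouchDB assigns a UUID.
function buildSaveRequest(baseUrl, db, doc) {
  const headers = { "Content-Type": "application/json" };
  const dbUrl = `${baseUrl}/${encodeURIComponent(db)}`;
  if (doc._id) {
    return {
      method: "PUT",
      url: `${dbUrl}/${encodeURIComponent(doc._id)}`,
      headers,
      body: JSON.stringify(doc),
    };
  }
  return { method: "POST", url: dbUrl, headers, body: JSON.stringify(doc) };
}

// Assumed local CouchDB address and sample documents.
const base = "http://localhost:5984";
const withId = buildSaveRequest(base, "messages", { _id: "second_message", type: "alert" });
const withoutId = buildSaveRequest(base, "messages", { type: "info" });
console.log(withId.method, withId.url);
console.log(withoutId.method, withoutId.url);
```

In an environment that provides fetch, such a description could be passed on as fetch(req.url, req); credentials would have to be added where the server requires them.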

Even in a program, saving a document with PUT or POST does not pose any major challenges. If you don't feel like using the command line or writing code, you can use Fauxton to create a document with valid JSON by clicking Create Document in the desired database. CouchDB then suggests a UUID as the _id; you can change this value before saving, if so desired.

Now is a good time to store the list of status messages from Listing 1 in the messages database with one of the three options (Figure 3). One more thought regarding the IDs you choose for the documents: If sorting by time or other ascending values is an important criterion, an ID with a leading timestamp is recommended. Without programming a query, you can then limit the time range with CouchDB's built-in resources by adding the startkey and endkey options to the internal _all_docs query. (I will return to this later.)

Figure 3: When creating a document in Fauxton, always consider whether an ID preceded by a timestamp is a good choice.

However, it is not always possible to guarantee uniform IDs, especially with many status messages arriving from different systems, which is why the format of the ID is not taken into account in this example. You will want either to use POST to store the messages without specifying an _id field or to adopt the CouchDB UUID from Fauxton after running Create Document.
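Where uniform IDs can be guaranteed, the timestamp-first scheme mentioned earlier might look like the following sketch. The function name makeTimestampedId, the underscore separator, and the source names are assumptions, not a CouchDB convention.

```javascript
// Hypothetical ID scheme: ISO 8601 UTC timestamp first, then a source suffix.
// ISO timestamps of identical format sort as strings in chronological order.
function makeTimestampedId(date, source) {
  return `${date.toISOString()}_${source}`;
}

// Two sample messages, deliberately created out of order.
const ids = [
  makeTimestampedId(new Date(Date.UTC(2021, 2, 1, 10, 30, 0)), "web01"),
  makeTimestampedId(new Date(Date.UTC(2021, 2, 1, 9, 15, 0)), "db01"),
];
const sortedIds = [...ids].sort();
console.log(sortedIds);
```

Because the timestamp leads, sorting by ID is the same as sorting by time, which is what makes range queries on the ID useful.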

ACID, Append Only, MVCC

The revision number, which CouchDB automatically inserts into the document when it is first saved, has a central function in the CouchDB system. It comprises a consecutive version number and a hash of the content. If you don't specify the current revision number, existing documents can be neither changed nor deleted. Why is that?

CouchDB stores data by appending newer data to older data, and it fulfills the ACID conditions: A data record is either written completely or, in the event of an error, not written at all. Changes to a data record are appended as newer versions of the document, thus eliminating the need to lock data records during write operations. As long as you do not explicitly delete the data of the old revisions, you can retrieve the entire history of a document with the old revision numbers.

To manage concurrent data changes, CouchDB uses the multiversion concurrency control (MVCC) approach, which relies on the revision number. This property plays an important role, especially when several read/write operations take place in very quick succession. If two users open and modify document A with revision number 1-c6438911bbf in one program, the first user to submit the still-valid revision number when saving wins. The save action increments the document's revision number. The version that is saved second now carries an outdated revision number, which means a document update conflict error is reported. At the end of the day, the user program is responsible for conflict management.
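The sequence just described can be replayed with a small in-memory model. This is a deliberately simplified sketch and not CouchDB's implementation: the TinyStore class, the toyHash function, and the document fields are all invented for illustration, and the hash is only a stand-in for the real content hash.

```javascript
// Stand-in for the content hash; CouchDB computes its own, different hash.
function toyHash(text) {
  let h = 0;
  for (const ch of text) h = (h * 31 + ch.codePointAt(0)) >>> 0;
  return h.toString(16);
}

// Simplified model: a save succeeds only if the submitted _rev
// matches the revision currently stored for that document.
class TinyStore {
  constructor() {
    this.docs = new Map();
  }

  save(doc) {
    const current = this.docs.get(doc._id);
    const currentRev = current ? current._rev : undefined;
    if (doc._rev !== currentRev) {
      throw new Error("conflict: Document update conflict.");
    }
    // The number before the hyphen counts up with every successful save.
    const generation = currentRev ? parseInt(currentRev, 10) + 1 : 1;
    const { _rev, ...content } = doc;
    const rev = `${generation}-${toyHash(JSON.stringify(content))}`;
    this.docs.set(doc._id, { ...content, _rev: rev });
    return { ok: true, id: doc._id, rev };
  }

  get(id) {
    return { ...this.docs.get(id) };
  }
}

const store = new TinyStore();
store.save({ _id: "A", status: "new" });

// Both users read the same revision and change it independently.
const userOne = store.get("A");
const userTwo = store.get("A");
userOne.status = "acknowledged";
userTwo.status = "closed";

const first = store.save(userOne); // still-valid revision: succeeds
let conflict = null;
try {
  store.save(userTwo); // outdated revision: rejected
} catch (err) {
  conflict = err.message;
}
console.log(first.rev, conflict);
```

In this model, as in the text, the second writer has to fetch the current revision and decide how to merge or retry; the store does not resolve the conflict on its own.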

The situation is different with CouchDB replication. If the same document is changed on different nodes of a cluster and replication is delayed (e.g., because of faults), the database system itself decides which version wins and keeps the losing version as a conflicting revision instead of discarding it.

MapReduce and Views

Without custom queries, CouchDB is initially just a key-value store with the option of delimiting the document set on the basis of IDs. Individual documents are read with a GET request that specifies the database name and the document ID. To retrieve multiple documents, CouchDB's built-in _all_docs query is used at the database level.

This query already anticipates the opportunities that MapReduce offers. The _all_docs function returns a list of all documents in a database, including the current revision numbers. What is missing, though, is the content of the documents. If you add the include_docs=true parameter to the URL of the query, CouchDB also outputs the document data in the doc entry in each line.

To narrow down the list of desired documents, you can use the _all_docs query with startkey=<xxx> and endkey=<yyyy> to define the range of keys within which you want to receive the documents. However, these queries only make sense if the document IDs are structured such that they can be easily narrowed down (e.g., by specifying the time).
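A range query of this kind could be assembled as in the following sketch. The helper name allDocsRangeUrl, the server address, and the timestamp-style keys are assumptions. The key values are JSON-encoded before being placed in the URL, because CouchDB expects startkey and endkey as JSON values.

```javascript
// Hypothetical helper: builds an _all_docs URL restricted to a key range.
function allDocsRangeUrl(baseUrl, db, startkey, endkey, includeDocs = false) {
  const params = new URLSearchParams({
    startkey: JSON.stringify(startkey), // e.g. "2021-03-01T00" including quotes
    endkey: JSON.stringify(endkey),
  });
  if (includeDocs) params.set("include_docs", "true");
  return `${baseUrl}/${encodeURIComponent(db)}/_all_docs?${params}`;
}

// Assumed: document IDs start with an ISO 8601 timestamp.
const rangeUrl = allDocsRangeUrl(
  "http://localhost:5984",
  "messages",
  "2021-03-01T00",
  "2021-03-02T00",
  true
);
console.log(rangeUrl);
```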

If you want the database to load a few very specific documents, send an _all_docs POST request with a "keys":[<key1>,<key2>,...] attribute in the JSON request body. Again, the document content is only returned if you stipulate include_docs=true. However, the capabilities of CouchDB are by no means limited to saving documents schematically and retrieving them again by specifying the ID.
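The request for a handful of specific documents might be described as follows. Again, this sketch only builds the request and sends nothing; the helper name allDocsByKeysRequest, the server address, and the document IDs are assumptions.

```javascript
// Hypothetical helper: describes a POST to _all_docs that asks for
// exactly the listed document IDs.
function allDocsByKeysRequest(baseUrl, db, keys, includeDocs = false) {
  const query = includeDocs ? "?include_docs=true" : "";
  return {
    method: "POST",
    url: `${baseUrl}/${encodeURIComponent(db)}/_all_docs${query}`,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ keys }),
  };
}

const keysRequest = allDocsByKeysRequest(
  "http://localhost:5984",
  "messages",
  ["first_message", "second_message"],
  true
);
console.log(keysRequest.url, keysRequest.body);
```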

For example, perhaps you only want status messages that include an alert to appear, or just all status messages of a certain type. How can you implement this if you can only search for the ID of a status message and these IDs are also random UUIDs? CouchDB is not a plain-vanilla key-value store but can use MapReduce indexes to create queries other than by ID. The standard approach is to program your own map and reduce methods in JavaScript. Other languages can be used, as well, by changing the query server setting of the instance. Internal reduce functions such as _count or _sum are implemented natively in Erlang and are therefore very fast during indexing.

An important point with MapReduce is that you have no access to other documents during indexing. Each document is considered separately. However, you can embed a referenced document in the result at query time with include_docs=true. The map function always expects a complete document as an input parameter. Depending on the desired index, the emit(key,value) command writes an entry to the index:

function (doc) { emit(doc.timestamp, 1); }

The value can be freely selected and can be null. Note that, if you want to group the list later with a reduce function, you will need a value that can be totaled or counted. Later, you will want statistics on the number of messages per hour, so first add a 1 as a value to each key. The reduce step is optional; it is necessary if you want to, say, total the results of the list or group the query. To help you get started, one of the natively integrated reduce functions (e.g., _sum or _count) is a good choice. A MapReduce function that writes an index and optionally aggregates it is known as a view in the CouchDB world.
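To make the interplay of map and reduce more tangible, the following sketch imitates a view locally, without a CouchDB server. It rests on several assumptions: the timestamp field is an ISO 8601 string, the key is that timestamp truncated to the hour, and the emit and runView functions are local stand-ins for what the query server and a grouped _sum reduce would do.

```javascript
// Local stand-in for the emit() that CouchDB's query server provides.
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

// Map function: key is the timestamp cut off after the hour
// (e.g. "2021-03-01T10"), value is 1 so that entries can be summed.
function mapByHour(doc) {
  emit(doc.timestamp.slice(0, 13), 1);
}

// Imitates indexing plus a grouped _sum: map every document,
// sort the rows by key, then total the values per key.
function runView(docs) {
  rows.length = 0;
  docs.forEach(mapByHour);
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  const totals = new Map();
  for (const { key, value } of rows) {
    totals.set(key, (totals.get(key) || 0) + value);
  }
  return [...totals].map(([key, value]) => ({ key, value }));
}

// Invented sample messages.
const perHour = runView([
  { _id: "a", timestamp: "2021-03-01T11:10:00Z" },
  { _id: "b", timestamp: "2021-03-01T10:05:00Z" },
  { _id: "c", timestamp: "2021-03-01T10:45:00Z" },
]);
console.log(perHour);
```

Note that mapByHour looks only at the single document it receives, which mirrors the restriction that other documents are not accessible during indexing.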
