Manage status messages in CouchDB with MapReduce
On the Couch
Documents
To save a document, you can use either a PUT request with the ID as part of the URL or a POST request to the desired database. A POST request with a new document can contain the _id field. If the field is missing, for example because the ID of the document is irrelevant, CouchDB assigns a universally unique ID (UUID) to the new document. POST requests to CouchDB must carry a Content-Type: application/json header. After successfully storing a document, CouchDB returns the ID and the initial revision number of the new document; in the event of an error, the database returns a corresponding error message instead. A successful response looks like this:
{"ok":true,"id":"second_message","rev":"1-183d19dc77574e297dc791f3723caf41"}
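As a minimal sketch of the two save options described above, the following JavaScript helper chooses PUT when the document carries an _id and POST otherwise. The helper name buildSaveRequest, the base URL http://localhost:5984, and the document fields are assumptions for illustration; the function only assembles the request and does not send it.

```javascript
// Sketch only: assembles the HTTP request for saving a document.
// buildSaveRequest, the base URL, and the document fields are illustrative assumptions.
function buildSaveRequest(baseUrl, db, doc) {
  const headers = { 'Content-Type': 'application/json' };
  if (doc._id) {
    // ID known: PUT to /<db>/<id>
    return {
      method: 'PUT',
      url: `${baseUrl}/${encodeURIComponent(db)}/${encodeURIComponent(doc._id)}`,
      headers,
      body: JSON.stringify(doc),
    };
  }
  // No ID given: POST to /<db>; CouchDB assigns a UUID
  return {
    method: 'POST',
    url: `${baseUrl}/${encodeURIComponent(db)}`,
    headers,
    body: JSON.stringify(doc),
  };
}

const withId = buildSaveRequest('http://localhost:5984', 'messages',
  { _id: 'second_message', type: 'alert', text: 'Disk almost full' });
const withoutId = buildSaveRequest('http://localhost:5984', 'messages',
  { type: 'info', text: 'Backup finished' });

console.log(withId.method, withId.url);
console.log(withoutId.method, withoutId.url);
```

The resulting object could then be handed to an HTTP client of your choice, for example fetch(req.url, { method: req.method, headers: req.headers, body: req.body }); credentials are omitted here.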
Even in a program, saving a document with PUT or POST does not pose any major challenges. If you don't feel like using the command line or writing code, you can use Fauxton to create a document with valid JSON by clicking Create Document in the desired database. CouchDB then suggests a UUID as the _id; you can change this value before saving, if so desired.
Now is a good time to store the list of status messages from Listing 1 in the messages database with one of the three options (Figure 3). One more thought regarding the IDs you choose for the documents: If sorting by time or other ascending values is an important criterion, an ID with a leading timestamp is recommended. Without programming a query, you can then limit the time range with CouchDB's built-in resources by adding the startkey and endkey options to the built-in _all_docs query. (I will return to this later.)
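To illustrate why a leading timestamp helps, the following sketch builds IDs from an ISO 8601 UTC timestamp followed by a source name. The helper makeTimestampId, the "<timestamp>_<source>" layout, and the host names are assumptions for illustration, not a format prescribed by CouchDB.

```javascript
// Sketch only: builds a document ID with a leading ISO 8601 UTC timestamp.
// makeTimestampId and the "<timestamp>_<source>" layout are illustrative assumptions.
function makeTimestampId(date, source) {
  return `${date.toISOString()}_${source}`;
}

const ids = [
  makeTimestampId(new Date(Date.UTC(2024, 0, 15, 14, 30, 0)), 'web01'),
  makeTimestampId(new Date(Date.UTC(2024, 0, 15, 9, 5, 0)), 'db02'),
  makeTimestampId(new Date(Date.UTC(2023, 11, 31, 23, 59, 59)), 'web01'),
];

// ISO 8601 timestamps in UTC sort chronologically as plain strings
const sorted = [...ids].sort();
console.log(sorted);
```

Because the string order matches the chronological order, a key range on the ID is also a time range.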
However, it is not always possible to guarantee uniform IDs, especially with many status messages arriving from different systems, which is why the format of the ID is not taken into account in this example; you will want either to use POST to store the messages without specifying an _id field or to adopt the CouchDB UUID from Fauxton after clicking Create Document.
ACID, Append Only, MVCC
The revision number, which CouchDB automatically inserts into the document when it is first saved, has a central function in the CouchDB system. It comprises a consecutive version number and a hash of the content. Unless you specify the current revision number, you can neither change nor delete a document. Why is that?
Data is stored in CouchDB by appending newer data to older data, and the database fulfills the ACID conditions: A data record is either written completely or, in the event of an error, not written at all. Changes to a data record are appended as newer versions of the document, thus eliminating the need to lock data records during write operations. As long as the data of the old revisions has not been deleted (e.g., by compaction), you can retrieve earlier versions of a document with their old revision numbers.
To manage concurrent data changes, CouchDB uses the multiversion concurrency control (MVCC) approach, which relies on the revision number. This property plays an important role, especially when several read/write operations take place in very quick succession. If two users open and modify document A with revision number 1-c6438911bbf, the first one to save while specifying the still-valid revision number wins. The save action increments the document's revision number. The version that is saved second now carries an outdated revision number, so CouchDB reports a document update conflict error. Ultimately, the user program is responsible for conflict management.
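The revision check can be pictured with a toy in-memory model. The sketch below is not CouchDB's implementation: the class ToyStore and its simplified revision "hash" are assumptions made purely to show how the second of two concurrent saves ends up with a conflict.

```javascript
// Toy in-memory model of revision checking; NOT CouchDB's implementation.
// ToyStore and its revision "hash" are illustrative assumptions.
class ToyStore {
  constructor() {
    this.docs = new Map();
  }

  // Simplified stand-in for the content hash in a real revision
  static nextRev(prevRev, doc) {
    const n = prevRev ? parseInt(prevRev.split('-')[0], 10) + 1 : 1;
    let h = 0;
    for (const ch of JSON.stringify(doc)) {
      h = (h * 31 + ch.charCodeAt(0)) >>> 0;
    }
    return `${n}-${h.toString(16)}`;
  }

  save(doc) {
    const current = this.docs.get(doc._id);
    // An update must name the current revision, otherwise it is rejected
    if (current && current._rev !== doc._rev) {
      return { error: 'conflict', reason: 'Document update conflict.' };
    }
    const { _rev, ...content } = doc;
    const rev = ToyStore.nextRev(current ? current._rev : null, content);
    this.docs.set(doc._id, { ...content, _rev: rev });
    return { ok: true, id: doc._id, rev };
  }

  get(id) {
    return { ...this.docs.get(id) };
  }
}

const store = new ToyStore();
store.save({ _id: 'A', text: 'initial' });

// Two users read the same revision ...
const userOne = store.get('A');
const userTwo = store.get('A');

// ... the first save wins, the second carries an outdated revision
const first = store.save({ ...userOne, text: 'edited by user one' });
const second = store.save({ ...userTwo, text: 'edited by user two' });

console.log(first);
console.log(second);
```

A client that receives the conflict would typically fetch the current version again, merge its changes, and retry the save with the new revision number.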
The situation is different with CouchDB replications. If the same document is changed in a cluster and replication is delayed (e.g., because of faults), the database system decides which version wins and keeps the second document as the previous version.
MapReduce and Views
Without queries of your own, CouchDB is initially just a key-value store with the option of delimiting the document set on the basis of IDs. Individual documents are read with a GET request that specifies the database name and the document ID. To retrieve multiple documents, the built-in _all_docs query is used at the database level.
This query already anticipates the opportunities that MapReduce offers. The _all_docs function returns a list of all documents in a database, including the current revision numbers. What is missing, though, is the content of the documents. If you add the include_docs=true parameter to the URL of the query, CouchDB also outputs the document data in the doc entry of each row.
To narrow down the list of desired documents, you can use the _all_docs query with startkey=<xxx> and endkey=<yyyy> to define the range of keys within which you want to receive the documents. However, these queries only make sense if the document IDs are structured such that they can be easily narrowed down (e.g., by specifying the time).
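A range query of this kind could be assembled as follows. This is a sketch under assumptions: the helper allDocsUrl, the base URL http://localhost:5984, and the date prefix used as a key are hypothetical, whereas startkey, endkey, and include_docs are the parameters named in the text.

```javascript
// Sketch only: builds an _all_docs URL with a key range.
// allDocsUrl, the base URL, and the sample ID prefix are illustrative assumptions.
function allDocsUrl(baseUrl, db, { startkey, endkey, includeDocs = false } = {}) {
  const params = new URLSearchParams();
  // Keys are passed as JSON values, so strings keep their double quotes
  if (startkey !== undefined) params.set('startkey', JSON.stringify(startkey));
  if (endkey !== undefined) params.set('endkey', JSON.stringify(endkey));
  if (includeDocs) params.set('include_docs', 'true');
  const query = params.toString();
  return `${baseUrl}/${encodeURIComponent(db)}/_all_docs${query ? `?${query}` : ''}`;
}

// All documents whose ID starts with 2024-01-15; \ufff0 is a high code point
// commonly used as an upper bound for prefix ranges
const rangeUrl = allDocsUrl('http://localhost:5984', 'messages', {
  startkey: '2024-01-15',
  endkey: '2024-01-15\ufff0',
  includeDocs: true,
});
console.log(rangeUrl);
```

Serializing the keys with JSON.stringify matters because CouchDB expects key parameters as JSON, so a string key must arrive with its quotation marks.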
If you want the database to load a few very specific documents, send a POST request to _all_docs with the "keys":[<key1>,<key2>,...] attribute in the JSON body. Again, the document content is only returned if you stipulate include_docs=true. However, the capabilities of CouchDB are by no means limited to saving documents and retrieving them again by specifying the ID.
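The request for specific documents might be put together as in the following sketch. The helper allDocsByKeysRequest, the base URL, and the two document IDs are assumptions for illustration; the function only builds the request and does not send it.

```javascript
// Sketch only: assembles a POST request to _all_docs for specific document IDs.
// allDocsByKeysRequest, the base URL, and the IDs are illustrative assumptions.
function allDocsByKeysRequest(baseUrl, db, keys, includeDocs = false) {
  const query = includeDocs ? '?include_docs=true' : '';
  return {
    method: 'POST',
    url: `${baseUrl}/${encodeURIComponent(db)}/_all_docs${query}`,
    headers: { 'Content-Type': 'application/json' },
    // The list of wanted IDs travels in the JSON body
    body: JSON.stringify({ keys }),
  };
}

const keysRequest = allDocsByKeysRequest('http://localhost:5984', 'messages',
  ['first_message', 'second_message'], true);
console.log(keysRequest.url);
console.log(keysRequest.body);
```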
For example, perhaps you only want to see status messages that include an alert, or all status messages of a certain type. How can you implement this if you can only search by the ID of a status message and these IDs are random UUIDs? CouchDB is not a plain-vanilla key-value store; it can use MapReduce indexes to support queries by criteria other than the ID. The standard approach is to program your own map and reduce functions in JavaScript. Other languages can be used, as well, by changing the query server setting of the instance. Built-in reduce functions such as _count or _sum are implemented natively in Erlang and are therefore very efficient during indexing.
An important point with MapReduce is that you have no access to other documents during indexing. Each document is considered separately. However, you can embed a referenced document in the result at query time with include_docs=true. The map function always expects a complete document as an input parameter. Depending on the desired index, the emit(key,value) command writes an entry to the index:
function (doc) { emit(doc.timestamp, 1); }
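To see what the map step produces, you can emulate it locally. The sketch below is a toy model, not how CouchDB's query server works internally: the helper runMap and the sample documents (including the format of their timestamp field) are assumptions for illustration.

```javascript
// Toy emulation of the map step; CouchDB's query server works differently internally.
// runMap and the sample documents are illustrative assumptions.
function runMap(mapFn, docs) {
  const rows = [];
  for (const doc of docs) {
    // Each document is mapped in isolation; emit() adds one index row
    const emit = (key, value) => rows.push({ id: doc._id, key, value });
    mapFn(doc, emit);
  }
  // The index is kept sorted by key
  return rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

const docs = [
  { _id: 'c', timestamp: '2024-01-15T14:30:00Z', type: 'alert' },
  { _id: 'a', timestamp: '2024-01-15T09:05:00Z', type: 'info' },
  { _id: 'b', timestamp: '2024-01-15T09:40:00Z', type: 'info' },
];

// Same logic as the map function above; emit is passed in explicitly here,
// whereas CouchDB provides it as a global
const rows = runMap((doc, emit) => { emit(doc.timestamp, 1); }, docs);
console.log(rows);
```

Each row carries the ID of the document that produced it, which is how a view result can lead back to the full document.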
The value can be freely selected and can be null. Note that if you want to group the list later with a reduce function, you will need a value that can be totaled or counted. Later, you will want statistics on the number of messages per hour, so first add a 1 as the value for each key. The reduce step is optional; it is necessary if you want to, say, total the results of the list or group the query. To help you get started, one of the natively integrated reduce functions (e.g., _sum or _count) is a good choice. A MapReduce function that writes an index and optionally aggregates it is known as a view in the CouchDB world.
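In CouchDB, views live in design documents. The following sketch shows what such a document could look like, together with a toy count per key that mimics a grouped query. The names _design/messages and by_hour, the assumption that timestamp is an ISO 8601 string, and the sample data are all hypothetical.

```javascript
// Sketch only: a design document holding one view, plus a toy count per hour.
// The names _design/messages and by_hour and the sample data are illustrative assumptions.
const designDoc = {
  _id: '_design/messages',
  views: {
    by_hour: {
      // Map functions are stored as strings in the design document;
      // the first 13 characters of an ISO 8601 timestamp identify the hour
      map: 'function (doc) { emit(doc.timestamp.slice(0, 13), 1); }',
      // Built-in reduce function, referenced by name
      reduce: '_count',
    },
  },
};

// Toy grouping that mimics what a grouped query on such a view returns
function countByKey(rows) {
  const counts = new Map();
  for (const { key } of rows) {
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()]
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([key, value]) => ({ key, value }));
}

const timestamps = [
  '2024-01-15T09:05:00Z',
  '2024-01-15T09:40:00Z',
  '2024-01-15T14:30:00Z',
];
const hourRows = timestamps.map((t) => ({ key: t.slice(0, 13), value: 1 }));
const perHour = countByKey(hourRows);
console.log(perHour);
```

Once a design document like this has been saved in the messages database, the view would be queried with a GET request to /messages/_design/messages/_view/by_hour?group=true.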