CouchDB: Views – The Advantages

This is part 3 in a series of posts that describe our investigation into CouchDB as a solution to several database related performance issues facing the TextMe application.

<< Part 2: Databases and Documents Part 4: Views – The Challenges >>

Views are what attracted us to CouchDB. If you’ve been reading the posts in this series, you already know that CouchDB is a document oriented database, and that documents themselves don’t have any official structure beyond the structure enforced by JSON. Views are what provide the necessary structure so that you can run queries on your data.

CouchDB has several strong points, including its efficient B-Tree data store implementation and replication/synchronization support. These strong points already set it apart from other, more traditional databases. However, we came for the views, because we saw views as the potential answer to our database performance woes.

CouchDB builds views using a map/reduce algorithm. When building a view, CouchDB will feed all documents that are new or have changed since the last time the view was built through a map function. The map function selects the documents of interest for that particular view. Then, optionally, a reduce function is run to calculate some aggregate statistics on the documents that have been selected (counts, sums, etc). There are several places you can go on the web for more information about how CouchDB views work.

A large part of the performance issues we are trying to address are being caused by repeatedly running the same database queries against very large tables, where the vast majority of the data in those tables has not changed since it was inserted. The last part of that statement is very important. The data has not changed since it was inserted, and due to the nature of these tables, it probably never will. It was very wasteful for us to keep running the same calculations on that old data.

This is where CouchDB views come in. When CouchDB generates a view, it stores the result of the view on disk in a B-Tree data structure, which is very efficient to access. CouchDB will only re-generate that view when documents that match the criteria specified in the map function are changed or added. And, CouchDB will only need to update the view for the changed/added documents. It will incrementally update the view’s index, so it doesn’t have to start from scratch every time. This makes views especially ideal for large data sets.

Using views, we can replace all of the queries we were performing on these tables, and the calculations would be performed once, and then stored. Accessing that data would be as simple as issuing a single HTTP request, which would efficiently pull the data from the view’s B-Tree. In other words, it would be fast, and very efficient.

CouchDB views are also very flexible. The output of the map function is a key/value pair. That key/value pair can be anything…data from the document, hard coded values, whatever. This flexibility allows you to create complex keys, such as a JSON array of values. Using the view API, you can specify ranges of keys when executing your query, fetching only the data that you want. You also have the ability to group complex keys by the first n elements of the key, and run the reduce function on those groups of data. This enables you fetch aggregate data on multiple levels, and allows you to support multiple queries with a single view. For example, we need to calculate the number of SMS messages sent by a particular account by minute, hour, day, month, year, etc. Using CouchDB’s view engine, we can have our map function emit a key of [account_id, year, month, day, hour, minute] and a value of 1 for each document in our messages database. Our reduce function simply sums all of the values for a matching key, using the provided sum function. Here is how simple the map/reduce code is for this view:

Map

function(doc) {
    datetime = doc.created_at_utc;
    year = parseInt(datetime.substr(0, 4));
    month = parseInt(datetime.substr(5, 2), 10);
    day = parseInt(datetime.substr(8, 2), 10);
    hour = parseInt(datetime.substr(11, 2), 10);
    minute = parseInt(datetime.substr(14, 2), 10);
    emit([doc.account_id, year, month, day, hour, minute], 1);
}

Reduce

function(keys, values, rereduce) { 
    return sum(values);
}

Using the grouping feature of the view API, we can easily fetch message counts for this account by year, month, day, hour, or minute, by simply specifying how many levels of the key we would like to group together. For example, to get a breakdown of messages sent for a particular account on each day in May of 2009, we would simply include the following parameters in our URL when accessing the view: startkey=[1,2009,5]&endkey=[1,2009,5,{}]&group_level=4. These parameters tell the view that we only want to consider messages for account number 1 that were sent or received in May of 2009, and that we’d like the reduce results grouped by the 4th parameter in the key, which is the day of the month. This would return something like:

{"rows":[
{"key":[1,2009,5,1],"value":13},
{"key":[1,2009,5,2],"value":9},
{"key":[1,2009,5,3],"value":10},
{"key":[1,2009,5,4],"value":9},
{"key":[1,2009,5,5],"value":11},
{"key":[1,2009,5,6],"value":17},
{"key":[1,2009,5,7],"value":12},
{"key":[1,2009,5,8],"value":12},
{"key":[1,2009,5,9],"value":14},
{"key":[1,2009,5,10],"value":8},
{"key":[1,2009,5,11],"value":12},
{"key":[1,2009,5,12],"value":11},
{"key":[1,2009,5,13],"value":9},
{"key":[1,2009,5,14],"value":20},
{"key":[1,2009,5,15],"value":7},
{"key":[1,2009,5,16],"value":15},
{"key":[1,2009,5,17],"value":8},
{"key":[1,2009,5,18],"value":8},
{"key":[1,2009,5,19],"value":13},
{"key":[1,2009,5,20],"value":7},
{"key":[1,2009,5,21],"value":12},
{"key":[1,2009,5,22],"value":28},
{"key":[1,2009,5,23],"value":8},
{"key":[1,2009,5,24],"value":4},
{"key":[1,2009,5,25],"value":2},
{"key":[1,2009,5,26],"value":16},
{"key":[1,2009,5,27],"value":15},
{"key":[1,2009,5,28],"value":12},
{"key":[1,2009,5,29],"value":7},
{"key":[1,2009,5,30],"value":5},
{"key":[1,2009,5,31],"value":6}
]}

Views are re-built when they are accessed, and not when new documents are added to the database or existing documents are changed. However, you do have control over when views are built. If you specify stale=ok when accessing your view, CouchDB will not check to see if the view needs to be re-built. It will simply return results from the last time the view was built. We took advantage of this feature when writing the application code to access the views. In our case, data is only added to the database once a day, and it is added by a background job. When the job is finished inserting data into the CouchDB database, it triggers the views to re-build themselves by accessing all of the views in the database (a few at a time), without specifying the stale=ok flag. Since this background job takes on the responsibility of updating the views after it inserts new data, the rest of our application can always specify stale=ok when accessing the views. This keeps the queries executed by the application fast, even when views are in the process of being re-built.

Views are powerful, and offer a tremendous amount of flexibility. However, they come with their own set of challenges. In the next post, I will talk about some of the challenges we faced when attempting to replace our SQL queries against a MySQL database with CouchDB views.

Be Sociable, Share!

    3 thoughts on “CouchDB: Views – The Advantages

    1. Pingback: Linktipps #5 :: Blackflash

    2. Do you still need to use “stale=ok” even though data is only added once a day? Or you probably add other data throughout the day.

    3. @anon

      Yes. I still need to use “stale=ok” even though I am only adding data once a day. This is because it takes a while to add the data to CouchDB via that batch job, and I don’t want CouchDB rebuilding the index multiple times during the process data import process. This could happen if somebody tried to access one of the views while I was adding data to CouchDB.

    Leave a Reply

    Your email address will not be published. Required fields are marked *

    You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>