CouchDB: Views – The Challenges
This is part 4 in a series of posts that describe our investigation into CouchDB as a solution to several database related performance issues facing the TextMe application.
| << Part 3: Views - The Advantages | Part 5: Application Changes >> |
In the previous post, I wrote about the many features of CouchDB views. In this post, I will describe the challenges we faced when replacing MySQL queries with CouchDB views.
Map/Reduce
One of the largest challenges with CouchDB views is simply wrapping your brain around the map/reduce model. If you’ve spent any significant amount of time in the relational model, this can be quite a task. Do not underestimate it. Give yourself plenty of time to make this adjustment. In my opinion, setting aside one or two full weeks to read about and experiment with views would not be excessive. It really is a whole new world. After several weeks, I’m still not 100% sure how to use the map/reduce model to its fullest potential. In fact, there were a few queries that I could not figure out how to implement as views. Because of that, I had to keep around an archive table in MySQL, and I run those few queries against that table.
Javascript
If you don’t know Javascript, you may want to tack on another week to the learning curve. Javascript is an incredibly powerful language. However, in its raw form it is fairly basic and can take some getting used to. It does help that you don’t need to write too much Javascript to implement most map and reduce functions. However, if you’re new to Javascript, get ready to do some research.
There are view servers available in other programming languages, and it is easy to configure CouchDB to use them. But, CouchDB is still young and under heavy development, and these alternative view servers are not supported by the CouchDB team. So, use them at your own risk.
SQL
As I mentioned in the previous post, views are powerful and flexible. But, views are not nearly as flexible as SQL. SQL has been in development for decades. Even today, it continues to evolve as a language. You can do a ton with SQL. As of right now, views simply cannot rival this flexibility. The CouchDB team continues to add built-in Javascript functions to help write map/reduce code, and there is even talk about supporting map/reduce/merge. But as of right now, the feature sets are not even close. It is very difficult for any new technology to enter the game with the same, or even a comparable, feature set to such a battle-hardened veteran. And to be honest, I highly doubt that the CouchDB team is even trying to match SQL’s feature set. After all, CouchDB is not meant to be a replacement for the relational database. However, this is an important point to consider if you think your current relational-database-backed application might be a good fit for CouchDB.
Multiple views, one design document
Views live in documents called “design documents”. Views within the same design document share a B-Tree data structure. This means that when one view in the design document is built, they all are built. So, careful planning is required to make sure unrelated views do not live in the same design document. You would not want the re-building of one view to delay the accessibility of another, totally unrelated view.
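To make the planning point concrete, here is a rough sketch of how unrelated views might be split across two design documents so that building one group does not hold up the other. The design document names, view names, and document fields below are made up for illustration and are not the ones we actually use.

```javascript
// Hypothetical design documents: every name and map function here is
// illustrative only. Each design document gets its own index, so the
// views in one are built independently of the views in the other.
var messageStats = {
  _id: "_design/message_stats",
  language: "javascript",
  views: {
    count_by_account: {
      map: "function(doc) { emit(doc.account_id, 1); }",
      reduce: "function(keys, values, rereduce) { return sum(values); }"
    }
  }
};

var billingReports = {
  _id: "_design/billing_reports",
  language: "javascript",
  views: {
    by_invoice_date: {
      map: "function(doc) { if (doc.invoice_date) { emit(doc.invoice_date, null); } }"
    }
  }
};

var designDocs = [messageStats, billingReports];
console.log(designDocs.map(function (d) { return d._id; }).join(", "));
```

Views that are always queried together and change together can reasonably share a design document. It is the unrelated ones that are worth separating.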
Building/Indexing views
Views can take a L…O…N…G time to build from scratch. The view building process is also very resource intensive. This becomes less of an issue once the views have been built, as views are updated incrementally. It really only comes into play when you are adding many, many documents to a CouchDB database between view builds. However, one place where this is an issue is ad-hoc queries. Every week or two, I’ll get a request from a customer for data that is not available via our web application. While we’ll throw that request onto the product backlog so it is eventually available via the application, it doesn’t change the fact that our customer needs that information now. We usually satisfy such requests by firing up the MySQL client, and running one or more ad-hoc queries. This simply is not feasible using CouchDB views, especially if you are working with a large database containing millions of documents. CouchDB does support “temporary views”, which are ad-hoc views that you can build and execute on the fly. However, temporary views are not recommended for production use, because they need to be built before they can get you the data you need. This could take hours, or days, depending on the size of your database and the processing power of your database server. Temporary views are meant as a way to test new views in development which will eventually be saved into a design document, and not for running ad-hoc queries.
View sizes on disk
I’ve already mentioned that each design document is stored in its own B-Tree, completely separate from the B-Tree that holds the documents in the database. These data structures can become quite large, especially if you have a ton of documents in your database. A large database combined with several design documents can take up quite a bit of disk space. For example, our main messaging database consisting of around 30 million documents is 20GB on disk. The 8 design documents for that database combine for a total size of 35GB. This brings the grand total, for the documents and the views, to 55GB. That’s a whole lot of disk space. CouchDB sacrifices disk space for performance, which is a good tradeoff, as disk space is cheap and virtually limitless these days. But sadly, this is not the case for everyone. To add enough storage capacity to handle this database, the other large databases, a mirror/backup of these databases, and still have room to grow, we were looking at an additional several hundred bucks a month in charges from our hosting company for the additional disk space. That can be a lot to handle for a small company. Some larger companies use SANs or similar storage devices that offer unmatched redundancy and reliability. However, these devices will often have a fixed maximum amount of storage, and can be very expensive to upgrade once the storage capacity is maxed out. Justifying the use of so much disk space on such an expensive resource could prove difficult.
Summary
Views were without a doubt the hardest part of this project to get right. My colleagues Dave and Jerry are still hard at work trying to find creative ways to reduce the size of the views on disk. We’re very happy with the performance boosts that we’ve seen using views in our testing environment. But unless we can find a way to reduce the disk usage, or find more affordable storage options, we may never see these performance boosts in production.
CouchDB: Views – The Advantages
This is part 3 in a series of posts that describe our investigation into CouchDB as a solution to several database related performance issues facing the TextMe application.
| << Part 2: Databases and Documents | Part 4: Views – The Challenges >> |
Views are what attracted us to CouchDB. If you’ve been reading the posts in this series, you already know that CouchDB is a document oriented database, and that documents themselves don’t have any official structure beyond the structure enforced by JSON. Views are what provide the necessary structure so that you can run queries on your data.
CouchDB has several strong points, including its efficient B-Tree data store implementation and replication/synchronization support. These strong points already set it apart from other, more traditional databases. However, we came for the views, because we saw views as the potential answer to our database performance woes.
CouchDB builds views using a map/reduce algorithm. When building a view, CouchDB will feed all documents that are new or have changed since the last time the view was built through a map function. The map function selects the documents of interest for that particular view. Then, optionally, a reduce function is run to calculate some aggregate statistics on the documents that have been selected (counts, sums, etc). There are several places you can go on the web for more information about how CouchDB views work.
A large part of the performance issues we are trying to address are being caused by repeatedly running the same database queries against very large tables, where the vast majority of the data in those tables has not changed since it was inserted. The last part of that statement is very important. The data has not changed since it was inserted, and due to the nature of these tables, it probably never will. It was very wasteful for us to keep running the same calculations on that old data.
This is where CouchDB views come in. When CouchDB generates a view, it stores the result of the view on disk in a B-Tree data structure, which is very efficient to access. CouchDB will only update that view when documents have been added or changed since the view was last built, and it only needs to run those added/changed documents through the map function. It incrementally updates the view’s index, so it doesn’t have to start from scratch every time. This makes views ideal for large data sets.
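The incremental idea can be illustrated with a toy model. This is only a sketch and not how CouchDB is actually implemented: the `createIndex` and `updateIndex` helpers, the `seq` numbers on each change, and the convention of passing `emit` into the map function as a second argument are all assumptions made to keep the example self-contained.

```javascript
// Toy model of an incrementally updated index (not CouchDB's real code).
// Each change carries a sequence number; the index remembers the last one
// it processed and skips everything at or before it.
function createIndex(mapFn) {
  return { mapFn: mapFn, lastSeq: 0, rowsByDoc: {} };
}

// `changes` is assumed to be sorted by ascending seq.
// Returns how many documents actually went through the map function.
function updateIndex(index, changes) {
  var processed = 0;
  changes.forEach(function (change) {
    if (change.seq <= index.lastSeq) {
      return; // already indexed, nothing to recompute
    }
    var rows = [];
    index.mapFn(change.doc, function (key, value) {
      rows.push({ key: key, value: value });
    });
    // Replace any rows emitted by an older revision of this document.
    index.rowsByDoc[change.doc._id] = rows;
    index.lastSeq = change.seq;
    processed += 1;
  });
  return processed;
}

var index = createIndex(function (doc, emit) {
  emit(doc.account_id, 1);
});

var changes = [
  { seq: 1, doc: { _id: "a", account_id: 1 } },
  { seq: 2, doc: { _id: "b", account_id: 2 } }
];
var firstRun = updateIndex(index, changes);

changes.push({ seq: 3, doc: { _id: "c", account_id: 1 } });
var secondRun = updateIndex(index, changes);

console.log(firstRun, secondRun);
```

In this model the second update only maps the one new document, which is the property that makes old, unchanging data cheap to keep in a view.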
Using views, we can replace all of the queries we were performing on these tables, and the calculations would be performed once, and then stored. Accessing that data would be as simple as issuing a single HTTP request, which would efficiently pull the data from the view’s B-Tree. In other words, it would be fast, and very efficient.
CouchDB views are also very flexible. The output of the map function is a key/value pair. That key/value pair can be anything…data from the document, hard coded values, whatever. This flexibility allows you to create complex keys, such as a JSON array of values. Using the view API, you can specify ranges of keys when executing your query, fetching only the data that you want. You also have the ability to group complex keys by the first n elements of the key, and run the reduce function on those groups of data. This enables you to fetch aggregate data on multiple levels, and allows you to support multiple queries with a single view. For example, we need to calculate the number of SMS messages sent by a particular account by minute, hour, day, month, year, etc. Using CouchDB’s view engine, we can have our map function emit a key of [account_id, year, month, day, hour, minute] and a value of 1 for each document in our messages database. Our reduce function simply sums all of the values for a matching key, using the provided sum function. Here is how simple the map/reduce code is for this view:
Map
function(doc) {
  // Slice the timestamp by position: year, month, day, hour, minute.
  var datetime = doc.created_at_utc;
  var year = parseInt(datetime.substr(0, 4), 10);
  var month = parseInt(datetime.substr(5, 2), 10);
  var day = parseInt(datetime.substr(8, 2), 10);
  var hour = parseInt(datetime.substr(11, 2), 10);
  var minute = parseInt(datetime.substr(14, 2), 10);
  emit([doc.account_id, year, month, day, hour, minute], 1);
}
Reduce
function(keys, values, rereduce) {
  return sum(values);
}
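If you want to try functions like these before loading them into CouchDB, one option is to run them locally with stand-ins for the helpers the view server provides. The sketch below is only an approximation: the `emit` and `sum` functions are simple replacements I wrote for the test, the sample documents are made up, and the timestamp format ("YYYY-MM-DD HH:MM:SS") is an assumption inferred from the substr offsets.

```javascript
// Stand-ins for the helpers that CouchDB's view server normally provides.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}
function sum(values) {
  return values.reduce(function (total, v) { return total + v; }, 0);
}

// The view functions under test.
var map = function (doc) {
  var datetime = doc.created_at_utc;
  var year = parseInt(datetime.substr(0, 4), 10);
  var month = parseInt(datetime.substr(5, 2), 10);
  var day = parseInt(datetime.substr(8, 2), 10);
  var hour = parseInt(datetime.substr(11, 2), 10);
  var minute = parseInt(datetime.substr(14, 2), 10);
  emit([doc.account_id, year, month, day, hour, minute], 1);
};
var reduce = function (keys, values, rereduce) {
  return sum(values);
};

// Made-up sample documents with the assumed timestamp format.
var docs = [
  { account_id: 1, created_at_utc: "2009-05-01 09:30:00" },
  { account_id: 1, created_at_utc: "2009-05-01 14:02:10" },
  { account_id: 2, created_at_utc: "2009-05-02 08:15:45" }
];

docs.forEach(function (doc) { map(doc); });

// The reduce function ignores its keys argument, so null is passed here.
var total = reduce(null, rows.map(function (r) { return r.value; }), false);

console.log(JSON.stringify(rows[0].key)); // [1,2009,5,1,9,30]
console.log(total);                       // 3
```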
Using the grouping feature of the view API, we can easily fetch message counts for this account by year, month, day, hour, or minute, by simply specifying how many levels of the key we would like to group together. For example, to get a breakdown of messages sent for a particular account on each day in May of 2009, we would simply include the following parameters in our URL when accessing the view: startkey=[1,2009,5]&endkey=[1,2009,5,{}]&group_level=4. These parameters tell the view that we only want to consider messages for account number 1 that were sent or received in May of 2009, and that we’d like the reduce results grouped on the first 4 elements of the key (account, year, month, and day), which gives us one row per day of the month. This would return something like:
{"rows":[
{"key":[1,2009,5,1],"value":13},
{"key":[1,2009,5,2],"value":9},
{"key":[1,2009,5,3],"value":10},
{"key":[1,2009,5,4],"value":9},
{"key":[1,2009,5,5],"value":11},
{"key":[1,2009,5,6],"value":17},
{"key":[1,2009,5,7],"value":12},
{"key":[1,2009,5,8],"value":12},
{"key":[1,2009,5,9],"value":14},
{"key":[1,2009,5,10],"value":8},
{"key":[1,2009,5,11],"value":12},
{"key":[1,2009,5,12],"value":11},
{"key":[1,2009,5,13],"value":9},
{"key":[1,2009,5,14],"value":20},
{"key":[1,2009,5,15],"value":7},
{"key":[1,2009,5,16],"value":15},
{"key":[1,2009,5,17],"value":8},
{"key":[1,2009,5,18],"value":8},
{"key":[1,2009,5,19],"value":13},
{"key":[1,2009,5,20],"value":7},
{"key":[1,2009,5,21],"value":12},
{"key":[1,2009,5,22],"value":28},
{"key":[1,2009,5,23],"value":8},
{"key":[1,2009,5,24],"value":4},
{"key":[1,2009,5,25],"value":2},
{"key":[1,2009,5,26],"value":16},
{"key":[1,2009,5,27],"value":15},
{"key":[1,2009,5,28],"value":12},
{"key":[1,2009,5,29],"value":7},
{"key":[1,2009,5,30],"value":5},
{"key":[1,2009,5,31],"value":6}
]}
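To see why group_level=4 produces one row per day, it can help to emulate the grouping on a handful of emitted rows. The following is a simplified sketch of the idea, not CouchDB's implementation: `groupByLevel` is a helper written for this example, the sample rows are invented, and it assumes a reduce function that simply sums the values.

```javascript
// Emulates group_level for a summing reduce: rows whose keys share the
// same first `level` elements are combined and their values added up.
function groupByLevel(rows, level) {
  var groups = new Map(); // keeps first-seen order; CouchDB returns key order
  rows.forEach(function (row) {
    var prefix = row.key.slice(0, level);
    var id = JSON.stringify(prefix);
    if (!groups.has(id)) {
      groups.set(id, { key: prefix, value: 0 });
    }
    groups.get(id).value += row.value;
  });
  return Array.from(groups.values());
}

// Invented rows in the shape [account_id, year, month, day, hour, minute].
var emitted = [
  { key: [1, 2009, 5, 1, 9, 30], value: 1 },
  { key: [1, 2009, 5, 1, 14, 2], value: 1 },
  { key: [1, 2009, 5, 2, 8, 15], value: 1 }
];

var byDay = groupByLevel(emitted, 4);
var byMonth = groupByLevel(emitted, 3);

console.log(JSON.stringify(byDay));
// [{"key":[1,2009,5,1],"value":2},{"key":[1,2009,5,2],"value":1}]
console.log(JSON.stringify(byMonth));
// [{"key":[1,2009,5],"value":3}]
```

Changing the level from 4 to 3 rolls the same rows up to a monthly total, which is how one view can answer several different queries.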
Views are re-built when they are accessed, and not when new documents are added to the database or existing documents are changed. However, you do have control over when views are built. If you specify stale=ok when accessing your view, CouchDB will not check to see if the view needs to be re-built. It will simply return results from the last time the view was built. We took advantage of this feature when writing the application code to access the views. In our case, data is only added to the database once a day, and it is added by a background job. When the job is finished inserting data into the CouchDB database, it triggers the views to re-build themselves by accessing all of the views in the database (a few at a time), without specifying the stale=ok flag. Since this background job takes on the responsibility of updating the views after it inserts new data, the rest of our application can always specify stale=ok when accessing the views. This keeps the queries executed by the application fast, even when views are in the process of being re-built.
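As a rough illustration of how application code might assemble these requests, here is a small helper that builds a view URL with JSON-encoded keys and an optional stale=ok flag. The `buildViewUrl` function, the host, and the database, design document, and view names are all hypothetical. The path layout shown is the one I understand CouchDB to use for views, so check it against the version you are running.

```javascript
// Builds a view query URL. Key parameters must be sent as JSON, so they
// are serialized first; everything is then URL-encoded.
function buildViewUrl(base, params) {
  var jsonParams = { key: true, startkey: true, endkey: true };
  var parts = Object.keys(params).map(function (name) {
    var value = params[name];
    var text = jsonParams[name] ? JSON.stringify(value) : String(value);
    return name + "=" + encodeURIComponent(text);
  });
  return base + "?" + parts.join("&");
}

// Hypothetical host, database, design document, and view names.
var base = "http://localhost:5984/messages/_design/stats/_view/by_account_and_time";

// What the application would request: fast, possibly slightly out of date.
var appUrl = buildViewUrl(base, {
  startkey: [1, 2009, 5],
  endkey: [1, 2009, 5, {}],
  group_level: 4,
  stale: "ok"
});

// What the background job would request to trigger an update:
// the same view, without stale=ok.
var refreshUrl = buildViewUrl(base, { group_level: 4 });

console.log(appUrl);
console.log(refreshUrl);
```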
Views are powerful, and offer a tremendous amount of flexibility. However, they come with their own set of challenges. In the next post, I will talk about some of the challenges we faced when attempting to replace our SQL queries against a MySQL database with CouchDB views.