CouchDB: Views – The Challenges

This is part 4 in a series of posts that describe our investigation into CouchDB as a solution to several database related performance issues facing the TextMe application.

In the previous post, I wrote about the many features of CouchDB views. In this post, I will describe the challenges we faced when replacing MySQL queries with CouchDB views.

Map/Reduce

One of the largest challenges with CouchDB views is simply wrapping your brain around the map/reduce model. If you’ve spent any significant amount of time in the relational model, this can be quite a task. Do not underestimate it. Give yourself plenty of time to make this adjustment. In my opinion, setting aside one or two full weeks to read about and experiment with views would not be excessive. It really is a whole new world. After several weeks, I’m still not 100% sure how to use the map/reduce model to its fullest potential. In fact, there were a few queries that I could not figure out how to implement as views. Because of that, I had to keep around an archive table in MySQL, and I run those few queries against that table.
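For readers coming from SQL, a toy simulation of the model may help. The sketch below is not CouchDB code: it is plain Javascript that mimics what a view does, with made-up documents and field names. One simplification to note: in a real CouchDB map function, emit is a global provided by the view server, and reduce also receives a third rereduce argument. Here emit is passed in as a parameter so the example can run on its own.

```javascript
// Toy documents, loosely modeled on a messaging app (field names are hypothetical).
const docs = [
  { _id: "1", account: "acme", status: "sent" },
  { _id: "2", account: "acme", status: "failed" },
  { _id: "3", account: "zeta", status: "sent" },
  { _id: "4", account: "acme", status: "sent" },
];

// Map: called once per document; emits zero or more key/value rows.
function map(doc, emit) {
  if (doc.account) emit(doc.account, 1);
}

// Reduce: collapses all values that share a key into a single value.
function reduce(keys, values) {
  return values.reduce((a, b) => a + b, 0);
}

// Stand-in for the view engine: map every doc, sort rows by key,
// group rows with equal keys, then reduce each group.
function runView(docs, map, reduce) {
  const rows = [];
  for (const doc of docs) {
    map(doc, (key, value) => rows.push({ key, value }));
  }
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));

  const grouped = new Map();
  for (const { key, value } of rows) {
    if (!grouped.has(key)) grouped.set(key, []);
    grouped.get(key).push(value);
  }
  return [...grouped].map(([key, values]) => ({ key, value: reduce(null, values) }));
}

const result = runView(docs, map, reduce);
console.log(result); // [ { key: 'acme', value: 3 }, { key: 'zeta', value: 1 } ]
```

This is roughly the job a GROUP BY with COUNT(*) does in SQL. The difference is that the grouping key is fixed when the map function is written, not when the query is run.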

Javascript

If you don’t know Javascript, you may want to tack on another week to the learning curve. Javascript is an incredibly powerful language. However, in its raw form it is fairly basic and can take some getting used to. It does help that you don’t need to write too much Javascript to implement most map and reduce functions. However, if you’re new to Javascript, get ready to do some research.

There are view servers available in other programming languages, and it is easy to configure CouchDB to use them. But, CouchDB is still young and under heavy development, and these alternative view servers are not supported by the CouchDB team. So, use them at your own risk.

SQL

As I mentioned in the previous post, views are powerful and flexible. But, views are not nearly as flexible as SQL. SQL has been in development for decades. Even today, it continues to evolve as a language. You can do a ton with SQL. As of right now, views simply cannot rival this flexibility. The CouchDB team continues to add built-in Javascript functions to help write map/reduce code, and there is even talk about supporting map/reduce/merge. But as of right now, the feature sets are not even close. It is very difficult for any new technology to enter the game with the same, or even a comparable, feature set as such a battle-hardened veteran. And to be honest, I highly doubt that the CouchDB team is even trying to match SQL’s feature set. After all, CouchDB is not meant to be a replacement for the relational database. However, this is an important point to consider if you think your current relational-database-backed application might be a good fit for CouchDB.

Multiple views, one design document

Views live in documents called “design documents”. Views within the same design document share a B-Tree data structure. This means that when one view in the design document is built, they all are built. So, careful planning is required to make sure unrelated views do not live in the same design document. You would not want the re-building of one view to delay the accessibility of another, totally unrelated view.
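As a rough illustration, here is how two unrelated groups of views might be split across two design documents. The document names, view names, and field names are all hypothetical. The overall shape (an _id starting with _design/, and a views object whose entries hold map and reduce functions as strings) is the standard design document layout, and sum is a helper the Javascript view server provides to reduce functions.

```javascript
// Hypothetical design document holding message statistics views.
// Both views here are built together, per the behavior described above.
const messageStats = {
  _id: "_design/message_stats",
  language: "javascript",
  views: {
    by_account: {
      map: "function(doc) { if (doc.account) emit(doc.account, 1); }",
      reduce: "function(keys, values) { return sum(values); }",
    },
    by_status: {
      map: "function(doc) { if (doc.status) emit(doc.status, 1); }",
      reduce: "function(keys, values) { return sum(values); }",
    },
  },
};

// Unrelated views go in a separate design document, so rebuilding
// the statistics views does not hold these up.
const billing = {
  _id: "_design/billing",
  language: "javascript",
  views: {
    by_invoice: {
      map: "function(doc) { if (doc.invoice) emit(doc.invoice, null); }",
    },
  },
};

// Lists the views that share an index, and so a build, within one design document.
function viewsBuiltTogether(designDoc) {
  return Object.keys(designDoc.views);
}

// A design document is plain JSON, so this string is what would be saved to the database.
const payload = JSON.stringify(messageStats, null, 2);

console.log(viewsBuiltTogether(messageStats)); // [ 'by_account', 'by_status' ]
console.log(viewsBuiltTogether(billing));      // [ 'by_invoice' ]
```

Where to draw the line is a judgment call: views that are always queried together lose little by sharing a design document, while a rarely used reporting view probably should not sit next to one the application hits constantly.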

Building/Indexing views

Views can take a L…O…N…G time to build from scratch. The view building process is also very resource intensive. This becomes less of an issue once the views have been built, as views are updated incrementally. It really only comes into play when you are adding many, many documents to a CouchDB database between view builds. However, one place where this is an issue is ad-hoc queries. Every week or two, I’ll get a request from a customer for data that is not available via our web application. While we’ll throw that request onto the product backlog so it is eventually available via the application, it doesn’t change the fact that our customer needs that information now. We usually satisfy such requests by firing up the MySQL client, and running one or more ad-hoc queries. This simply is not feasible using CouchDB views, especially if you are working with a large database containing millions of documents. CouchDB does support “temporary views”, which are ad-hoc views that you can build and execute on the fly. However, temporary views are not recommended for production use, because they need to be built before they can get you the data you need. This could take hours or days, depending on the size of your database and the processing power of your database server. Temporary views are meant as a way to test new views in development which will eventually be saved into a design document, and not for running ad-hoc queries.

View sizes on disk

I’ve already mentioned that each design document is stored in its own B-Tree, completely separate from the B-Tree that holds the documents in the database. These data structures can become quite large, especially if you have a ton of documents in your database. A large database combined with several design documents can take up quite a bit of disk space. For example, our main messaging database consisting of around 30 million documents is 20GB on disk. The 8 design documents for that database combine for a total size of 35GB. This brings the grand total, for the documents and the views, to 55GB. That’s a whole lot of disk space. CouchDB sacrifices disk space for performance, which is a good tradeoff, as disk space is cheap and virtually limitless these days. But sadly, this is not the case for everyone. To add enough storage capacity to handle this database, the other large databases, a mirror/backup of these databases, and still have room to grow, we were looking at an additional several hundred bucks a month in charges from our hosting company for the additional disk space. That can be a lot to handle for a small company. Some larger companies use SANs or similar storage devices that offer unmatched redundancy and reliability. However, these devices will often have a fixed maximum amount of storage, and can be very expensive to upgrade once the storage capacity is maxed out. Justifying the use of so much disk space on such an expensive resource could prove difficult.
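One factor behind these numbers is that a view stores every key and value its map function emits, so the choice of emitted value feeds directly into view size. The sketch below is only a rough illustration: it uses the length of the JSON text as a stand-in for stored size, which is not CouchDB's actual on-disk format, and the document and its field names are made up. As in any standalone simulation, emit is passed in as a parameter here, whereas in CouchDB it is a global.

```javascript
// Hypothetical message document with a large body field.
const doc = {
  _id: "msg-1",
  key: "acme",
  field1: "sent",
  field2: "2010-01-01T00:00:00Z",
  body: "x".repeat(500),
};

// Approximates the size of everything a map function emits for one document,
// using JSON text length as a crude proxy for bytes stored in the view.
function emittedSize(map, doc) {
  let size = 0;
  map(doc, (key, value) => {
    size += JSON.stringify([key, value]).length;
  });
  return size;
}

// Three ways to write the same view, from heaviest to lightest.
const wholeDoc = (doc, emit) => emit(doc.key, doc);
const twoFields = (doc, emit) =>
  emit(doc.key, { field1: doc.field1, field2: doc.field2 });
const keyOnly = (doc, emit) => emit(doc.key, null);

const sizes = {
  wholeDoc: emittedSize(wholeDoc, doc),
  twoFields: emittedSize(twoFields, doc),
  keyOnly: emittedSize(keyOnly, doc),
};
console.log(sizes);
```

Multiply the per-document figure by the number of documents, by the number of rows each document emits, and by the number of views, and it becomes easier to see how a set of views can end up larger than the database they index.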

Summary

Views were without a doubt the hardest part of this project to get right. My colleagues Dave and Jerry are still hard at work trying to find creative ways to reduce the size of the views on disk. We’re very happy with the performance boosts that we’ve seen using views in our testing environment. But unless we can find a way to reduce the disk usage, or find more affordable storage options, we may never see these performance boosts in production.

    9 thoughts on “CouchDB: Views – The Challenges”

    1. One possibility for reducing disk space used is being careful what you emit in your mapping functions. If you have a map function something like:

      function(doc) { emit(doc.key, doc); }

      Then the second parameter is storing a copy of the entire document. If you know that the particular view will only need one or two fields from the document, you could construct it like:

      function(doc) { emit(doc.key, { "field1": doc.field1, "field2": doc.field2 }); }

      Or, if you do need the whole document, you could write the map as:

      function(doc) { emit(doc.key, null); }

      and then query it with the “include_docs=true” parameter. This takes a performance hit though as it has to do a separate disk seek to pull each document matching each row of the view.

      When I discussed some of these aspects on the mailing list the basic response was, if you’re that worried about disk space then CouchDB might not be suitable for you.

    2. Hi Evan, thanks for the comment.

      I’m a bit behind with these posts, and we’ve since figured out how to reduce the amount of disk space being used by the views. What was burning us was not so much what we were emitting, but how many times we were emitting it. The database in question has approximately 30 million documents. Each document consists of one or more messages. And, each map function emitted data from each message.

      To reduce the size of the views, we simply needed to cut back on what we were emitting (as you suggested). Since we didn’t really need to view the data at this fine level of detail, we created “summary” documents that contained aggregate data for the messages at the level of detail we needed. For example, we had several views that reported certain stats for a given period of time. Before making the change, we were able to get these stats for every second, which we didn’t really need. Instead, the summary documents contain aggregate stats per minute. The views now aggregate data in these summary documents to get us stats per minute, hour, day, etc. This cut back dramatically on the amount of data we were emitting, and the size of the views on disk.

      I was planning on including this in an upcoming post, but I guess there’s no time like the present :)

    3. Hi John,

      Just wanted to say thanks for a great series. It’s very informative to have your experiences, both good and bad, distilled down to a series of blog posts. So thanks a lot and I hope you will keep them coming.

    4. Thanks Jacob. I wasn’t able to find many real-world experiences with CouchDB when we started this project. So, I was hoping that this would help others :)

    5. Pingback: Linktipps #5 :: Blackflash

    6. Hey man, I’m really digging the article. So to return the favor, one way to help speed up views is to place the view cache on a separate set of physical disks from the DB store.

      :)

    7. This is a really great series, I have gone through the whole lot and found them very useful. Thanks so much for taking the time to write this up, it’s a great help to the CouchDB and NoSQL community!
