In the previous post, I wrote about the many features of CouchDB views. In this post, I will describe the challenges we faced when replacing MySQL queries with CouchDB views.
One of the largest challenges with CouchDB views is simply wrapping your brain around the map/reduce model. If you’ve spent any significant amount of time in the relational model, this can be quite a task. Do not underestimate it. Give yourself plenty of time to make this adjustment. In my opinion, setting aside one or two full weeks to read about and experiment with views would not be excessive. It really is a whole new world. After several weeks, I’m still not 100% sure how to use the map/reduce model to its fullest potential. In fact, there were a few queries that I could not figure out how to implement as views. Because of that, I had to keep around an archive table in MySQL, and I run those few queries against that table.
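To make the shift in thinking a little more concrete, here is a minimal sketch of the map/reduce idea in plain Python. It does not talk to CouchDB at all. The documents, the field names (type, sender), and the helper run_view are assumptions made up for illustration. It computes roughly what SELECT sender, COUNT(*) ... GROUP BY sender would give you in SQL.

```python
from collections import defaultdict

# Hypothetical documents; the field names are assumptions for illustration.
docs = [
    {"_id": "1", "type": "message", "sender": "alice"},
    {"_id": "2", "type": "message", "sender": "bob"},
    {"_id": "3", "type": "message", "sender": "alice"},
    {"_id": "4", "type": "user", "name": "carol"},
]

def map_fn(doc):
    """Emit one (key, value) row per message, like a view's map function."""
    if doc.get("type") == "message":
        yield doc["sender"], 1

def reduce_fn(values):
    """Combine all values emitted for a single key."""
    return sum(values)

def run_view(docs):
    """Map every document, group the rows by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in sorted(grouped.items())}

counts = run_view(docs)
print(counts)
```

The point to notice is that the map function only ever sees one document at a time. There is no join and no WHERE clause across rows, which is exactly the part that takes getting used to.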
There are view servers available in other programming languages, and it is easy to configure CouchDB to use them. But, CouchDB is still young and under heavy development, and these alternative view servers are not supported by the CouchDB team. So, use them at your own risk.
Multiple views, one design document
Views live in documents called “design documents”. Views within the same design document share a B-Tree data structure. This means that when one view in the design document is built, they all are built. So, careful planning is required to make sure unrelated views do not live in the same design document. You would not want the re-building of one view to delay the accessibility of another, totally unrelated view.
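As a rough sketch of that planning step, the snippet below groups views into design documents by feature area, so that unrelated views never end up sharing an index. The view names, the "area" tags, and the map functions are all hypothetical. The overall design document shape (an _id starting with _design/ and a views object holding map and optional reduce entries) follows the usual CouchDB layout, but check it against the version you are running.

```python
import json

# Hypothetical views, each tagged with the feature area it serves.
views = [
    {"area": "messages", "name": "by_sender",
     "map": "function(doc) { if (doc.type == 'message') emit(doc.sender, 1); }",
     "reduce": "_count"},
    {"area": "messages", "name": "by_date",
     "map": "function(doc) { if (doc.type == 'message') emit(doc.sent_at, null); }"},
    {"area": "billing", "name": "by_customer",
     "map": "function(doc) { if (doc.type == 'invoice') emit(doc.customer, doc.total); }",
     "reduce": "_sum"},
]

def build_design_docs(views):
    """Build one design document per area so unrelated views are indexed separately."""
    docs = {}
    for view in views:
        doc_id = "_design/" + view["area"]
        doc = docs.setdefault(
            doc_id, {"_id": doc_id, "language": "javascript", "views": {}}
        )
        body = {"map": view["map"]}
        if "reduce" in view:
            body["reduce"] = view["reduce"]
        doc["views"][view["name"]] = body
    return list(docs.values())

design_docs = build_design_docs(views)
print(json.dumps(design_docs, indent=2))
```

With this split, rebuilding the billing view would not hold up queries against either of the messages views, and vice versa.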
Views can take a L…O…N…G time to build from scratch. The view building process is also very resource intensive. This becomes less of an issue once the views have been built, as views are updated incrementally. After that, it really only comes into play when you add many, many documents to a CouchDB database between view builds.
However, one place where this is an issue is ad-hoc queries. Every week or two, I’ll get a request from a customer for data that is not available via our web application. While we’ll throw that request onto the product backlog so it is eventually available via the application, it doesn’t change the fact that our customer needs that information now. We usually satisfy such requests by firing up the MySQL client and running one or more ad-hoc queries. This simply is not feasible using CouchDB views, especially if you are working with a large database containing millions of documents. CouchDB does support “temporary views”, which are ad-hoc views that you can build and execute on the fly. However, temporary views are not recommended for production use, because they need to be built before they can get you the data you need. This could take hours or days, depending on the size of your database and the processing power of your database server. Temporary views are meant as a way to test new views in development, which will eventually be saved into a design document, and not for running ad-hoc queries.
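The difference between an incremental update and a build from scratch can be illustrated with a toy index. This is not how CouchDB is implemented. The class ToyViewIndex, the (sequence, document) change tuples, and the sender field are all assumptions for illustration, and the toy ignores updates and deletes of existing documents.

```python
class ToyViewIndex:
    """Toy model of an incrementally updated view index (not CouchDB code)."""

    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.rows = []       # emitted (key, value, doc_id) rows, kept sorted
        self.last_seq = 0    # highest change sequence already indexed

    def update(self, changes):
        """Index only changes newer than last_seq; return how many were processed."""
        processed = 0
        for seq, doc in changes:
            if seq <= self.last_seq:
                continue  # already indexed on an earlier update
            for key, value in self.map_fn(doc):
                self.rows.append((key, value, doc["_id"]))
            self.last_seq = seq
            processed += 1
        self.rows.sort()
        return processed


def by_sender(doc):
    """Hypothetical map function: one row per document, keyed by sender."""
    return [(doc["sender"], 1)]


def make_change(seq):
    """Build a hypothetical (sequence, document) change entry."""
    return (seq, {"_id": str(seq), "sender": "user%d" % (seq % 3)})


changes = [make_change(seq) for seq in range(1, 1001)]

saved_view = ToyViewIndex(by_sender)
first_build = saved_view.update(changes)     # full build: every document

changes.extend(make_change(seq) for seq in range(1001, 1006))
next_update = saved_view.update(changes)     # incremental: only the new ones

temp_view = ToyViewIndex(by_sender)          # a brand new view has no index yet
temp_build = temp_view.update(changes)       # so it pays the full cost again

print(first_build, next_update, temp_build)
```

A saved view only pays for the documents that changed since the last query, while a new or temporary view always starts from zero. Scale the toy numbers up to millions of documents and the hours-or-days estimate starts to make sense.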
View sizes on disk
I’ve already mentioned that each design document is stored in its own B-Tree, completely separate from the B-Tree that holds the documents in the database. These data structures can become quite large, especially if you have a ton of documents in your database. A large database combined with several design documents can take up quite a bit of disk space. For example, our main messaging database, consisting of around 30 million documents, is 20GB on disk. The 8 design documents for that database combine for a total size of 35GB. This brings the grand total, for the documents and the views, to 55GB. That’s a whole lot of disk space. CouchDB sacrifices disk space for performance, which is a good tradeoff when disk space is cheap and virtually limitless, as it is for many people these days. But sadly, this is not the case for everyone. To add enough storage capacity to handle this database, the other large databases, a mirror/backup of these databases, and still have room to grow, we were looking at an additional several hundred bucks a month in charges from our hosting company for the additional disk space. That can be a lot to handle for a small company. Some larger companies use SANs or similar storage devices that offer unmatched redundancy and reliability. However, these devices will often have a fixed maximum amount of storage, and can be very expensive to upgrade once the storage capacity is maxed out. Justifying the use of so much disk space on such an expensive resource could prove difficult.
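For a quick back-of-the-envelope check, the figures above can be plugged into a few lines of Python. The 20GB, 35GB, and 8 design documents come from our messaging database. The mirror calculation assumes a single full copy of both the documents and the views, which is a simplification and not a description of our actual backup setup.

```python
def disk_footprint(db_gb, views_gb, design_doc_count, mirror_copies=1):
    """Summarize disk usage for one database and its view indexes."""
    total_gb = db_gb + views_gb
    return {
        "total_gb": total_gb,
        # How much view index we store per GB of documents.
        "views_per_db_gb": views_gb / db_gb,
        "avg_gb_per_design_doc": views_gb / design_doc_count,
        # Assumption: each mirror is a full copy of documents plus views.
        "total_with_mirrors_gb": total_gb * (1 + mirror_copies),
    }

footprint = disk_footprint(db_gb=20, views_gb=35, design_doc_count=8)
for name, value in footprint.items():
    print(name, value)
```

In other words, the views take up 1.75 times as much space as the documents they index, and a single mirrored copy doubles the 55GB to 110GB before any of the other databases or room for growth are counted.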
Views were without a doubt the hardest part of this project to get right. My colleagues Dave and Jerry are still hard at work trying to find creative ways to reduce the size of the views on disk. We’re very happy with the performance boosts that we’ve seen using views in our testing environment. But unless we can find a way to reduce the disk usage, or find more affordable storage options, we may never see these performance boosts in production.