Addressing the remaining issues
We were almost there. After modifying the code to talk to CouchDB, TextMe was successfully pulling data from CouchDB in our development environments. There were just a few remaining issues that needed to be addressed before we could deploy CouchDB to production.
Reducing the view sizes on disk
As I mentioned in a previous post, the amount of disk space consumed by the views was a big problem. If we didn’t do something, we were sure to run out of disk space when migrating our 30 million row messages table to CouchDB.
We determined that it was not what we were emitting from our map functions that was killing us, but how many times we were emitting it. Each of the views emitted a key/value pair for every document in the database. At 30 million documents and 8 views, that ends up being a crap load of key/value pairs: 240 million of them.
My colleagues Dave and Jerry took a detailed look at the problem, and came up with a solution. They determined that there was simply no need to be emitting data for each document in the database. While this would give us views that could report statistics by the second, our application only supported presenting statistics by the minute. Even if we were able to support statistics at this level of detail, we doubted our customers would even need it. It was simply not worth the disk space.
So, Dave and Jerry modified the import job described in the previous post to roll up several key statistics by the minute as it was building the documents. When the job finishes processing all of the documents for that minute, it creates a summary document containing all of the rolled up statistics, and adds it to the database. Then, they changed the map functions to only consider these summary documents.
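The rollup can be sketched roughly as follows. This is a minimal illustration, not the actual TextMe code: the field names (`type`, `minute`, `sent`, `received`, `total`), the `build_summary` helper, and the sample map function are all assumptions.

```ruby
# Roll all of the message documents for one minute into a single
# summary document (field names are hypothetical).
def build_summary(minute, docs)
  {
    'type'     => 'summary',  # the map functions filter on this field
    'minute'   => minute,     # e.g. "2009-10-05T09:41"
    'sent'     => docs.count { |d| d['direction'] == 'outbound' },
    'received' => docs.count { |d| d['direction'] == 'inbound' },
    'total'    => docs.size
  }
end

# A map function that considers only the summary documents,
# skipping the millions of raw message documents entirely.
MAP_BY_MINUTE = <<-JS
  function(doc) {
    if (doc.type == 'summary') {
      emit(doc.minute, doc.total);
    }
  }
JS

docs = [
  { 'direction' => 'outbound' },
  { 'direction' => 'inbound'  },
  { 'direction' => 'outbound' }
]
summary = build_summary('2009-10-05T09:41', docs)
puts summary['sent']   # 2
```

Because the map functions skip every non-summary document, each view emits one row per minute instead of one row per message, which is where the disk savings come from.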
This solution was able to dramatically reduce the sizes of the views on disk, while still supporting the current application functionality. Since we are still persisting all of the original documents to CouchDB, it is possible to add a new statistic to the summary documents should we ever need to.
Oh, and we also picked up two new terabyte database servers, just in case :)
Paginating records in CouchDB
Like many Rails applications, we were using the popular will_paginate gem to paginate results from the database. Given the size of our data sets, pagination was an absolute necessity to keep from using up every last bit of memory.
CouchRest has a Pager class that paginates over view results, but it is in the CouchRest Core part of the library and doesn't integrate too well with the object model part of the library. It simply returns the view results as an array of hashes. We were hoping to see a solution that would give us back an array of the corresponding ExtendedDocument objects. We were also trying to keep our application from having to know about CouchDB outside of the classes described in the previous post. Having completely different pagination strategies for the two databases would make that more difficult.
So, I decided to write some new pagination code that supported the will_paginate interface and integrated a little better with the object model part of CouchRest. I had a quick solution that same day which fetched view results and handed back an array of the corresponding ExtendedDocument objects. I then spent some time over the next two weeks modifying the code to integrate a little better with CouchRest and add support for CouchRest views, which we weren't using.
With the new code in place, we can now paginate over a set of contest entries without having to know what database they are coming from.
ContestCampaignEntryDelegate.contest_campaign_entries.paginate(:page => 1, :per_page => 50)
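The shape of that will_paginate-compatible interface can be sketched as below. This is a self-contained illustration, not the actual CouchRest code: `fetch_view` is a stand-in for a real CouchDB view query, and the real code hands back ExtendedDocument objects rather than raw values.

```ruby
# A will_paginate-style collection: an Array that also knows its
# page, page size, and the total number of entries.
class PaginatedCollection < Array
  attr_reader :current_page, :per_page, :total_entries

  def initialize(rows, page, per_page, total)
    super()
    concat(rows)
    @current_page  = page
    @per_page      = per_page
    @total_entries = total
  end

  def total_pages
    (total_entries.to_f / per_page).ceil
  end
end

# Stand-in for a CouchDB view: real code would pass :skip and :limit
# to the view query and wrap each returned row in an ExtendedDocument.
ALL_ROWS = (1..120).to_a

def fetch_view(skip:, limit:)
  ALL_ROWS[skip, limit] || []
end

def paginate(page:, per_page:)
  rows = fetch_view(skip: (page - 1) * per_page, limit: per_page)
  PaginatedCollection.new(rows, page, per_page, ALL_ROWS.size)
end

entries = paginate(page: 2, per_page: 50)
puts entries.first        # 51
puts entries.total_pages  # 3
```

Worth noting: skip-based paging is expensive in CouchDB for deep pages, since the database still walks the skipped rows; paging by startkey/startkey_docid scales better for large views.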
This pagination code eventually made it into CouchRest.
With the remaining issues addressed, it was time to start the production migration. One at a time, we manually started the jobs to move the data from MySQL to CouchDB. When one job completed, we would start the next. As I mentioned before, building the views is very resource intensive. We didn’t want to completely bog down the production machine we were using to do the migration by running multiple jobs at once.
Moving the archived data from MySQL to CouchDB and building all of the views took about a week (a day for this table, a couple of days for that table, etc). Overall, it was a fairly smooth process.
For the initial import, we did not purge any of the data from MySQL. Since we needed to wait until our CouchDB databases were fully populated with all views built before we could start using them, the application needed to continue working with the data in MySQL while the migration was in progress. In anticipation of the eventual switch from MySQL to CouchDB, I added a flag in the application configuration that told the application if it should pull archived data from CouchDB. Once all of the data had been imported and all of the views had been built, we flipped the switch.
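The switch itself amounts to a single configuration check, along these lines; the flag name and method are hypothetical stand-ins for the real application code.

```ruby
# Hypothetical application config; flipping this value is "the switch".
APP_CONFIG = { 'use_couchdb_for_archives' => true }

# Returns which store the application should read archived data from.
def archived_data_source
  APP_CONFIG['use_couchdb_for_archives'] ? :couchdb : :mysql
end

puts archived_data_source  # :couchdb
```

Keeping the check in one place meant the rest of the application never had to know which database the archived data was coming from.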
With the pouring of a celebratory beer, we watched as our application began pulling data from CouchDB in production. It was time to relax :)
I really wish we had taken the time to record how long our troublesome pages were taking to load before the move to CouchDB. Sadly, we did not. All I can say is that pages that used to occasionally time out were now loading in a few seconds. Since the migration, we have also implemented a few new features that would simply not have been possible without CouchDB due to database performance issues.
The database performance issues we set out to address seem to be a thing of the past. If new ones pop up, I’m confident that we could once again utilize CouchDB to address them.
This project was focused on addressing database-related performance issues that we were facing in production. With these issues out of the way, and our CouchDB infrastructure built out and proven, we will soon be building even more reporting capabilities that would have simply killed our old database. TextMe customers will soon be able to view their data in more ways than they could have imagined.
I am also working on a project that takes advantage of CouchDB's schema-less nature to let our customers store and utilize data they collect from their customers. Such a feature, which essentially lets customers define their own schema, would have been a challenge to implement in a relational database. With CouchDB, it's just a document.
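To illustrate what "it's just a document" means here: each customer-defined field simply becomes a key in the document, so two entries from the same customer can have completely different shapes. The fields below are made up for illustration.

```ruby
require 'json'

# Two entries collected by the same customer, each with its own shape.
# A relational schema would need migrations (or an awkward key/value
# table) to accommodate this; CouchDB just stores the documents.
entry_one = { 'type' => 'customer_data', 'name' => 'Ann',
              'favorite_team' => 'Cubs' }
entry_two = { 'type' => 'customer_data', 'name' => 'Bob',
              'shirt_size' => 'XL', 'opted_in' => true }

# Both serialize to valid CouchDB documents with no schema change.
puts JSON.generate(entry_one)
puts JSON.generate(entry_two)
```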
Thoughts about this project, and CouchDB
I learned a ton while working on this project. While I was vaguely familiar with NoSQL databases before this project, I have only recently become aware of all of the alternatives available. With the enormous amount of data companies are beginning to collect and process, I'm sure that CouchDB and its NoSQL friends will soon become a common component in the operational environments of most companies.
The CouchDB community has been great. The CouchDB and CouchRest mailing lists are extremely active, and have been very helpful. The committers on both of these projects are active, and always eager to help. I’d specifically like to call out Jan Lehnardt and Chris Anderson from the CouchDB project. Jan has commented on a few of these posts, encouraging me to keep writing. He also suggested a more efficient implementation of the CouchRest pagination code I wrote, which I quickly implemented. Chris left a comment on the first post in this series thanking me for writing about CouchDB, and offering his assistance if I needed it. I actually took Chris up on that offer when we were running into issues regarding the sizes of the views on disk. He was quick to reply, offering several suggestions. I’d like to thank Jan and Chris for their support and encouragement.
NoSQL databases are here to stay, and CouchDB is truly unique in this area. The way it handles views, and its support for replication/synchronization set it apart from the others. There are already several large projects, like Ubuntu One, that are relying on CouchDB to deliver what nobody else can. Because of this, I’m sure CouchDB has a very bright future ahead of it.