CouchDB: The Last Mile

This is the 6th and final post in a series that describes our investigation into CouchDB as a solution to several database related performance issues facing the TextMe application.

Addressing the remaining issues

We were almost there. After modifying the code to talk to CouchDB, TextMe was successfully pulling data from CouchDB in our development environments. There were just a few remaining issues that needed to be addressed before we could deploy CouchDB to production.

Reducing the view sizes on disk

As I mentioned in a previous post, the amount of disk space consumed by the views was a big problem. If we didn’t do something, we were sure to run out of disk space when migrating our 30 million row messages table to CouchDB.

We determined that it was not what we were emitting from our map functions that was killing us, but how many times we were emitting it. Each of the views emitted a key/value pair for every document in the database. At 30 million documents and 8 views, that ends up being a crap load of key/value pairs.

My colleagues Dave and Jerry took a detailed look at the problem, and came up with a solution. They determined that there was simply no need to be emitting data for each document in the database. While this would give us views that could report statistics by the second, our application only supported presenting statistics by the minute. Even if we were able to support statistics at this level of detail, we doubted our customers would even need it. It was simply not worth the disk space.

So, Dave and Jerry modified the import job described in the previous post to roll up several key statistics by the minute as it was building the documents. When the job finishes processing all of the documents for that minute, it creates a summary document containing all of the rolled up statistics, and adds it to the database. Then, they changed the map functions to only consider these summary documents.
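To make the roll-up idea concrete, here is a minimal sketch of grouping documents by minute and producing one summary document per campaign per minute. It is an illustration only, not the actual import job: the field names (received_at, campaign_id, message_count, type) and the summarize_by_minute helper are assumptions, and the real job builds its summaries incrementally as it processes each minute rather than grouping everything in memory.

```ruby
require 'time'

# Build one summary document per campaign per minute.
# All field names here are hypothetical, chosen for illustration.
def summarize_by_minute(messages)
  counts = Hash.new(0)

  messages.each do |doc|
    time   = Time.parse(doc['received_at']).utc
    minute = time.strftime('%Y-%m-%dT%H:%M:00Z') # truncate to the minute
    counts[[doc['campaign_id'], minute]] += 1
  end

  counts.map do |(campaign_id, minute), count|
    { 'type'          => 'summary',
      'campaign_id'   => campaign_id,
      'minute'        => minute,
      'message_count' => count }
  end
end

messages = [
  { 'campaign_id' => 7, 'received_at' => '2009-10-01T12:00:05Z' },
  { 'campaign_id' => 7, 'received_at' => '2009-10-01T12:00:48Z' },
  { 'campaign_id' => 7, 'received_at' => '2009-10-01T12:01:10Z' }
]
summaries = summarize_by_minute(messages)
```

A map function that only looks at documents of the summary type then emits one row per minute instead of one row per message, which is where the disk savings come from.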

This solution was able to dramatically reduce the sizes of the views on disk, while still supporting the current application functionality. Since we are still persisting all of the original documents to CouchDB, it is possible to add a new statistic to the summary documents should we ever need to.

Oh, and we also picked up two new terabyte database servers, just in case :)

Paginating records in CouchDB

Like many Rails applications, we were using the popular will_paginate gem to paginate results from the database. Given the size of our data sets, pagination was an absolute necessity to keep from using up every last bit of memory.

CouchRest has a Pager class that paginates over view results, but it is in the CouchRest Core part of the library and doesn’t integrate too well with the object model part of the library. It simply returns the view results as an array of hashes. We were hoping to see a solution that would give us back an array of the corresponding ExtendedDocument objects. We were also trying to keep our application from having to know about CouchDB outside of the classes described in the previous post. Having completely different pagination strategies for the two databases would make that more difficult.

So, I decided to write some new pagination code that supported the will_paginate interface and integrated a little better with the object model part of CouchRest. I had a quick solution that same day which fetched view results and handed back an array of the corresponding ExtendedDocument objects. I then spent some time over the next two weeks modifying the code to integrate a little better with CouchRest and add support for CouchRest views, which we weren’t using.
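As a rough sketch of what sits underneath a will_paginate style call, the page and per_page arguments can be translated into the limit and skip parameters that CouchDB accepts on view queries. The view_params_for helper and the default of 30 rows per page are assumptions for illustration; this is not necessarily how the code that landed in CouchRest does it.

```ruby
# Translate will_paginate style options into CouchDB view query parameters.
# Hypothetical helper, shown only to illustrate the mapping.
def view_params_for(options = {})
  page     = [options.fetch(:page, 1).to_i, 1].max # treat page 0 or less as page 1
  per_page = options.fetch(:per_page, 30).to_i     # assumed default page size
  { :limit => per_page, :skip => (page - 1) * per_page }
end

params = view_params_for(:page => 3, :per_page => 50)
```

With these parameters the database only returns the 50 rows of the requested page, so the application never has to hold the full result set in memory.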

With the new code in place, we can now paginate over a set of contest entries without having to know what database they are coming from.

ContestCampaignEntryDelegate.contest_campaign_entries.paginate(
  :page => 1, :per_page => 50)

This pagination code eventually made it into CouchRest.

Going live

With the remaining issues addressed, it was time to start the production migration. One at a time, we manually started the jobs to move the data from MySQL to CouchDB. When one job completed, we would start the next. As I mentioned before, building the views is very resource intensive. We didn’t want to completely bog down the production machine we were using to do the migration by running multiple jobs at once.

Moving the archived data from MySQL to CouchDB and building all of the views took about a week (a day for this table, a couple of days for that table, etc). Overall, it was a fairly smooth process.

For the initial import, we did not purge any of the data from MySQL. Since we needed to wait until our CouchDB databases were fully populated with all views built before we could start using them, the application needed to continue working with the data in MySQL while the migration was in progress. In anticipation of the eventual switch from MySQL to CouchDB, I added a flag in the application configuration that told the application if it should pull archived data from CouchDB. Once all of the data had been imported and all of the views had been built, we flipped the switch.

With the pouring of a celebratory beer, we watched as our application began pulling data from CouchDB in production. It was time to relax :)

The results

I really wish we had taken the time to record how long our troublesome pages were taking to load before the move to CouchDB. Sadly, we did not. All I can say is that pages that used to occasionally time out were now loading in a few seconds. Since the migration, we have also implemented a few new features that would simply not have been possible without CouchDB due to database performance issues.

The database performance issues we set out to address seem to be a thing of the past. If new ones pop up, I’m confident that we could once again utilize CouchDB to address them.

What’s next

This project was focused on addressing database related performance issues that we were facing in production. With these issues out of the way, and our CouchDB infrastructure built-out and proven, we will soon be building even more reporting capabilities that would have simply killed our old database. TextMe customers will soon be able to view their data in more ways than they could have imagined.

I am also working on a project that takes advantage of CouchDB’s schema-less nature to let our customers store and utilize data they collect from their customers. Such a feature, which essentially lets customers define their own schema, would have been a challenge to implement in a relational database. With CouchDB, it’s just a document.

Thoughts about this project, and CouchDB

I learned a ton while working on this project. While vaguely familiar with NoSQL databases before this project, I have just recently become aware of all of the alternatives available. With the enormous amount of data companies are beginning to collect and process, I’m sure that CouchDB and its NoSQL friends will soon become a common component in the operational environments of most companies.

The CouchDB community has been great. The CouchDB and CouchRest mailing lists are extremely active, and have been very helpful. The committers on both of these projects are active, and always eager to help. I’d specifically like to call out Jan Lehnardt and Chris Anderson from the CouchDB project. Jan has commented on a few of these posts, encouraging me to keep writing. He also suggested a more efficient implementation of the CouchRest pagination code I wrote, which I quickly implemented. Chris left a comment on the first post in this series thanking me for writing about CouchDB, and offering his assistance if I needed it. I actually took Chris up on that offer when we were running into issues regarding the sizes of the views on disk. He was quick to reply, offering several suggestions. I’d like to thank Jan and Chris for their support and encouragement.

NoSQL databases are here to stay, and CouchDB is truly unique in this area. The way it handles views, and its support for replication/synchronization set it apart from the others. There are already several large projects, like Ubuntu One, that are relying on CouchDB to deliver what nobody else can. Because of this, I’m sure CouchDB has a very bright future ahead of it.

CouchDB: Application Changes

This is part 5 in a series of posts that describe our investigation into CouchDB as a solution to several database related performance issues facing the TextMe application.

Compared to the challenges we faced with views, modifying TextMe to interact with CouchDB was very straightforward. This post describes how we changed the TextMe code in order to use CouchDB as an archive database. Since TextMe is a Ruby on Rails application, much of the content in this post references Ruby/Rails specific libraries and frameworks. However, I feel the general concepts could be applied to any development platform.

A quick recap

Before I dive into describing how we modified our application to work with CouchDB, I’d like to quickly recap exactly what we were trying to do (see the first post in this series for a more detailed overview). TextMe is a mobile marketing application. You can use it to manage SMS-powered voting campaigns, contests, subscription lists, and more. The majority of these campaigns have a pre-determined lifespan. Once the campaign is over, the data collected is primarily used to calculate statistics on the campaign. This data is very important to our customers.

A few of our database tables were getting quite large, and starting to affect the performance of the application. Tuning the queries didn’t seem to help much. So, we turned to CouchDB and its views to help us store and aggregate this large amount of data.

While a campaign is still active, we do more than simply calculate statistics on the data. For example, our contest campaign needs to ensure that a winner is properly selected. The winner could be the Nth entry, every Nth entry, the first N entries, etc, depending on how the campaign is configured. Selecting the wrong winner, or more winners than we are supposed to select would obviously be bad. So, we rely on the data integrity features provided by MySQL to help us do this correctly. However, once the campaign is over, the data is only used for statistics.

Based on these requirements, we decided to use CouchDB as an archive database. When a campaign completes, we could move the data out of MySQL into CouchDB. This would make the MySQL tables smaller and more efficient, and allow us to take advantage of CouchDB’s views to handle the statistics. But, this also meant that our application would have to interact with two databases instead of one, and for the most part, be ignorant of which database the data was coming from.

Configure the application to access CouchDB

Before our application can talk to CouchDB, we need to tell it a little bit about our CouchDB installation. The CouchRest-Rails plugin aims to make this as easy as possible for Rails applications. CouchRest-Rails provides the necessary hooks that allow you to specify your CouchDB configuration in a couchdb.yml file, which serves the same purpose as the default database.yml file used by Rails. Simply update this file with your CouchDB connection information, and you’ll be able to easily connect to CouchDB from within your application.

CouchRest-Rails also provides a series of Rake tasks that help you manage your databases and views.

Define the documents

The very first thing you need to do when moving data to CouchDB is to figure out what your documents will look like. I talked about this in CouchDB: Databases and Documents, so I won’t cover it again here.

Write code to create documents from relational database backed data objects

Once you know what the documents are going to look like, you need to write some code that will convert your RDBMS backed objects into a document, and store it in CouchDB.

We decided to use CouchRest to help us interact with CouchDB. CouchRest consists of two main parts: code to interact directly with CouchDB via a set of APIs just above CouchDB’s HTTP API (known as CouchRest Core), and code that allows you to create an object model backed by CouchDB. The ExtendedDocument class is the cornerstone of the object model code. ExtendedDocument is like ActiveRecord::Base in Rails. It serves as the base class for CouchDB backed objects. It provides convenient ways to define document properties, access views, define life cycle callbacks, create documents, save documents, destroy documents, paginate view results, and more.

A class extending ExtendedDocument simply needs to define the properties that make up its document.

class ArchivedContestCampaignEntry < ExtendedDocument
  use_database :contest_campaign_entry_archive

  property :campaign_id
  property :user_id
  property :entry_number
  property :winner
end

Then, all it takes to save a document in CouchDB is to create an instance of this class, set its properties, and call the object's create method.

Determine how data will be moved to CouchDB

Now that we have code that can convert RDBMS objects into documents, we need to figure out how to actually get those documents into CouchDB. This step will likely be dependent on how you plan on using CouchDB. For us, we decided it would be best to archive records after their corresponding campaigns have been over for 48 hours or more. So, we created a nightly cron job to fetch all non-archived campaigns that have been over for 48 hours or more, and move their corresponding entries to CouchDB. When a campaign's entries have been moved, an "archive" flag is set on the campaign itself, so the application knows to fetch its entries from CouchDB instead of MySQL.
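The selection rule used by the nightly job can be sketched in a few lines. This is a simplified illustration using a plain Struct in place of the real ActiveRecord model; the Campaign struct, its ended_at and archived fields, and the campaigns_to_archive helper are all assumptions.

```ruby
# Stand-in for the real campaign model (hypothetical fields).
Campaign = Struct.new(:id, :ended_at, :archived)

ARCHIVE_DELAY = 48 * 60 * 60 # 48 hours, in seconds

# Return the campaigns that ended at least 48 hours ago and are not yet archived.
def campaigns_to_archive(campaigns, now = Time.now)
  cutoff = now - ARCHIVE_DELAY
  campaigns.select do |campaign|
    !campaign.archived && campaign.ended_at && campaign.ended_at <= cutoff
  end
end

now = Time.utc(2009, 10, 10, 2, 0, 0)
campaigns = [
  Campaign.new(1, Time.utc(2009, 10, 1), false),     # ended long ago
  Campaign.new(2, Time.utc(2009, 10, 9, 12), false), # ended 14 hours ago
  Campaign.new(3, Time.utc(2009, 9, 1), true),       # already archived
  Campaign.new(4, nil, false)                        # still running
]
due = campaigns_to_archive(campaigns, now)
```

Only the first campaign is selected here. Once its entries have been moved, the job would set the archive flag so the campaign is skipped on the next run.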

One important item to point out is that CouchDB supports a bulk save operation. This operation allows you to save a batch of documents with a single HTTP request. This is a big time saver, as executing one HTTP request is obviously much quicker than executing several thousand. Our archive cron job takes advantage of this. When archiving entries for a particular campaign, we will build one document for each entry record, and then toss that document into an array. When that array exceeds a certain size, 2,500 in our case, a single bulk save request is sent to CouchDB with the array of documents. This dramatically decreases the number of HTTP requests sent to CouchDB, and the amount of time required to add data to CouchDB.
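The batching pattern looks roughly like the sketch below. The BulkSaver class is hypothetical, and the block passed to it stands in for whatever actually issues the bulk save request (CouchDB exposes this as its _bulk_docs endpoint). The sketch flushes as soon as the buffer reaches the batch size.

```ruby
# Buffers documents and hands them to a sender block in batches.
# Hypothetical class; the sender block stands in for the real bulk save call.
class BulkSaver
  attr_reader :requests

  def initialize(batch_size, &sender)
    @batch_size = batch_size
    @sender     = sender
    @buffer     = []
    @requests   = 0
  end

  # Add a document, flushing automatically once the batch is full.
  def <<(doc)
    @buffer << doc
    flush if @buffer.size >= @batch_size
    self
  end

  # Send whatever is buffered as a single request.
  def flush
    return if @buffer.empty?
    @sender.call(@buffer)
    @requests += 1
    @buffer = []
  end
end

sent_sizes = []
saver = BulkSaver.new(2_500) { |docs| sent_sizes << docs.size }
6_000.times { |i| saver << { 'entry_number' => i + 1 } }
saver.flush # send the final, partially filled batch
```

In this example 6,000 documents go out in three requests instead of 6,000. The explicit flush at the end matters, since the last batch will rarely be exactly full.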

In addition, our archive job will pause to rebuild all of the views in the database after 100,000 new documents have been inserted, as well as at the end of the job. The final view rebuild is necessary since all of the view queries done from within the application ask for stale data, which will not trigger a view update. We never did any research to determine if this was better or worse than waiting until the end of the job, which could produce over 500,000 new documents, to rebuild all of the views. This step was simply driven by the gut feelings of the three engineers working on the project. I'd be interested in hearing from you if you have done any research to determine if incremental view building is more or less efficient than a big bang view rebuild after a large import has completed.
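The rebuild schedule can be sketched as a simple counter. The ViewRebuildSchedule class is hypothetical, and the block passed to it is a stand-in for the actual rebuild step; in practice, querying each view without the stale option is what causes CouchDB to bring its index up to date.

```ruby
# Triggers a rebuild block every `interval` inserted documents,
# and once more at the end if anything is still unindexed.
# Hypothetical class, for illustration only.
class ViewRebuildSchedule
  attr_reader :rebuilds

  def initialize(interval, &rebuild)
    @interval   = interval
    @rebuild    = rebuild
    @since_last = 0
    @rebuilds   = 0
  end

  # Call after each bulk save with the number of documents inserted.
  def record_inserted(count)
    @since_last += count
    run if @since_last >= @interval
  end

  # Call at the end of the job.
  def finish
    run if @since_last > 0
  end

  private

  def run
    @rebuild.call
    @rebuilds  += 1
    @since_last = 0
  end
end

schedule = ViewRebuildSchedule.new(100_000) { } # no-op stand-in for the rebuild
100.times { schedule.record_inserted(2_500) }   # 250,000 documents in total
schedule.finish
```

With 250,000 documents inserted in batches of 2,500, this schedule rebuilds at 100,000 and 200,000 documents, and once more at the end of the job for the remaining 50,000.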

Replacing SQL queries with CouchDB views

Next, we changed the application to support the substitution of SQL queries with CouchDB views. We did this in several steps.

Identify queries being performed on the data you want to move

The first step in replacing SQL queries with CouchDB views is identifying all of the queries being performed on the data you plan on moving to CouchDB. This took a few hours to do, but was not difficult by any means. We simply searched the code for all instances of the ActiveRecord class name and the MySQL table name for the tables with data being moved. We also tracked down all ActiveRecord associations that were made to that particular table. We then made a note of what the queries did, and how they were used.

Abstract away the query

After the queries had been identified, we moved the execution of all queries to a new class. This freed the rest of the application from having to know if the data being fetched lived in MySQL or CouchDB. The new class would make that decision, delegating to the ActiveRecord class if the data was in MySQL, or the ExtendedDocument class if the data was in CouchDB. To start off, we simply delegate to the ActiveRecord class since we have not yet implemented the CouchDB views.

class ContestCampaignEntryDelegate
  def self.find_all_by_campaign_id_and_winner(campaign_id, winner)
    # Delegate to the ActiveRecord object
    ContestCampaignEntry.find_all_by_campaign_id_and_winner(campaign_id, winner)
  end
end

Build views to replace the queries

Now that we have the complete list of queries performed on the data that we wish to archive, we can begin building the necessary CouchDB views to support those queries for archived campaigns. I wrote about CouchDB views in previous posts. See those posts for more information.

Add methods to the ExtendedDocument class to query the views

CouchRest gives you a few options when it comes to creating and accessing your views.

One option is to use the view_by method available on all classes that extend ExtendedDocument. view_by not only makes the views easily accessible via the code, but it will also take care of creating and storing the view in the database.

In its simplest form, view_by will generate the necessary map function based on the parameters you specify. This example from the CouchRest documentation shows the map function that will be generated when view_by :date is called inside a class named Post:

function(doc) {
  if (doc['couchrest-type'] == 'Post' && doc.date) {
    emit(doc.date, null);
  }
}

view_by also lets you specify compound keys (view_by :user_id, :date) and any parameters that you wish to be used when you query your view (:descending => true).

If you need to do something a little more complicated, view_by will also let you specify the map and reduce functions to use. Here's another example from the CouchRest documentation:

# view with custom map/reduce functions
# query with Post.by_tags :reduce => true
view_by :tags,
  :map =>
    "function(doc) {
      if (doc['couchrest-type'] == 'Post' && doc.tags) {
        doc.tags.forEach(function(tag) {
          // emit the tag itself; doc.tag is undefined
          emit(tag, 1);
        });
      }
    }",
  :reduce =>
    "function(keys, values, rereduce) {
      return sum(values);
    }"

Another option for creating and accessing views is to use CouchRest Core. CouchRest Core, as described above, is a raw, close to the metal set of APIs that let you interact with CouchDB. These APIs let you do basically anything, including creating and accessing views. This example from the CouchRest documentation shows how you can create and query a view using CouchRest Core:

@db = CouchRest.database!("http://127.0.0.1:5984/couchrest-test")
@db.save_doc({
  "_id" => "_design/first",
  :views => {
    :test => {
      :map =>
        "function(doc) {
          for (var w in doc) {
            if (!w.match(/^_/)) emit(w, doc[w]);
          }
        }"
    }
  }
})
puts @db.view('first/test')['rows'].inspect

For accessing our views, we decided to go with a hybrid approach. We didn't really feel comfortable storing our map and reduce functions inside the application code. Doing so made it less clear on how we could gracefully introduce new views or update existing views in production, keeping in mind that some of these views could take hours or days to be built for the first time. Instead, we stored our map and reduce code outside of the application, and used CouchRest-Rails to help us get those views into the database. This allows us to push new or updated views independent of the application, giving us time to build the views before anything tries to access them.

Since the views are already in the database, we decided to use CouchRest Core to access them. We created a class called ArchivedRecord to make working with CouchRest Core a little easier. ArchivedRecord contains methods that do type conversions, manage bulk save operations, incrementally regenerate the views, and more. It also contains a series of methods that help with executing views with similar behavior. For example, there are methods that will simply return the number of rows returned by a view, execute a view for a specific timeframe using the dates stored in the documents, etc. These abstractions also handle any errors that could pop up when accessing a view. Our application code uses the abstractions provided by ArchivedRecord to access the views.
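To give a feel for the kind of helper this involves, here is a small sketch of a wrapper that runs a view query, pulls out the rows, and falls back to an empty result if the query raises. The ViewHelpers module and its method names are made up for this example and are not the actual ArchivedRecord code. The only thing taken from CouchDB itself is that a view response carries its results under the 'rows' key.

```ruby
# Hypothetical helpers in the spirit of the abstractions described above.
module ViewHelpers
  # Runs the block that queries a view and returns its rows,
  # or an empty array if the query raises an error.
  def self.rows_for
    result = yield
    result.fetch('rows', [])
  rescue StandardError => e
    warn "view query failed: #{e.message}"
    []
  end

  # Returns only the number of rows the view produced.
  def self.row_count(&query)
    rows_for(&query).size
  end
end

# A canned response stands in for a real view query here.
count = ViewHelpers.row_count do
  { 'rows' => [{ 'key' => 1, 'value' => 1 }, { 'key' => 2, 'value' => 1 }] }
end
```

Whether swallowing the error and returning an empty result is appropriate depends on the page being rendered; for a statistics page it may be preferable to a hard failure.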

Change the delegate class to call the ExtendedDocument class for archived data

Now that our views can be accessed via the application code, we can modify the delegate class to call the ExtendedDocument object's query method to fetch data for campaigns that have been archived.

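class ContestCampaignEntryDelegate
  def self.find_all_by_campaign_id_and_winner(campaign_id, winner)
    # find raises ActiveRecord::RecordNotFound for an unknown id,
    # instead of returning nil and failing on the archived? call
    campaign = ContestCampaign.find(campaign_id)
    if campaign.archived?
      # Archived campaigns live in CouchDB
      ArchivedContestCampaignEntry.find_all_by_campaign_id_and_winner(campaign_id, winner)
    else
      # Active campaigns are still in MySQL
      ContestCampaignEntry.find_all_by_campaign_id_and_winner(campaign_id, winner)
    end
  end
end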

Deal with the ActiveRecord associations

The last piece of the puzzle is to deal with the ActiveRecord associations. ActiveRecord associations are magic little things that make a record's associated data accessible via methods on an instance of the ActiveRecord class. For example, if I wanted to declare an association between a contest and its entries, I would simply declare the following at the top of my ContestCampaign class:

has_many :contest_campaign_entries

ActiveRecord takes care of joining the contest_campaign_entries table with the contest_campaigns table, and making all of the related campaign entries available via a call to some_contest_instance.contest_campaign_entries.

This will not work for us, as the contest_campaign_entries table will not contain any data for archived contests. So, we need to handle associations differently.

Instead of using the above code to create the association, we use the following:

has_many :active_contest_campaign_entries, 
  :foreign_key => 'contest_campaign_id', 
  :class_name => 'ContestCampaignEntry'

This more verbose version tells ActiveRecord that we want to set up an association, named active_contest_campaign_entries, on the class. Since the association name no longer matches the name of the associated class, ActiveRecord cannot infer that class on its own. So, we name it explicitly with :class_name, and spell out the foreign key to use as well.

To keep from breaking the existing code that uses the contest_campaign_entries method to obtain related entry data for a contest, we define a new method on the class with that name to fetch the associated data. The new method simply calls the corresponding method on the delegate class, which will pull the associated entries from MySQL or CouchDB, depending on if the campaign has been archived.

class ContestCampaign
  def contest_campaign_entries
    ContestCampaignEntryDelegate.contest_campaign_entries(self.id)
  end
end

class ContestCampaignEntryDelegate
  def self.contest_campaign_entries(campaign_id)
    campaign = ContestCampaign.find(campaign_id)
    if campaign.archived?
      ArchivedContestCampaignEntry.find_all_entries(campaign_id)
    else
      campaign.active_contest_campaign_entries
    end
  end
end

ActiveRecord supports other associations besides has_many. These other associations also add methods to the class that will fetch associated data from the database. In our case, some of this associated data is going to remain in MySQL. CouchRest will not (and should not) automatically fetch the corresponding data from MySQL, so we needed to handle this ourselves.

In our documents, we store the ids of the associated data still in MySQL (see campaign_id and user_id in the class below). Because we have associations set up between the ContestCampaignEntry class and the ContestCampaign and User classes, ActiveRecord adds methods named campaign and user to ContestCampaignEntry that will fetch the associated objects. We need to do the same in our ExtendedDocument class.

class ArchivedContestCampaignEntry < ExtendedDocument
  use_database :contest_campaign_entry_archive

  property :campaign_id
  property :user_id
  property :entry_number
  property :winner

  def user
    @user ||= User.find_by_id(user_id)
  end

  def campaign
    @campaign ||= ContestCampaign.find_by_id(campaign_id)
  end
end

The user and campaign methods in the class above will take the ids stored in the document and fetch their corresponding objects from MySQL. In our case, these values will never change for an archived entry, so we hold on to the objects as instance variables to avoid doing additional queries when they are referenced again.

Make the ExtendedDocument class act like the ActiveRecord class

As I stated above, one of the goals was to make it so the application code does not need to know which database the data is coming from. Since the data can be returned as instances of two different classes, ContestCampaignEntry or ArchivedContestCampaignEntry, we need to make sure that both of these classes implement the same methods, and behave the same way. Failing to do so could result in hard to find bugs, or straight up exceptions.
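One cheap way to guard against this kind of drift is a test that checks both kinds of object against an explicit list of the methods the application relies on. The sketch below uses Struct stand-ins instead of the real classes, and the method list, struct names, and missing_methods helper are assumptions for illustration.

```ruby
# Methods the application code expects every entry object to answer.
# Hypothetical list, for illustration only.
REQUIRED_ENTRY_METHODS = [:campaign, :user, :entry_number, :winner, :winner?]

# Returns the required methods that the given object does not respond to.
def missing_methods(object, required = REQUIRED_ENTRY_METHODS)
  required.reject { |name| object.respond_to?(name) }
end

# Stand-in for the ActiveRecord backed class, including the "?" accessor.
ActiveEntryStandIn = Struct.new(:campaign, :user, :entry_number, :winner) do
  def winner?
    !!winner
  end
end

# Stand-in for the CouchDB backed class, which lacks the "?" accessor.
ArchivedEntryStandIn = Struct.new(:campaign, :user, :entry_number, :winner)

gaps = missing_methods(ArchivedEntryStandIn.new)
```

Checking instances with respond_to? rather than comparing class method lists is deliberate, since ActiveRecord defines many of its accessors dynamically.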

One group of methods to pay extra attention to are the convenience methods that ActiveRecord adds to the class based on the attribute types in the database. An example of this is the attribute? accessor method that is added for boolean attributes. All attributes get an accessor named after the column in the database, but boolean attributes get an additional accessor, containing a "?" at the end. I personally use the "?" variation of the accessor method all of the time, as I feel it makes the code easier to read and understand.

CouchRest, on the other hand, is not able to determine in advance the data types of the properties you have stored, since CouchDB is a schema-less database. So, it is not able to do anything special for properties of a given type unless you specifically tell it what the type is. CouchRest does allow you to specify a type when you declare the property, but the current release (version 0.32) only uses this to cast property values into their proper type after they are fetched from the database. I've submitted a patch that will generate "?" accessor methods for properties with a type specified as :boolean. However, this is just one example of how your ExtendedDocument class could be subtly different from the corresponding ActiveRecord class.
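The idea behind generating a "?" accessor can be shown with a few lines of plain Ruby. This is a standalone sketch, not CouchRest's implementation and not the patch itself: the SimpleProperties module and the example class are hypothetical, and the :type => :boolean option simply mirrors the description above.

```ruby
# A tiny property macro that adds a "?" accessor for boolean properties.
# Hypothetical module, for illustration only.
module SimpleProperties
  def property(name, options = {})
    attr_accessor name
    if options[:type] == :boolean
      # winner? returns true or false, never nil
      define_method("#{name}?") { !!send(name) }
    end
  end
end

class ArchivedEntryExample
  extend SimpleProperties

  property :entry_number
  property :winner, :type => :boolean
end

entry = ArchivedEntryExample.new
entry.winner = true
entry.winner? # the boolean style accessor now exists
```

Properties declared without a type get only the plain accessor, which matches the ActiveRecord behavior described above for non-boolean columns.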

Summary

As I stated at the beginning of this post, changing the application to work with CouchDB was much more straightforward than getting the views to work as expected. Perhaps this is because I'm a developer, and not a DBA. But, great libraries like CouchRest and CouchRest-Rails certainly go a long way in helping to write clear and concise code that interacts with CouchDB. I can only hope that other programming languages have, or soon will have, libraries like these. The fact that CouchDB has a great API built on a protocol that everybody can talk, HTTP, certainly makes it possible.

Paginating Records in CouchDB via CouchRest

Update: This change has been incorporated into CouchRest version 0.30

When I began looking into replacing some of TextMe’s large MySQL tables with CouchDB databases, one of the things I noticed right away was that pagination support was not quite there in CouchRest. I say “not quite there” because CouchRest does have the ability to fetch data from the database in paginated chunks, but the current support didn’t really fit too well with the way the rest of the library interacts with CouchDB views. A helper class had to be used to fetch the data, and the data came back as a hash instead of an instance of the appropriate class.

Pagination is a must for us, because these tables in particular are very large. That’s one of the main reasons why we’re moving them to CouchDB in the first place. Loading all of the data into memory at once would be troublesome to say the least.

CouchRest is still a very young library, currently on version 0.29. However, despite its age, it is already fully featured and off to a great start. So, I saw this as an opportunity to contribute to something that we have already greatly benefited from.

With a little inspiration from Rails, I decided to implement a proxy that would be created when a view was called to fetch data. The proxy would defer getting data from the database until that data was actually needed. I then implemented will_paginate style paginate and paginated_each methods on the proxy object. If either of these methods is called, only a chunk of data will be fetched from the database, and that data will be returned as an array of instances of the appropriate class. If any other method is called on the proxy, the proxy will fetch all of the data from the view, and forward the call on to the “real” array.
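The proxy pattern described here can be sketched in plain Ruby. This is a simplified illustration, not the CouchRest code: the LazyCollection class is hypothetical, the loader block stands in for the view query, and the limit and skip options are an assumption about how a page is requested.

```ruby
# Defers loading until needed; paginate fetches only one page.
# Hypothetical class; the loader block stands in for a view query.
class LazyCollection
  def initialize(&loader)
    @loader = loader
    @all    = nil
  end

  # Fetch a single page without loading the full result set.
  def paginate(options = {})
    page     = [options.fetch(:page, 1).to_i, 1].max
    per_page = options.fetch(:per_page, 30).to_i
    @loader.call(:limit => per_page, :skip => (page - 1) * per_page)
  end

  # Yield every item, fetching one page at a time.
  def paginated_each(options = {}, &block)
    per_page = options.fetch(:per_page, 30).to_i
    page = 1
    loop do
      rows = paginate(options.merge(:page => page))
      rows.each(&block)
      break if rows.size < per_page
      page += 1
    end
  end

  # Any other Array method loads everything and is forwarded.
  def method_missing(name, *args, &block)
    target = load_all
    target.respond_to?(name) ? target.send(name, *args, &block) : super
  end

  def respond_to_missing?(name, include_private = false)
    [].respond_to?(name, include_private) || super
  end

  private

  def load_all
    @all ||= @loader.call({})
  end
end

ALL_ARTICLES = (1..10).to_a # stand-in for the rows a view would return
articles = LazyCollection.new do |opts|
  ALL_ARTICLES[opts.fetch(:skip, 0), opts.fetch(:limit, ALL_ARTICLES.size)] || []
end
second_page = articles.paginate(:page => 2, :per_page => 3)
```

Creating the proxy does not touch the loader at all. Only paginate, paginated_each, or a forwarded Array method such as size causes any data to be fetched.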

I decided to go with will_paginate style methods because the will_paginate gem is by far the most popular pagination solution for Rails. We use it extensively in TextMe. So, implementing the same methods would ensure that we could continue to use our existing pagination code, and the code wouldn’t have to know if it was dealing with a collection of ActiveRecord objects or a collection of CouchRest ExtendedDocument objects.

The new code also adds some methods to the class itself that let you paginate over instances of the class without having an instance of the proxy, or a view in your CouchRest ExtendedDocument object.

Here are some examples, pulled from the CouchRest tests:

Paginating using instance methods:

articles = Article.by_date :key => Date.today
articles.paginate(:page => 1, :per_page => 3).size.should == 3

articles = Article.by_date :key => Date.today
articles.paginated_each(:per_page => 3) do |a|
  a.should_not be_nil
end

Paginating via class methods:

articles = Article.paginate(:design_doc => 'Article', 
  :view_name => 'by_date', :per_page => 3, :descending => true, 
  :key => Date.today, :include_docs => true)
articles.size.should == 3

options = { :design_doc => 'Article', :view_name => 'by_date',
  :per_page => 3, :page => 1, :descending => true, 
  :key => Date.today, :include_docs => true }
Article.paginated_each(options) do |a|
  a.should_not be_nil
end 

Currently, the forked version of CouchRest containing this feature can be found on GitHub, at http://github.com/jwood/couchrest/tree/master. I’ve submitted a request to have this pulled into the main CouchRest repository.

Hopefully this will be helpful to others.