Jun 30, 2009
John Wood

CouchDB: Databases and Documents

This is part 2 in a series of posts that describe our investigation into CouchDB as a solution to several database-related performance issues facing the TextMe application.

<< Part 1: A Case Study Part 3: Views – The Advantages >>

CouchDB is a document oriented database. A document oriented database stores information as documents of related data. All of the data within a document is self contained, and does not rely on data in other documents within the database. This can be quite a shift if you’re used to working with a relational database, where data is broken up into multiple rows existing in multiple tables, limiting (or eliminating) the duplication of data.

Although radically different, the document oriented approach is a very good fit for many applications. For some applications, data integrity is not the primary concern. Such applications can work just fine without the restrictions imposed by a relational database, which were designed to preserve data integrity. Giving up these restrictions lets document oriented databases provide functionality that is difficult, if not impossible, to provide with a relational database. For example, it is trivial to set up a cluster of document oriented databases, making it easier to deal with certain scalability and fault tolerance issues. Such clusters can theoretically provide you with limitless disk space and processing power. This is the primary reason why document oriented databases (or key/value pair databases) are becoming the standard for data storage in the cloud.

There are plenty of articles on the web describing the benefits of using a document oriented or key/value pair database, so I won’t re-hash any of that information here.

Databases

Creating a new database in CouchDB is a simple process, with no overhead. In fact, it’s as simple as issuing a single HTTP request.

curl -X PUT http://127.0.0.1:5984/my_database

There appears to be no penalty for hosting many databases within the same CouchDB server, as opposed to storing all of the documents within a single database. We took advantage of this when migrating data into CouchDB. Three very large tables were the focus of this migration, each containing between 3 and 50 million rows. We decided to store the data from each table in its own database. The data within these tables are completely unrelated, so we would never need to view data from one database combined with another. If they were related, we would have combined the tables into a single database, as CouchDB cannot create views across multiple databases. Storing each set of data in its own database also provides additional flexibility. During the migration process, there were several points where we changed the structure of the documents. Having the ability to easily delete the affected database and re-populate it, without affecting any other document types, came in handy. With multiple databases, we can give one database a different replication schedule from the others. It also makes it easy to move one or more databases to another server, should we ever choose to do so.

My colleague Dave made a few changes to CouchRest Rails to support connecting to multiple CouchDB databases within a single Rails application. You simply specify the database server location in the configuration files, and then each model object can specify which database it is using.

Documents

CouchDB documents are very flexible. Documents are stored in JSON format, allowing you to take advantage of JSON arrays and dictionaries to represent collections of data. There is no external force dictating how a document should be structured, or what it should contain (as long as the document is valid JSON). Below is an example of what a document may look like for a blog post.

{
   "_id": "CouchDB: Databases and Documents",
   "_rev": "1-704787893",
   "author": "John Wood",
   "email": "john_p_wood",
   "post": "CouchDB is a document oriented database.  A document...",
   "tags": ["couchdb", "couchdb case study", "json"],
   "comments": [
      {
         "email": "joe@somewhere.com",
         "comment": "Thanks for the information"
      },
      {
         "email": "kevin@xyz.com",
         "comment": "CouchDB sounds pretty interesting"
      }
   ]
}

Schema-less

Probably the best part about the document oriented approach is the ability to make each document different from the next. There is no schema to enforce that a document contains specific information. This makes CouchDB a great fit if your application needs to store data that can be wildly different between objects of the same type. In a relational database, this is usually handled by serializing the data in some format, writing the serialized data to the database, and de-serializing the data when it is read by the application. However, this is really nothing more than a hack. Querying the data in such columns can be a nightmare, and the serialization/de-serialization process is just one more thing that can go wrong. In a document oriented database, there is no need for such a hack. You simply code your CouchDB views to account for the fact that certain fields may not be in the document, and act accordingly (either defaulting to some value, or simply moving on to the next document in the database).
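To make that last point concrete, here is a minimal sketch of a map function that copes with missing fields. The `emit` function below is a local stand-in for the one CouchDB's view server provides, and the three sample documents are made up for illustration. In a real database the map function would live in a design document.

```javascript
// Local stand-in for CouchDB's emit(); it just collects the emitted rows.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function: index posts by tag, tolerating documents that lack fields.
function mapByTag(doc) {
  if (!doc.tags) {
    return; // no tags at all: move on to the next document
  }
  var author = doc.author || "unknown"; // default when the field is absent
  doc.tags.forEach(function (tag) {
    emit(tag, author);
  });
}

// Three hypothetical documents "of the same type" with different shapes.
[
  { _id: "a", author: "John Wood", tags: ["couchdb", "json"] },
  { _id: "b", tags: ["couchdb"] }, // no author
  { _id: "c", author: "Dave" }     // no tags
].forEach(mapByTag);

console.log(rows);
```

Document "b" still shows up in the view with a default author, while document "c" is skipped entirely. Neither case needs a schema change.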

Self contained

The most important thing to remember about documents is that they are self contained. All of the data representing a particular concept is right there in the document. (This is a bit of a simplification, as it is completely possible to establish relationships between documents by having one document store the unique id of another document. However, these links are not directly supported by CouchDB, and can be easily broken.) So if you are moving from a relational database to CouchDB, you should de-normalize your data as much as possible while defining the structure of your document. JSON arrays and dictionaries can help tremendously when de-normalizing relationships. If there is only one piece of information from the relationship worth storing in the document, then an array works great (see the “tags” property above). For relationships with more complex data structures, an array of dictionaries fits the bill quite nicely (see the “comments” property above).
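As a rough illustration of that de-normalization step, the sketch below folds rows from three relational tables into a single document. The table layout, column names, and the `buildPostDocument` helper are all assumptions made up for this example. They are not taken from our actual schema.

```javascript
// Hypothetical rows, as they might come out of three relational tables.
const postRow = { id: 7, title: "CouchDB: Databases and Documents", author: "John Wood" };
const tagRows = [
  { post_id: 7, name: "couchdb" },
  { post_id: 7, name: "json" }
];
const commentRows = [
  { post_id: 7, email: "joe@somewhere.com", comment: "Thanks for the information" },
  { post_id: 8, email: "kevin@xyz.com", comment: "Belongs to a different post" }
];

// Fold the related rows into one self-contained document.
function buildPostDocument(post, tags, comments) {
  return {
    _id: post.title,
    author: post.author,
    // Only one value per related row is worth keeping: a plain array will do.
    tags: tags
      .filter(function (t) { return t.post_id === post.id; })
      .map(function (t) { return t.name; }),
    // Several values per related row: an array of dictionaries.
    comments: comments
      .filter(function (c) { return c.post_id === post.id; })
      .map(function (c) { return { email: c.email, comment: c.comment }; })
  };
}

const postDoc = buildPostDocument(postRow, tagRows, commentRows);
console.log(JSON.stringify(postDoc, null, 2));
```

The foreign keys disappear along the way. Once the tags and comments live inside the post, there is nothing left to join against.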

The document id

Another important point to consider when designing your document structure is defining what you will use as the id of the document. The id must be unique not only in that database, but across all instances of the same database if you happen to be running inside a database cluster. CouchDB uses document ids to replicate changes between servers. Auto-generated sequential keys are a poor fit for this. While wildly popular in relational databases, auto-generated sequential keys throw a wrench into the gears of the replication process. If each database in the cluster were responsible for generating its own sequential ids, it is highly likely that different documents on different servers would be assigned the same id, which would make CouchDB think that two distinct documents are the same document. Badness would surely ensue.

Instead, it is recommended that you use the data’s natural key as the id of your document. The natural key is some field, or combination of fields, in your document that uniquely identifies that document. In the example above, the title of the blog post is a good fit for a natural key. It is not very likely that I will be writing posts with the same title. If you happen to enjoy writing about the same stuff over and over, perhaps the title of the post combined with the date and time it was created would be a better fit. Either way, the id should be composed from data within the document.
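Here is a small sketch of composing such an id from the title and creation time. The `created_at` field, its timestamp values, and the `naturalId` helper are hypothetical. The host and database name are reused from the curl example earlier in this post.

```javascript
// Build a document id from data that is already in the document.
// created_at is a hypothetical ISO-8601 timestamp field.
function naturalId(post) {
  return post.title + " @ " + post.created_at;
}

const first = naturalId({
  title: "CouchDB: Databases and Documents",
  created_at: "2009-06-30T09:00:00Z"
});
const second = naturalId({
  title: "CouchDB: Databases and Documents",
  created_at: "2009-07-01T09:00:00Z"
});

// The id becomes part of the document's URL, so it has to be
// percent-encoded when it contains spaces, colons, and the like.
const docUrl = "http://127.0.0.1:5984/my_database/" + encodeURIComponent(first);

console.log(first);
console.log(second);
console.log(docUrl);
```

Two posts with the same title still get distinct ids, and any server in the cluster would compute the same id for the same post.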

If you do not provide an id, CouchDB will provide one for you. CouchDB uses an algorithm that makes it virtually impossible for multiple CouchDB instances to generate the same id. However, I have read articles on the web indicating that this is a very slow operation, so you may want to avoid letting CouchDB generate an id for you. Regardless, natural keys pulled straight from data within your document always make better ids, as they are easier to read, and more identifiable.

For a few of our documents, we used the sequential key generated by MySQL as the id :) I know how stupid this sounds, given the last few paragraphs. However, I think this was the best choice for an id in our case. The data contained in these documents are basically a collection of ids of rows that exist, and will remain, in MySQL. None of the data within the document would be any more readable than the MySQL id. Also, since all of the keys were originally generated in a single MySQL database, they are guaranteed to be unique. As of right now, we plan to always create the data in MySQL, and then “archive” it to CouchDB at a later date, so this approach should continue to work just fine.

Supporting existing functionality

If you are migrating data from a relational database to CouchDB, there is another important item to think about. If your application needs to interact with CouchDB in the same way that it did with the relational database, then you need to make sure that the CouchDB views you build will be able to replace any SQL queries that are run against that data in the relational database. To make this happen, the CouchDB document will need to contain all of the information necessary to build views that replace those queries. Remember, there are no JOINs in CouchDB.
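As a sketch of what replacing a query can look like, suppose the relational version joined posts to comments to find every post a given person commented on. With the comments embedded in the post document, a map function can emit one row per comment, and looking up a key takes the place of the JOIN. The `emit` function and `queryByKey` helper are local stand-ins for what CouchDB does when you query a view by key, and the sample documents are made up.

```javascript
// Local stand-in for CouchDB's emit(); it just collects the emitted rows.
const viewRows = [];
function emit(key, value) {
  viewRows.push({ key: key, value: value });
}

// Map function: one row per embedded comment, keyed by commenter email.
function mapPostsByCommenter(doc) {
  (doc.comments || []).forEach(function (c) {
    emit(c.email, doc._id);
  });
}

// Hypothetical documents with the comments already de-normalized in.
[
  { _id: "Post A", comments: [{ email: "joe@somewhere.com", comment: "Thanks" }] },
  { _id: "Post B", comments: [
      { email: "kevin@xyz.com", comment: "Interesting" },
      { email: "joe@somewhere.com", comment: "Agreed" }
  ] },
  { _id: "Post C" } // no comments field at all
].forEach(mapPostsByCommenter);

// Roughly what querying the view for a single key gives you back.
function queryByKey(key) {
  return viewRows
    .filter(function (row) { return row.key === key; })
    .map(function (row) { return row.value; });
}

console.log(queryByKey("joe@somewhere.com"));
```

This only works because the comment data is inside the post document. Had the comments been left behind in the relational database, no view could reach them.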

Summary

I think the way CouchDB handles databases and documents is very straightforward. Once you get used to the idea that there could be multiple instances of databases in a cluster, and that documents should be self contained, the rest is cake. The schema-less approach has the potential to open a lot of doors. I know that we’re already making plans to take advantage of it.

Jun 15, 2009
John Wood

CouchDB: A Case Study

This is part 1 in a series of posts that describe our investigation into CouchDB as a solution to several database-related performance issues facing the TextMe application.

Part 2: Databases and Documents >>

The wall was quickly approaching. After only a few short years, several of our database tables had over a million rows, a handful had over 10 million, and a few had over 30 million. Our queries were taking longer and longer to execute, and our migrations were taking longer and longer to run. We even had to disable a few customer facing features because the database queries required to support them were too expensive to run, and were causing other issues in the application.

The nature of our business requires us to keep most if not all of this data around and easily accessible in order to provide the level of customer support that we strive for. But it was becoming very clear that a single database to hold all of this information was not going to scale. Besides, it is common practice to have a separate reporting database that frees the application database from having to handle these expensive data queries, so we knew that we’d have to segregate the data at some point.

Being a young company with limited resources, scaling up to some super-powered server, or running the leading commercial relational database was not an option. So, we started to look into other solutions. We tried offloading certain expensive queries onto the backup database. That helped a little, but the server hosting the backup database simply didn’t have enough juice to keep up with the load. We also considered rolling up key statistics into summary tables to save us from calculating those stats over and over. However, we realized that this was only solving part of the problem. The tables would still be huge, and summary tables would only replace some of the expensive queries.

It was about this time that my colleague Dave started looking into CouchDB as a possible solution to our issues. Up until this point, I had never heard of CouchDB. CouchDB is a document oriented, schema-free database similar to Amazon’s SimpleDB and Google’s BigTable. It stores data as JSON documents and provides a powerful view engine that lets you write JavaScript code to select documents from the database, and perform calculations. A RESTful HTTP/JSON API is used to access the database. The database boasts other features as well, such as robust replication, and bi-directional conflict detection and resolution.

The view engine is what piqued our interest. Views can be rebuilt whenever we determine it is necessary, and can be configured to return stale data. “Stale data? Why would I want stale data?” you may be asking yourself. Well, one big reason comes to mind. Returning stale data is fast. When configured to return stale data, the database doesn’t have to calculate anything on the fly. It simply returns what it calculated the last time the view was built, making the query as fast as the HTTP request required to get the data. The CouchDB view engine is also very powerful. CouchDB views use a map/reduce approach to selecting documents from the database (map), and performing aggregate calculations on that data (reduce). The reduce function is optional. CouchDB supports JavaScript as the default language for the map and reduce functions. However, this is extensible, and there is support out there for writing views in several other languages.
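To give a feel for the map/reduce approach, here is a minimal sketch that counts documents per tag. The `emit` and `sum` functions are local stand-ins for helpers that CouchDB's view server supplies, and `queryGrouped` only imitates what a grouped view query returns. The sample documents, the design document name `blog`, and the view name `posts_per_tag` are made up for illustration.

```javascript
// Local stand-ins for helpers CouchDB's view server provides.
const emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}
function sum(values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// map: select documents and emit one row per tag.
function mapTags(doc) {
  (doc.tags || []).forEach(function (tag) {
    emit(tag, 1);
  });
}

// reduce: aggregate the emitted values. Summing also works when CouchDB
// re-reduces partial results, since a sum of sums is still the total.
function reduceCount(keys, values, rereduce) {
  return sum(values);
}

// Hypothetical documents.
[
  { _id: "a", tags: ["couchdb", "json"] },
  { _id: "b", tags: ["couchdb"] },
  { _id: "c" }
].forEach(mapTags);

// Imitation of a grouped query: run the reduce once per distinct key.
function queryGrouped() {
  var groups = {};
  emitted.forEach(function (row) {
    (groups[row.key] = groups[row.key] || []).push(row.value);
  });
  var result = {};
  Object.keys(groups).forEach(function (key) {
    result[key] = reduceCount(null, groups[key], false);
  });
  return result;
}

console.log(queryGrouped());

// Asking a real server for the stored results without waiting for a rebuild.
// stale=ok is the parameter name in the CouchDB releases current as of this
// writing; other versions may spell this differently.
const staleUrl =
  "http://127.0.0.1:5984/my_database/_design/blog/_view/posts_per_tag?group=true&stale=ok";
console.log(staleUrl);
```

The grouped result has one entry per tag, which is roughly what a `GROUP BY` with a `COUNT(*)` would give you in SQL. The difference is that CouchDB stores the result of building the view, which is what makes the stale read cheap.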

In our case, we are planning to use CouchDB as an archive database that we can move old data to once a night. Once the data is moved to the CouchDB database, it would no longer be updated, and would only be used for calculating statistics in the application. Since we would only be moving data into the database once a day, we only need to rebuild the views once a day. Therefore, all queries could simply ask for (and get) stale data, even when the views were in the process of being rebuilt. Also, moving all of the old data out of the relational database would dramatically reduce the size of the specific tables, improving the performance of the queries that hit those tables.

I’m really looking forward to this partial migration to CouchDB. The ability to add new views to the database without affecting existing views gives us the flexibility we need to grow the TextMe application to provide better, more specific, and more relevant statistics. In marketing, statistics are king. Since TextMe is a mobile marketing tool, we want it to be able to provide all of the data that our customers are looking for, and more. I feel that by moving to CouchDB, we will not only be able to re-activate those features that we had to disable due to database performance, but also add more features and gather more statistics that would have otherwise been impossible with our previous infrastructure.

The migration to CouchDB was not always straightforward. We faced several challenges, and learned many lessons over the past month. All of those challenges will be addressed here.

In the coming posts, I plan to talk about:

  • Structuring your CouchDB databases, and the documents within them.
  • More details about CouchDB views.
  • The application code necessary to talk to CouchDB.
  • Migrating parts of an existing application from a relational database backed by ActiveRecord to CouchDB.
  • How the CouchDB security model differs from a traditional relational database.

Stay tuned!
