CouchDB: Databases and Documents

This is part 2 in a series of posts that describe our investigation into CouchDB as a solution to several database related performance issues facing the TextMe application.

<< Part 1: A Case Study Part 3: Views – The Advantages >>

CouchDB is a document oriented database. A document oriented database stores information as documents of related data. All of the data within a document is self contained, and does not rely on data in other documents within the database. This can be quite a shift if you’re used to working with a relational database, where data is broken up in to multiple rows existing in multiple tables, limiting (or eliminating) the duplication of data. Although radically different, the document oriented approach is a very good fit for many applications. For some applications, data integrity is not the primary concern. Such applications can work just fine without the restrictions provided by a relational database, which were designed to preserve data integrity. Instead, giving up these restrictions lets document oriented databases provide functionality that is difficult, if not impossible to provide with a relational database. For example, it is trivial to setup a cluster of document oriented databases, making it easier to deal with certain scalability and fault tolerance issues. Such clusters can theoretically provide you with limitless disk space and processing power. This is the primary reason why document oriented databases (or key/value pair databases) are becoming the standard for data storage in the cloud.

There are plenty of articles on the web describing the benefits of using a document oriented or key/value pair database, so I won’t re-hash any of that information here.

Databases

Creating a new database in CouchDB is a simple process, with no overhead. In fact, it’s as simple as issuing a single HTTP request.

curl -X PUT http://127.0.0.1:5984/my_database

There appears to be no penalty for hosting many databases within the same CouchDB server, as opposed to storing all of the documents within a single database. We took advantage of this when migrating data into CouchDB. Three very large tables were the focus of this migration, each containing between 3 to 50 million rows. We decided to store the data from each table in its own database. The data within these tables are completely unrelated, so we would never need to view data from one database combined with another. If they were related, we would have combined the tables into a single database, as CouchDB cannot create views across multiple databases. Also, storing each set of data in its own database provides additional flexibility. During the migration process, there were several points where we changed the structure of the documents. Having the ability to easily delete the affected database and re-populate it, without affecting any other document types, came in handy. With multiple databases, we have the flexibility to change the replication schedule for one database to be different from the others. It also makes it easy to move one or more databases to another server, should we ever choose to do so.

My colleague Dave made a few changes to CouchRest Rails to support connecting to multiple CouchDB databases within a single rails application. You simply specify the database server location in the configuration files, and then each model object can specify which database it is using.

Documents

CouchDB documents are very flexible. Documents are stored in JSON format, allowing you to take advantage of JSON arrays and dictionaries to represent collections of data. There is no external force dictating how a document should be structured, or what it should contain (as long as the document is valid JSON). Below is an example of what a document may look like for a blog post.

{
   "_id": "CouchDB: Databases and Documents",
   "_rev": "1-704787893",
   "author": "John Wood",
   "email": "john_p_wood",
   "post": "CouchDB is a documented oriented database.  A document...",
   "tags": ["couchdb", "couchdb case study", "json"],
   "comments": [
      {
         "email": "joe@somewhere.com",
         "comment": "Thanks for the information"
      },
      {
         "email": "kevin@xyz.com",
         "comment": "CouchDB sounds pretty interesting"
      }
   ]
}

Schema-less

Probably the best part about the document oriented approach is the ability to make each document different from the next. There is no schema to enforce that a document contains specific information. This makes CouchDB a great fit if your application needs to store data that can be wildly different between objects of the same type. In a relational database, this is usually handled by serializing the data in some format, writing the serialized data to the database, and de-serializing the data when it is read by the application. However, this is really nothing more than a hack. Querying the data in such columns can be a nightmare. And, the serialization/de-serialization process is just one more thing that can go wrong. In a document oriented database, there is no need for such a hack. You simply code your CouchDB views to account for the fact that certain fields may not be in the document, and act accordingly (either defaulting to some value, or simply move onto the next document in the database).

Self contained

The most important thing to remember about documents is that they are self contained. All of the data representing a particular concept is right there in the document. (This is a bit of a fabrication, as it is completely possible to establish relationships between documents by having one document store the unique id of another document. However, these links are not directly supported by CouchDB, and can be easily broken.) So if you are moving from a relational database to CouchDB, you should de-normalize your data as much as possible while defining the structure of your document. JSON arrays and dictionaries can help tremendously when de-normalizing relationships. If there is only one piece of information from the relationship worth storing in the document, then an array works great (see the “tags” property above). For relationships with more complex data structures, an array of dictionaries fits the bill quite nicely (see the “comments” property above).

The document id

Another important point to consider when designing your document structure is defining what you will use as the id of the document. The id must be unique not only in that database, but all instances of the same database if you happen to be running inside a database cluster. CouchDB uses document ids to replicate changes between servers. Auto-generated sequential keys are a poor fit for this. While wildly popular in relational databases, auto-generated sequential keys throw a wrench into the gears of the replication process. If each database in the cluster was responsible for generating its own sequential ids, it is highly likely that different documents on different servers could be assigned the same id, which would make CouchDB think that two distinct documents are the same document. Badness would surly ensue.

Instead, it is recommended that you use the data’s natural key as the id of your document. The natural key is some field, or combination of fields, in your document that uniquely identifies that document. In the example above, the title of the blog post is a good fit for a natural key. It is not very likely that I will be writing posts with the same title. If you happen to enjoy writing about the same stuff over and over, perhaps the title of the post combined with the date and time it was created would be a better fit. Either way, the id should be composed from data within the document.

If you do not provide an id, CouchDB will provide one for you. CouchDB uses an algorithm that makes it virtually impossible for multiple CouchDB instances to generate the same id. However, I have read articles on the web indicating that this is a very slow operation, so you may want to avoid letting CouchDB generate an id for you. Regardless, natural keys pulled straight from data within your document always make better ids, as they are easier to read, and more identifiable.

For a few of our documents, we used the sequential key generated by MySQL as the id :) I know how stupid this sounds, given the last few paragraphs. However, I think this was the best choice for an id in our case. The data contained in these documents are basically a collection of ids to rows that exist, and will remain in MySQL. None of the data within the document would be any more readable than the MySQL id. Also, since all of the keys were originally generated in a single MySQL database, they are guaranteed to be unique. As of right now, we always plan on creating the data in MySQL, and then “archiving” it to CouchDB at a later date, so this approach should continue to work just fine.

Supporting existing functionality

If you are migrating data from a relational database to CouchDB, there is another important item to think about. If your application needs to interact with CouchDB in the same way that it did with the relational database, then you need to make sure that the CouchDB views you build will be able to replace any SQL queries that are done against that data in the relational database. In order to make this happen, the CouchDB document will need to contain all of the necessary information for you to build views to replace those queries, if you intend on supporting the same functionality. Remember, there are no JOINs in CouchDB.

Summary

I think the way CouchDB handles databases and documents is very straight forward. Once you get used to the idea that there could be multiple instances of databases in a cluster, and that documents should be self contained, the rest is cake. The schema-less approach has the potential to open a lot of doors. I know that we’re already making plans to take advantage of it.

Be Sociable, Share!

    4 thoughts on “CouchDB: Databases and Documents

    1. Pingback: Linktipps #5 :: Blackflash

    2. This was great. I’ve been reading/learning more about couchdb and wondering how to get my mind out the relational structure and to a document structure. The constraints indicated that each document is self-contained and should have all the data it needs helped give me some insight. Great write-up and I’m looking to finish the additional sections.

    3. Excellent overview even for someone who did not know nothing about CouchDB before. Thank you, John, for this gem.

    Leave a Reply

    Your email address will not be published. Required fields are marked *