CouchDB: A Case Study

This is part 1 in a series of posts describing our investigation into CouchDB as a solution to several database-related performance issues facing the TextMe application.


The wall was quickly approaching. After only a few short years, several of our database tables had over a million rows, a handful had over 10 million, and a few had over 30 million. Our queries were taking longer and longer to execute, and our migrations were taking longer and longer to run. We even had to disable a few customer-facing features because the database queries required to support them were too expensive to run, and were causing other issues in the application.

The nature of our business requires us to keep most if not all of this data around and easily accessible in order to provide the level of customer support that we strive for. But, it was becoming very clear that a single database to hold all of this information was not going to scale. Besides, it is common practice to have a separate, reporting database that frees the application database from having to handle these expensive data queries, so we knew that we’d have to segregate the data at some point.

Being a young company with limited resources, scaling up to some super-powered server, or running the leading commercial relational database was not an option. So, we started to look into other solutions. We tried offloading certain expensive queries onto the backup database. That helped a little, but the server hosting the backup database simply didn’t have enough juice to keep up with the load. We also considered rolling up key statistics into summary tables to save us from calculating those stats over and over. However, we realized that this was only solving part of the problem. The tables would still be huge, and summary tables would only replace some of the expensive queries.

It was about this time that my colleague Dave started looking into CouchDB as a possible solution to our issues. Up until this point, I had never heard of CouchDB. CouchDB is a document-oriented, schema-free database similar to Amazon’s SimpleDB and Google’s BigTable. It stores data as JSON documents and provides a powerful view engine that lets you write JavaScript code to select documents from the database and perform calculations. A RESTful HTTP/JSON API is used to access the database. CouchDB boasts other features as well, such as robust replication and bi-directional conflict detection and resolution.

The view engine is what piqued our interest. Views can be rebuilt whenever we determine it is necessary, and can be configured to return stale data. Stale data? Why would I want stale data? you may be asking yourself. Well, one big reason comes to mind: returning stale data is fast. When configured to return stale data, the database doesn’t have to calculate anything on the fly. It simply returns what it calculated the last time the view was built, making the query as fast as the HTTP request required to get the data. The CouchDB view engine is also very powerful. CouchDB views use a map/reduce approach: selecting documents from the database (map), and performing aggregate calculations on that data (reduce). The reduce function is optional. JavaScript is the default language for map and reduce functions, but this is extensible, and there is support out there for writing views in several other languages.
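To make the map/reduce idea concrete, here is a rough sketch of a view, simulated in plain JavaScript so it can run outside CouchDB. The document shape (hypothetical `type` and `carrier` fields) and the stubbed `emit()` are assumptions for illustration; in CouchDB, `emit()` is supplied by the view engine and the functions live in a design document.

```javascript
// Hypothetical view: count "message" documents per mobile carrier.
// In CouchDB these functions would live in a design document and emit()
// would be provided by the view engine; here we stub emit() so it runs.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Map: called once per document; emits a row for each matching document.
function map(doc) {
  if (doc.type === "message" && doc.carrier) {
    emit(doc.carrier, 1);
  }
}

// Reduce: aggregates the emitted values (here, a simple sum).
function reduce(keys, values, rereduce) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// Simulate the view engine over a few sample documents.
var docs = [
  { type: "message", carrier: "verizon" },
  { type: "message", carrier: "att" },
  { type: "message", carrier: "verizon" },
  { type: "user", name: "bob" } // ignored by the map function
];
docs.forEach(map);

var total = reduce(null, rows.map(function (r) { return r.value; }), false);
console.log(total); // 3
```

When querying such a view, appending `stale=ok` to the request tells CouchDB to return the last built result instead of rebuilding the index first, which is exactly the fast-but-stale behavior described above.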

In our case, we are planning to use CouchDB as an archive database that we can move old data to once a night. Once the data is moved to the CouchDB database, it would no longer be updated, and would only be used for calculating statistics in the application. Since we would only be moving data into the database once a day, we only need to rebuild the views once a day. Therefore, all queries could simply ask for (and get) stale data, even when the views were in the process of being rebuilt. Also, moving all of the old data out of the relational database would dramatically reduce the size of the specific tables, improving the performance of the queries that hit those tables.

I’m really looking forward to this partial migration to CouchDB. The ability to add new views to the database without affecting existing views gives us the flexibility we need to grow the TextMe application to provide better, more specific, and more relevant statistics. In marketing, statistics are king. Since TextMe is a mobile marketing tool, we want it to be able to provide all of the data that our customers are looking for, and more. I feel that by moving to CouchDB, we will not only be able to re-activate those features that we had to disable due to database performance, but also add more features and gather more statistics that would have otherwise been impossible with our previous infrastructure.

The migration to CouchDB was not always straightforward. We faced several challenges and learned many lessons over the past month. All of those challenges will be addressed here.

In the coming posts, I plan to talk about:

  • Structuring your CouchDB databases, and the documents within them.
  • More details about CouchDB views.
  • The application code necessary to talk to CouchDB.
  • Migrating parts of an existing application from a relational database backed by ActiveRecord to CouchDB.
  • How the CouchDB security model differs from a traditional relational database.

Stay tuned!

Strive to Limit Integration Points

Last week, I was working on a new feature of TextMe that required a call to one of our external service providers for some data. The call in particular was to look up the carrier for a given mobile number. Sounds simple enough. However, we already had code that integrated with this provider in one component of our architecture, and I needed to make this call from another component.

A couple of options jumped out at me. I could pull the code I needed into a library that could be shared between the components, or implement some form of inter-process communication that would let me invoke the service from one component and have it processed by the component that already integrated with the service provider.

Pulling the code into a library would be the easier of the two to implement for sure. Like any project of reasonable size, we were already doing this for several other shared pieces of code. Adding one more to the list would be a piece of cake. The second option would require a bit more work. The component that integrates with the service provider runs as a daemon process, so using something straightforward like HTTP to handle the interprocess communication was out of the question. Instead, I’d likely have to utilize the queuing framework that we already had in place. What makes it more difficult is that the queuing library we use only handles asynchronous calls, and this would need to be a synchronous call. Not the end of the world by any means, but without a doubt more complicated than simply sucking the code into a library.

Even though option one was easier to implement, having two components in the architecture integrate with a third party seemed like a bad idea. Sprinkling integration points throughout your application is usually a recipe for failure, largely because it is only a matter of time before an integration point fails.

If we went with option one, we could have the library handle the failures. However, even if handled properly, failures like this usually have other consequences. For example, if the service never responded, it could cause requests to back up in the given component. Even if we implemented a timeout, it is likely that the timeout would be greater than the average response time, which means our system would take longer to process each request. If you had to deal with a lot of incoming requests at the time of the failure, you could be in for a world of hurt, especially if you had multiple components suffering from this issue.

With option two, we have a bit more control over the situation. First off, we would know there was one, and only one, spot in our architecture that integrated with that particular service. This would allow us to better understand the potential impact of a failure, and the steps that needed to be taken to address it. Second, it would allow us to more easily implement a circuit breaker to prevent the failure from rippling across the system. If the circuit breaker was tripped, we could return an error, some sort of filler data, or queue the request up for processing at a later time. Third, we could potentially add resources to account for the situation. Since the work was being done in a completely different component, if it was simply a matter of increased latency on the part of our service provider, we could always spin up a few more instances of that component to account for the fact that some of the requests may be starting to back up.
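A circuit breaker along these lines does not have to be elaborate. The sketch below is a minimal, generic example (not our actual implementation): after a configurable number of consecutive failures the breaker opens and fails fast, and once a cool-down period has passed it lets a single trial call through.

```javascript
// Minimal circuit-breaker sketch. After `threshold` consecutive failures
// the breaker "opens" and calls fail fast until `resetTimeoutMs` elapses,
// at which point one trial call is allowed through (half-open).
function createBreaker(fn, threshold, resetTimeoutMs) {
  var failures = 0;
  var openedAt = null;

  return function (arg) {
    if (openedAt !== null) {
      if (Date.now() - openedAt < resetTimeoutMs) {
        throw new Error("circuit open: failing fast");
      }
      openedAt = null; // half-open: let one trial call through
    }
    try {
      var result = fn(arg);
      failures = 0; // a success closes the breaker again
      return result;
    } catch (e) {
      failures += 1;
      if (failures >= threshold) {
        openedAt = Date.now(); // trip the breaker
      }
      throw e;
    }
  };
}
```

Hypothetical usage would look like `var lookupCarrier = createBreaker(callProvider, 5, 30000);`, where `callProvider` stands in for the real service call. While the breaker is open, callers get an immediate error and can fall back to filler data or queue the request for later, as described above.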

In his fantastic book, Release It!, Michael Nygard talks about integration points, along with a host of other topics regarding the deployment and support of production software. Any developer who writes code that will eventually run in a production environment (which I hope is EVERY developer) should read this book. Regarding integration points, Michael says the following:

  • Integration points are the number-one killer of systems.
  • Every integration point will eventually fail in some way, and you need to be prepared for that failure.
  • Integration point failures take several forms, ranging from various network errors to semantic errors.
  • Failure in a remote system quickly becomes your problem, usually as a cascading failure when your code isn’t defensive enough.

However, even though integration points can be tough to work with, systems without any integration points are usually not that useful. So, integration points are a necessary evil. Our best tools for keeping them in line are defensive coding, being smart about where you place the integration points in your system, and limiting the number of integration points in the system.

With the help of my colleague Doug Barth, we (mostly Doug) whipped up a synchronous client for the Ruby AMQP library. I then used this code to implement the synchronous queuing behavior I needed to keep the integration point where it belonged. Those interested can find the code on GitHub, at http://github.com/dougbarth/amqp/tree/bg_em.

Increase Design Flexibility by Separating Object Creation From Use

I just finished reading Emergent Design, by Scott Bain. Overall, I thought it was a pretty good book that touched on some important concepts in software design. I’ve read about one particular concept covered in the book a few times before, but the value of it didn’t sink in until I read Emergent Design. This concept states that code that creates an object should be separate from code that uses the object.

Separating code that creates an object from the code that uses the object results in a much more flexible design, which is easier to change. Creating this separation is also very easy to do. By simply avoiding a call to the new operator in the “client” code for the particular object you wish to instantiate, you are able to evolve your code to adjust to a variety of changes, most of which require no changes in the code that uses the object. Let’s walk through an example.

Let’s say we have a logging class, named Logger, that we use to log messages from our application. The class is pretty simple, and looks something like this.

import java.io.FileWriter;
import java.io.IOException;

public class Logger {
    private static final String logFileName = "application.log";
    private FileWriter fileWriter;
    private Class from;
    
    public Logger(Class from) {
        this.from = from;

        try {
            fileWriter = new FileWriter(logFileName, true);
        } catch (IOException e) {
            throw new RuntimeException("Log file '" + logFileName + 
                    "' could not be opened for writing.", e);
        }
    }

    public void log(String message) {
        try {
            fileWriter.write(
                from.getCanonicalName() + ": " + message + "\n");
            fileWriter.flush();
        } catch (IOException e) {
            System.err.println("Writing to the log file failed");
            e.printStackTrace();
        }
    }
}

In our application, we would typically use the Logger class like this:

Logger logger = new Logger(MyClass.class);
logger.log("Some message");

I think this is pretty typical, and seems to be the default pattern. Create the object that you need, and then use it. Simple and straightforward. However, the simplicity comes at the price of limited flexibility. For example, what if I wanted to limit the Logger class to only having one instance? Or, what if I wanted to start logging some messages to the database, and some to the file system? By combining the code that creates the object with the code that uses the object, we’ve greatly limited the ways in which we can evolve our design without affecting existing “client” code. Sure, we can work our way out of it, but since the Logger is a very popular class used by almost every other class in the system, it will require a lot of work to change.

So, how can we avoid this? How can we effectively hide the creation of the object from the code that uses it? The very first item in Effective Java, by Joshua Bloch, is to consider static factory methods instead of constructors. Joshua suggests this for the same reasons Scott suggests separating code that creates the object from code that uses the object in Emergent Design. Instead of making your clients use the new operator to create instances of your object, provide them with a static factory method to do so.

    public static Logger getInstance(Class from) {
        return new Logger(from);
    }
    
    protected Logger(Class from) {
        this.from = from;

        try {
            fileWriter = new FileWriter(logFileName, true);
        } catch (IOException e) {
            throw new RuntimeException("Log file '" + logFileName + 
                    "' could not be opened for writing.", e);
        }
    }

Note that I changed the scope of Logger’s constructor from public to protected. This will discourage classes outside of the logging package from using it, while leaving the Logger class open for subclassing. With this new method in place, users of this class can now create an instance by doing the following.

Logger logger = Logger.getInstance(MyClass.class);
logger.log("Some message");

It seems silly to provide a method that simply calls new. But doing so adds so much flexibility to the design that Scott considers it a “practice”, or something he does every time without even thinking about it. Abandoning the constructor also opens a few doors. You are no longer required to return an instance of that specific class, giving you the freedom to return any object of that type. You don’t always have to return a new instance, allowing you to implement a cache, or a singleton. You can use this flexibility to your advantage when evolving your design. Let’s see how.

Let’s say we get a request from our accounting department to log messages from code that deals with financial transactions (conveniently located in the net.johnpwood.financial package) to the database. This sounds like the birth of a new type of logger. Because clients are not using the new operator to create new instances of the Logger class, we can easily evolve Logger into an abstract class, keeping the static getInstance() method as the factory method for the Logger class hierarchy. After we have the abstract class, we can create two new subclasses to implement the individual behavior. All of this with no change to how the client uses the logging functionality.

Because the filesystem logger and the database logger don’t have too much in common, the Logger class has been slimmed down quite a bit. What remains is the interface for the Logger subtypes, defined via the abstract log() method, and a factory method to create the proper logger, which is implemented in getInstance().

public abstract class Logger {
    
    public static Logger getInstance(Class from) {
        if (from.getCanonicalName().startsWith(
                "net.johnpwood.financial")) {
            return DatabaseLogger.getInstance(from);
        } else {
            return FilesystemLogger.getInstance(from);
        }
    }
    
    protected Logger() {}
    public abstract void log(String message);
}

We now have two distinct classes that handle logging transactions. FilesystemLogger, which contains most of the old Logger code, and DatabaseLogger. FilesystemLogger should look pretty familiar.

import java.io.FileWriter;
import java.io.IOException;

public class FilesystemLogger extends Logger {
    private static final String logFileName = "application.log";
    private FileWriter fileWriter;
    private Class from;

    public static FilesystemLogger getInstance(Class from) {
        return new FilesystemLogger(from);
    }
    
    protected FilesystemLogger(Class from) {
        this.from = from;
        
        try {
            fileWriter = new FileWriter(logFileName, true);
        } catch (IOException e) {
            throw new RuntimeException("Log file '" + logFileName + 
                    "' could not be opened for writing.", e);
        }
    }

    @Override
    public void log(String message) {
        try {
            fileWriter.write(
                from.getCanonicalName() + ": " + message + "\n");
            fileWriter.flush();
        } catch (IOException e) {
            System.err.println("Writing to the log file failed");
            e.printStackTrace();
        }
    }
}

DatabaseLogger is also pretty simple, since I didn’t bother to implement any of the hairy database code (doesn’t help to illustrate the point…and I’m lazy).

public class DatabaseLogger extends Logger {
    private Class from;
    
    public static DatabaseLogger getInstance(Class from) {
        return new DatabaseLogger(from);
    }
    
    protected DatabaseLogger(Class from) {
        this.from = from;
        establishDatabaseConnection();
    }

    @Override
    public void log(String message) {
        LoggerDataObject dataObject = 
            new LoggerDataObject(from, message);
        dataObject.save();
    }
    
    private void establishDatabaseConnection() {
        // Connect to the database
    }
}

We’ve significantly changed how the Logger works, and the client is totally oblivious to the changes. The client code continues to use the Logger as it did before, and everything just works. Pretty sweet, eh?

As you can imagine, there are many other ways you can evolve your design if you have this separation of creation and use. If we need to create a MockLogger for testing purposes, it can be created in Logger.getInstance() along with the other Logger implementations. The client would never know that it is using a mock. If we ended up creating 10 different loggers, it would be trivial to have Logger.getInstance() delegate the creation of the proper Logger instance to a factory, moving the creation logic out of the Logger class. Again, no changes to the client.

Separating creation from use also allows you to easily evolve your class into a singleton (or any other pattern that controls the number of instances created). This doesn’t make much sense for Logger, since each unique Logger instance contains state. However, it does make sense for some classes. Evolving your class into a singleton simply requires a static instance variable on the class containing the instance of the singleton object, and an implementation of getInstance() that returns the singleton instance. If clients have already been using the getInstance() method to get an instance of the class, then no change would be required on their end. Here’s an example:

public class SomeOtherClass {
    private static SomeOtherClass instance = new SomeOtherClass();
    
    public static SomeOtherClass getInstance() {
        return instance;
    }
    
    private SomeOtherClass() {}
}

It is worth pointing out that static factory methods are not the only way to achieve this separation. Dependency injection frameworks like Spring and Guice do all of this for you. They take on the responsibility of creating the objects and getting the instances to the code that uses them. If you are a disciplined developer, and never “cheat” by instantiating the objects directly, then all of the same benefits outlined above apply when using a dependency injection framework.

Like everything in life, there are cons that go along with the pros. Separating the code that creates an object from the code that uses the object is not the default pattern. It is not the norm. It will take time for you and your co-workers to get comfortable with it. API documentation tools don’t “call out” static factory methods like they do constructors, which could affect anybody using your library. Dependency injection frameworks take the creation of objects completely out of your code, moving it to some magical, mysterious land where things just happen, somehow. This also can take some time to get used to, especially for those new to the concept.

However, I feel that the benefits of separating creation from use far outweigh the drawbacks.

In our field, change is a constant. As a profession, we’re gradually learning to stop fighting change, and to start accepting it. This means designing for change. Doing so makes everybody’s life easier, from the customer to the developer. Separating creation from use is one, quick way we can increase the flexibility of our design, with very little up front cost.

Thanks to Mahesh Murthy for reviewing this post.

Build Your Own Sandbox Application

Sandboxes are fun. A simple box full of sand, a bucket, and a shovel somehow opens the imagination like nothing else. You can build anything you want, keep it around for a while if you like it, add to it, subtract from it, or crush it in Godzilla like fashion if you so choose. You can experiment with your creation in any way imaginable, without consequence. Creating things in such a carefree environment can be refreshing, and rewarding.

I think that it is a great idea for developers to have such an environment for themselves where they can try out new technologies, techniques, or processes. Setting up a sandbox nowadays is pretty easy. Machines are cheap, or free if you’re not picky and keep your eyes and ears open. That 4 year old PC that your cousin is throwing away because he got a brand new Dell for Christmas will fit the bill just fine. Linux is free, runs great on older computers, and has the power and flexibility to host virtually any application. Sign up for a free DNS service such as DynDNS and poof, you’ve got yourself a little server that you can reach over the Internet. My sandbox is a dusty old dual Pentium III with 512 MB of RAM, which is running Ubuntu Linux. It fits the bill quite nicely if you ask me.

But a sandbox is only half the equation. For developers, we need an application that we can play with in the sandbox. Something we can poke and prod. A sandbox application, so to speak. There are several reasons why you may want to create a sandbox application.

Platform for playing with new technologies

Perhaps the biggest reason for creating and maintaining your own sandbox application is that it can serve as a platform for trying out new technologies. Doing this at work can be tough. Your boss or client may not be thrilled to hear that you completely re-wrote part of the application to use the bleeding edge release of some hot new framework because you “thought it was cool”. But, there’s nothing stopping you from doing it with your own application. Even if you rely heavily on your application, there’s no reason why you can’t fork your code, and give the new technology a try on a separate branch. If it works out you can merge the code into the main branch, and if not you can always abandon the changes. Now, this approach will only work if you have your application under source control (which you should). If you don’t mind giving the world access to your code, GitHub will host your code for free. Otherwise, it’s very easy to set up your own source control system in your sandbox.

Material to blog about

Trying out new technologies, techniques, or processes can also give you plenty of material to blog about. If the technology/technique/process you are tinkering with is hot, it is very likely that many people will be interested in reading about your experience. If you blog frequently about topics that people are interested in, you’ll steadily increase readership. This could be very good for your career, as well known programmers generally don’t have a hard time finding work. Work usually finds them.

Create something that you will use

What’s the point in going through all of this trouble to create something if you never use it? I’ve created a few applications that I use on a regular basis. Not only did these applications address a need I was facing at the time, but having a sandbox application that you actually use means you will be more likely to maintain and enhance it.

Looks great on a resume

Employers love to hire people who show an interest in their field outside of work. I’ve found that people passionate about their field are usually better at their jobs than those who show up at 9, work on the same ol’ stuff for 8 hours, and go home. Having a sandbox application shows people that you love what you do so much that you have dedicated time outside of work to create something that you care about. This especially holds true if you’ve put serious thought into your application, and are excited to show it off to anybody who asks to see it.

Release it, and potentially help others

One of my colleagues once said,

Your parents lied to you. You are not special. There’s millions of people out there just like you.

He wasn’t trying to be mean, this time :) He was simply pointing out that you are not alone. If you are facing a problem, odds are there are hundreds or thousands (or more) of people out there who are facing that same problem. Releasing your application could be helping all of these people. Perhaps it would help them so much that they would be willing to pay for it. Wouldn’t that be nice?

Open source your application

Open sourcing your application can be great for several reasons. Perhaps you’ve just finished migrating your application to the latest version of some framework. Not only can you blog about your experience, but you can share the code with others so that they can see exactly how you did it. This could potentially help others looking to migrate their applications to the same framework. You could get feedback from the community about something you could be doing better, and learn something new. People looking at your code could spot a bug, giving you the opportunity to fix it before it affects you (especially if you use your application). Some employers ask to see source code samples as part of the interview process. What sort of reaction do you think you would get if you immediately spit out several repositories that you owned on GitHub for the prospective employer to browse at their leisure? I’m not sure about you, but I’d be pretty impressed.

However, there is a potential drawback in releasing your code to the world. Ironically, it’s the same as one of the benefits. The world can see your code. This can be a bad thing if your code is full of bugs, or sloppy. So if you plan on releasing your code, take the extra care necessary to ensure that the code reflects your best effort. Your code, and you as a developer, will benefit from the extra TLC.

The next big thing

You never know which crazy, off-the-wall idea will turn into the next big thing. Who would have thought that there was a real need for an application that lets you tell the world what you’re doing right now? If your idea for a sandbox application turns into something with real business value, you never know where it will end up. Large companies will often spend big bucks to buy great ideas. I’d imagine it would be pretty cool to be on the receiving end of one of those deals.

Summary

Creating a sandbox, and a sandbox application is something that every serious developer should do. If nothing more, it will give you a place to tinker with new technologies and grow as a developer. There is little to no cost to set it up, and your imagination is the only limit.