The days of the relational database being a one-stop-shop for all of your persistence needs are over. A new class of application is beginning to emerge with requirements that exceed the capabilities of the relational database. Some of these applications need unlimited scalability or bullet proof fault tolerance, while others may require blazing fast access or flexible data storage. The relational database was simply not designed to meet the needs of this small but growing class. Instead, a new breed of data stores are gaining momentum. These data stores are looking at data persistence with a fresh set of eyes, diverging from the relational model considerably in order to meet these challenges.
What’s wrong with the relational database?
For 99% of the applications out there, absolutely nothing. The relational database has been the industry standard for data storage over the past 30+ years for good reason. It is an incredibly capable piece of software. Although it may not be the best tool for everything it is used for, it certainly satisfies the needs of the vast majority of applications just fine.
However, while not new, the class of applications mentioned above are becoming more common. These applications either handle enormous amounts of traffic, or deal with tremendous amounts of data. The relational database falls short in a few areas when trying to meet the demands of an application like this.
A single database server is usually not enough to support these requirements. Applications like this need a true database cluster, capable of adding storage space and processing power on the fly without the application even noticing. However, relational databases weren’t designed to operate in a cluster where all machines are capable of reading and writing data. This is largely due to the promises they make regarding data integrity. In order to fulfill these promises, the database needs easy, quick access to all of the data at all times to verify that duplicates aren’t being inserted, constraints aren’t being violated, etc. This quickly becomes a bottleneck when dealing with very large amounts of data.
There are techniques for scaling out relational databases, but they don’t address every concern. One popular technique is to use one or more slave databases for read requests, while continuing to funnel all write requests through the master database. The master database constantly synchronizes with the read only databases, so the data remains consistent between databases. This technique works great for read heavy applications, but does not help applications that perform just as many creates, updates, and deletes. Data sharding is another popular technique, which involves splitting the data up onto several different databases based on some criteria. But this pushes an extraordinary amount of complexity onto the application, as it is now responsible for determining which database to use for specific data sets. Master-master replication can be used to keep multiple master databases in sync, so any database server can perform read or write operations. However, for some applications there comes a point where the replication can’t keep up with the traffic.
Relational databases are also (intentionally) very strict when it comes to the structure of the data being stored. Data must be broken up into a series of rows and columns. Good object/relation mapping tools hide much of this awkwardness from us, but some applications deal with data that doesn’t map well into rows and columns. A simple key/value store is usually a better fit for applications like this.
How does the new breed address these problems?
The new breed of data stores, called NoSQL databases, make very few promises regarding data integrity. In this new model, data integrity becomes the application’s concern. By not having to enforce any complex data integrity rules, NoSQL databases can scale to levels way beyond that of a relational database. Adding more processing power or storage capacity can be as simple as adding a new machine to the cluster. The database can then store and process the data using any machine in the cluster.
In this model, the data being stored is self contained, and does not rely on any other data in the database. Therefore there is no need for one machine to know anything about the other machines in the cluster. This approach is quite different from the relational model, where data is broken up into multiple tables to eliminate duplicate data, and joined back together when being accessed.
Most of these databases subscribe to a theory called eventual consistency. In situations where duplicate information is scattered across different servers in the cluster, it is not feasible for the database to find all instances of that data and update it as a part of the original operation. Instead, the data will be replicated to the other database servers at a later time. Until that replication takes place, the application will be in an inconsistent state, where simultaneous queries fetching the same data could return different results. Although this sounds terrible, it turns out that in practice it is really not too big of a deal for most applications. Do all customers of an online retailer need to see the exact same set of product reviews 100% of the time? Probably not.
Also, because there are few promises regarding data integrity, NoSQL databases can offer data storage that is much more flexible. The database no longer has to enforce the uniqueness of a column, or ensure that the id of some referenced piece of data actually exists in the database. Some of these databases are true key/value data stores, where you can store just about anything. Others require a certain document format to be used (such as JSON or XML), but still allow you to freely change the contents of that document as you wish.
Still no one-stop-shop for persistence
Although NoSQL databases address some issues that can’t be addressed by relational databases, the opposite is true as well. The relational database offers an unparalleled feature set. While some of these features prevent it from serving the needs of the class of applications described above, they are absolutely required by other classes of applications. In some domains, data integrity is the number one concern. You need to look no further than the classic “try to withdraw money from the same account at the same time” example to justify the need for locks and transactions.
For the vast majority of applications out there, relational databases work great. There are a boat load of tools and libraries that support them, and software developers are very familiar with how to use them. It is safe to say that the relational database has secured its spot in IT departments and data centers around the world, and it isn’t going anywhere. It is far from dead.
An increasing amount of case studies are appearing that describe how real world applications are needing the data integrity offered by the relational database in addition to the benefits offered by NoSQL databases. I believe this trend will continue, as companies are storing more data than ever, and processing that data in different ways than previously imagined.
To address these needs, some companies are beginning to run their relational database side-by-side with one or more of the NoSQL alternatives. Extremely large data sets that require scalable storage space and processing power are moved to a NoSQL database, while everything else, especially data that needs its integrity kept in-check, remains in the relational database. The term Polyglot Persistence has been used to describe the use of multiple databases within the same project.
The benefits of polyglot persistence
The benefits are somewhat obvious. By running a relational database side-by-side with a NoSQL database, you get the best of both worlds. Strict enforcement of data integrity from the relational database, and the scalability and flexibility provided by the NoSQL database. This allows you to use the best tool for the job, depending on your use case.
There are a few scenarios where I’ve seen systems take advantage of polyglot persistence. The first scenario involves the need to perform some set of complex calculations on an extremely large data set. The data is either copied/moved from the relational database to the NoSQL database, or inserted directly into the NoSQL database by the application. The application can then use a cluster of NoSQL database servers can then divide the work, process the data, and aggregate the results. The more machines you have in your cluster, the less time the processing will take. The resulting data can either remain in the NoSQL database or be inserted into the relational database, depending on what needs to be done with the results.
The other scenario takes advantage of the schema-less nature of some NoSQL databases. While it is certainly possible to store a serialized data structure in a single column of a relational database, interacting with that data can be a bit more challenging than if that data were in a schema-less, document oriented database. This use case, after all, is what the documented oriented databases were designed for. These types of databases simply treat the data as a collection of key/value pairs, identified by a unique ID. The NoSQL databases provide ways in which you can add structure back into the document so the data inside the document can be queried. These databases are great for storing data that can be radically different from document to document, or data whose structure changes constantly.
The challenges of polyglot persistence
Polyglot persistence comes with its own set of challenges. While potentially getting the best of both worlds as far as features go, you get the complexity and hassle of dealing with not only multiple databases, but multiple databases models.
Determining which database to use to store certain data
With more than one database, you now have to decide where to store the data. It’s no longer a given. If you make the wrong decision, you could be looking at a painful migration from one database model to another as a result. To make this decision, you need to carefully examine how the data will be used.
Increased application complexity
Applications also face increased complexity as they now have to interface with two different (potentially very different) data stores. If done correctly, you should be able to isolate this complexity to the persistence layer of your application, freeing the rest of the application from having to know what database specific data is coming from. But, interfacing to multiple data stores could greatly increase the complexity of that data persistence layer. Your application will now need to know:
- How to connect to each of the databases
- What database to use for specific sets of data
- How to handle the different types of errors from each database
- How to map results from each database back to your application’s object model
- How to handle queries for information across databases
- How to mock out the different databases for testing
- Potentially, how to move data from one database to another
Addressing these concerns could result in a bunch of new application code, and with added code usually comes added complexity, and more bugs.
Increased deployment complexity
In addition to the increased application complexity, you will also face increased deployment complexity.
- Will you need to provision new hardware to host the new database?
- How will you backup the data in the new database?
- How will you manage and control changes to the configuration of the new database?
Training for developers and operational staff
Given that this database will likely be radically different from the relational database that your developers and operational staff are comfortable with, how will you bring them up to speed on how to use and manage this new database? And, given that the majority of the NoSQL databases are still very young, how will you keep your developers and operational staff up to speed with the latest developments on the project?
This is a big issue, especially in companies with large development and operations teams, and needs to be thought through carefully.
- Is there an expert you can hire to help you get up and running, and mentor your staff?
- Is there any training available that you can give to your staff?
- Who can you turn to for support when something goes wrong in production?
I’ve always been a big advocate of using the right tool for the right job. For the past 30 years, the relational database has been the de-facto standard for persistence. Creative people have managed to utilize and manipulate it to serve all sorts of different use cases, quite successfully. But just because you can fit a square peg through a round hole if you hit it with a big enough hammer doesn’t necessarily mean that you should.
NoSQL databases can be great tools for addressing data persistence cases that the relational database struggles with. In addition, each NoSQL database brings its own set of strengths and weaknesses to the table. They are becoming very important tools to have around, and I believe that our industry will see a steady increase in the adoption of these tools going forward.