Managing Development Data for a Service Oriented Architecture

A service oriented architecture (SOA) provides many benefits. It allows for better separation of responsibilities. It simplifies deployment by letting you only deploy the services that have changed. It also allows for better scalability, as you can scale out only the services that are being hit the hardest.

However, a SOA does come with some challenges. This blog post addresses one of those challenges: managing a common dataset for a SOA in a development environment.

The Problem

With most SOAs there tends to be some sharing of data between applications. It is common for one application to store a key to data which is owned by another application, so it can fetch more detailed information about that data when necessary. It is also possible that one application may store some data owned by another application locally, to avoid calling the remote service in certain scenarios. Either way, the point is that in most SOAs the applications are interconnected to some degree.

Problems arise with this type of architecture when you attempt to run an application in isolation with an interconnected dataset. At some point, Application A will need to communicate with Application B to perform some task. Unfortunately, simply starting up Application B so Application A can talk to it doesn’t necessarily solve your problem. If Application A is trying to fetch information from Application B by key, and Application B does not have that data (the application datasets are not “in sync”), then the call will obviously fail.

Stubbing Service Calls

Stubbing service calls is one way to approach this issue. If Application A stubs out all calls to Application B, and instead returns some canned response that fulfills Application A‘s expectations, then there is no need to worry about making sure Application B‘s data is in sync. In fact, Application B doesn’t even need to be running. This greatly simplifies your development environment.

Stubbing service calls, however, is very difficult to implement successfully.

First, the stubbed data must meet the expectations of the calling application. In some cases, Application A will be requesting very specific data. Application A, for example, may very well expect the data coming back to contain certain elements or property values. So any stubbing mechanism must be smart enough to know what Application A is asking for, and know how to construct a response that will satisfy those expectations. In other words, the response simply can’t contain random data. This means the stubbing mechanism needs to be much more sophisticated (complex).

Calls that mutate data on the remote service are especially difficult to handle. What happens when the application tries to fetch information that it just changed via another service call? If the requests are stubbed, it may appear that the call to mutate the data had no effect. This could lead to buggy behavior in the application.

Also, if you stub out everything, you’re not really testing the inter-application communication process. Since you’re never actually calling the service, stubs will completely hide any changes made to the API your application uses to communicate with the remote service. This could lead to surprises when actually running your code in an environment that makes a real service call.

Using Production Data

In order for service calls to work properly in a development environment, the services must be running with a common dataset. Most people I’ve spoken with accomplish this by downloading and installing each application’s production dataset for use in development. While this is by far the easiest way to get up and running with a common dataset, it comes with a very large risk.

Production datasets typically contain sensitive information. A lost laptop containing production data could easily turn into a public relations disaster for your company, and more importantly it could lead to severe problems for your customers. If you’ve ever had personal information lost by a third party then you know what this feels like. Even if your hard drive is encrypted, there is still a chance that a thief could gain access to the data (unless some sort of smart card or biometric authentication system is used). The best way to prevent sensitive information from being stolen is to keep it locked up and secure, on a production server.

Using Scrubbed Production Data

Using a production dataset that has been scrubbed of sensitive information is also an option. This approach will get you a standardized dataset, without the risk of potentially losing sensitive information (assuming your data scrubbing process is free of errors).

However, if your dataset is very large, this may not be a feasible option. My MacBook Pro has a 256GB SSD; I know of datasets that are considerably larger than that. In addition, you have less control over what your test data looks like, which could make it harder to test certain scenarios.

Creating a Standardized Dataset

The approach we’ve taken at Centro to address this issue is to create a common dataset that individual applications can use to populate their database. The common dataset consists of a series of YAML files, and is stored in a format that is not specific to any particular application. The YAML files are all stored together, with the thought that conflicting data is less likely to be introduced if all of the data lives in the same place.

The YAML files may also contain ERB snippets. We are currently using ERB snippets to specify dates.

- id: 1
  name: Test Campaign
  start_date: <%= 4.months.ago.to_date %>
  end_date: <%= 3.months.ago.to_date %>

Specifying relative dates using ERB, instead of hard coding them, gives us a dataset that will not grow stale with time. Simply re-seeding your database with the common dataset will give you a current dataset.
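To make the idea concrete, here is a minimal sketch of how a seed file like this can be rendered and loaded. It uses stdlib Date arithmetic in place of the ActiveSupport `4.months.ago` helper shown above, so it runs anywhere:

```ruby
require 'erb'
require 'yaml'
require 'date'

# A seed file is just YAML with embedded ERB. Here we approximate
# "4.months.ago" with stdlib Date arithmetic (Date#<< subtracts months).
raw = <<~SEED
  - id: 1
    name: Test Campaign
    start_date: <%= (Date.today << 4).iso8601 %>
    end_date: <%= (Date.today << 3).iso8601 %>
SEED

# Render the ERB first, then parse the resulting YAML. Date must be
# permitted explicitly when using safe_load.
campaigns = YAML.safe_load(ERB.new(raw).result, permitted_classes: [Date])
```

Because the dates are computed at load time, re-running this tomorrow produces a dataset that is just as fresh as it is today.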

Manually creating the standardized dataset also lets us expose edge cases that are rarely seen in production data, so we can better test how the application handles them.

Importing the Standardized Data into the Application’s Database

A collection of YAML files by itself is useless to the application. We need some way of getting that data out of the YAML files and into the application’s database.

Each of our applications has a Rake task that reads the YAML files that contain the data it cares about, and imports that data into the database by creating an instance of the model object that represents the data.

This process can be fairly cumbersome. Since the data in the YAML files is stored in a format that is not specific to any particular application, attribute names will often need to be modified to match the application’s data model. It is also possible that attributes in the standardized dataset will need to be dropped, because they are unused by a particular application.

We solved this, and related issues, by building a small library that is responsible for reading the YAML files and providing the Rake tasks with an easy-to-use API for building model objects from the data they contain. The library provides methods to iterate over the standardized data, map attribute names, remove unused attributes, and find related data (perhaps in another data file). This API greatly simplifies the application’s Rake tasks.
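The exact API belongs to our library, but the attribute-mapping idea can be sketched in a few lines of plain Ruby. The method and option names here are illustrative, not the library's actual interface:

```ruby
# Hypothetical helper: rename standardized attributes to match an
# application's own data model, and drop attributes it does not use.
def adapt_attributes(attributes, rename: {}, drop: [])
  attributes
    .reject { |key, _value| drop.include?(key) }
    .map    { |key, value| [rename.fetch(key, key), value] }
    .to_h
end

seed = { "name" => "Test Campaign", "start_date" => "2013-01-01", "cpm" => 12.5 }

# This application calls the campaign name "title" and has no "cpm" column.
adapted = adapt_attributes(seed, rename: { "name" => "title" }, drop: ["cpm"])
```

Each application's Rake task supplies its own rename/drop rules, so the shared YAML files never need to know about any one application's schema.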

In the code below, we are iterating over all of the data in the campaigns.yml file, and creating an instance of our Campaign object with that data.

require 'application_seeds'

namespace :application_seeds do
  desc 'Dump the development database and load it with standardized application seed data'
  task :load, [:dataset] => ['db:drop', 'db:create', 'db:migrate', :environment] do |t, args|
    ApplicationSeeds.dataset = args[:dataset]
    seed_campaigns
  end

  def seed_campaigns
    ApplicationSeeds.campaigns.each do |id, attributes|
      ApplicationSeeds.create_object!(Campaign, id, attributes)
    end
  end
end
With the Rake task in place, getting all of the applications up and running with a standardized dataset is as simple as requiring the seed data library (and the data itself) in the application’s Gemfile, and running the Rake task to import the data.
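Given the task definition above, the invocation looks something like this (the dataset name is illustrative):

```shell
# Load the standardized "my_dataset" seed data into this app's database.
# Quote the brackets if your shell (e.g. zsh) globs them.
rake "application_seeds:load[my_dataset]"
```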

The application_seeds library

The library we created to work with our standardized YAML files, called application_seeds, has been open sourced, and is available on GitHub.

Drawbacks to this Approach

Making it easy to perform real service calls in development can be a double-edged sword. On one hand, it greatly simplifies working with a SOA. On the other hand, it makes it much easier to perform service calls while ignoring the potential issues that come with calling out to a remote service. Service calls should be limited, and all calling code should be capable of handling the issues that may result from a call to a remote service (the service being unavailable, high latency, etc).

Another drawback is that test data is no substitute for real data. No matter how carefully it is constructed, the test dataset will never contain all of the possible combinations that a production dataset will have. So, it is still a good idea to test with a production dataset. However, that testing should be done in a secure environment, where there is no risk of the data being lost (like a staging environment).

Choosing the Right Host for Your Web Application

There is no shortage of options when it comes to web application hosting. However, the capabilities of these hosts can vary widely from one to the next. Therefore, it is important to understand what all of your options are, and which one is the best fit for your application.

Shared Hosting

Shared hosting (HostMonster and 1and1) typically gives you a user account on a shared server. The tools provided can vary quite a bit from host to host, but most provide a few ways to upload your site to the server (FTP, SSH, etc), limited control over the web server via .htaccess files, email accounts, the ability to create a specified number of databases (typically MySQL or Postgres), and support for a few programming languages (typically PHP, Python, Ruby).

Shared hosting is great for static web sites. It is very inexpensive, usually includes a free domain name, and most plans offer more disk space and bandwidth than you will ever need. It is also fairly easy to get up and running. Simply upload your website into a certain directory, and you’re good to go.

However, the restricted environment can be a hassle if you’re trying to deploy a more complex web application. You may not have the ability to install certain software. And, the fact that you are using a shared server means that resources such as memory and CPU will be shared by all of the users on that machine. The memory footprint of most web applications would likely exceed the per-account memory quota specified by a shared hosting provider.

Shared hosting is good for static sites and very basic dynamic sites (CGI scripts, embedded PHP, etc), but not much else.

Traditional VPS

Virtual Private Server hosting (Linode) gives you access to your own virtual server on shared hardware. After you select the operating system, amount of memory, and disk space you would like for your server, the VPS host will build your VPS from a default disk image it has created. When your VPS comes online, you will be able to access it via SSH.

You have complete control over the software you run on the server. This is an absolute must for the hosting of any non-trivial web application. VPS solutions are also reasonably priced for the amount of flexibility they provide. And while your VPS runs on the same physical machine as others, you retain exclusive control over your environment.

But, don’t forget, you’re running on shared hardware. While you have dedicated slices of certain resources, like memory and disk space, other resources like CPU time are up for grabs. If your VPS happens to reside on the same physical machine as another VPS that is using a large chunk of the CPU, your application may appear sluggish as a result.

You are also responsible for securing your VPS. Even though your server is “virtual”, it is still a server on the internet, and can certainly be hacked. Therefore, you need some system administration knowledge to properly secure and administer your server.

Scalability can also be a challenge with VPS hosting. If it turns out you need more memory or disk space for your server, you need to “resize” it. Most providers can do this automatically, making sure that all of your data and services remain untouched. However, your server needs to be taken offline during part of the resizing process.

Traditional VPS hosting is good for web applications with low / predictable traffic.

Dedicated Server Hosting

Dedicated server hosting (razorservers) gives you just that: your own dedicated, internet connected server. Dedicated server hosting shares many of the same pros and cons as VPS hosting, with a few exceptions.

With dedicated hosting, there is nothing to share, so you don’t have to worry about a greedy neighbor sucking up all of the CPU. And, since you are the only resident on this machine, it is more secure than the multi-tenant options.

However, this control and security doesn’t come cheap. Dedicated servers can be quite expensive. And since you’re dealing with a physical machine, upgrading it (to increase the amount of RAM or disk space for example) is not as simple as it is with the VPS options. The hosting company actually needs to take the machine offline, and physically install the upgrade.

Dedicated server hosting is good for web applications with high / predictable traffic, or for services that need dedicated resources (like a database server).

Application Platform Hosting

Application platform hosting (Heroku, Google App Engine) doesn’t provide a specific machine to run your web application on. In fact, you don’t have access to the machine at all. Instead, you are provided with a set of tools that can be used to deploy and manage your application. You are also given a set of services that your application can use for data storage, email delivery, etc.

Application hosting platforms are a great way to deploy a new application. They take care of all of the deployment details, letting you focus on building the application. In addition, they usually provide a free plan that works well for low traffic applications. I am currently using one (Heroku) for AuroraAlarm. Scalability is also a big feature of these platforms (this is typically when you break out of the free plan). I can only speak for Heroku, since that is the only application platform host I’ve used, but extra instances of the application can be spun up with a simple command or action on the application’s control panel UI. These extra instances can be taken offline just as easily, letting you deal with spikes in traffic, only paying for the extra capacity when you need it.

While easy to get up and running, and capable of handling large spikes in traffic, application platform hosting is not for everyone. These hosts typically have certain restrictions on their environment in order to support quick scaling. Heroku, for example, doesn’t let you write to the file system. You also don’t have access to the machine, and cannot install any software that may be needed for your application. Heroku addresses this issue with a large selection of plugins, but heavy plugin usage can quickly get expensive. In addition, heavy reliance on plugins or other services only provided by the host can help lock you in to that particular vendor, making it difficult to switch hosts down the road.

Application platform hosting is good for getting web applications up and running fast with little effort, and for web applications that have low average traffic but need to handle occasional bursts.

Cloud Hosting

Cloud hosting (Rackspace, Amazon EC2) is very similar to VPS hosting…but on steroids. With cloud hosting, you can easily set up and/or tear down individual VPS instances, letting you scale out or in as necessary. Cloud hosting companies usually provide a few different mechanisms for doing this, such as an account management dashboard or an API.

Cloud hosting provides all of the same benefits as VPS hosting, but with better scalability. If your application is designed to be deployed on more than one machine, then it is not too difficult to set it up so that you can easily scale out the parts of the application that get the most traffic, and have all of them point to the same set of shared services (the database, memcache server, etc). With cloud hosting providers, you only pay for what you use.

The biggest downside to cloud hosting is that it complicates your deployment. If you’re not careful, managing your production environment can quickly become a pain. Care must be taken to ensure that the same versions of the same software are running on all “identical” cloud boxes. Tools like capistrano can deploy your application to multiple servers at the same time. You must also ensure that everything is configured identically on all of the machines. Tools like puppet and chef can help manage this.

Cloud hosting is good for large applications that may consist of several sub applications (components) or may need to scale out to handle large amounts of traffic.

Hybrid Hosting

Hybrid Hosting (Rackspace RackConnect) is a mix of dedicated server hosting and cloud hosting, combining the two to offer the reliability of dedicated servers with the flexibility of cloud servers.

With hybrid hosting, you can keep your resource intensive services (like your database) on dedicated hardware to ensure they have all of the resources they need. Then, you can deploy the parts of your application that need to scale on demand to the hybrid hosting company’s cloud. All machines can reside in the same data center, allowing for low latency between the dedicated servers and the cloud servers.

However, with hybrid hosting, you get the downsides of dedicated hosting and cloud hosting as well. Like dedicated server hosting, hybrid hosting is not cheap. And, as with cloud hosting, deployment of your application becomes much more complicated. Even more so with hybrid hosting, because you need to determine which services get deployed to the dedicated servers, and which services get deployed to the cloud servers.

Hybrid hosting is good for large applications with services that require dedicated hardware AND the ability to scale out other services on demand to deal with large amounts of traffic.


There are a ton of options out there when it comes to web application hosting. Each type brings its own strengths and weaknesses to the table, so choosing the right one really depends on your application. Once you determine the right type of hosting, the next step is identifying a hosting company to work with. There is no shortage of options there as well, but we’ll save that discussion for another day…

Introducing AuroraAlarm

I just finished up work on my latest side project, AuroraAlarm.

AuroraAlarm is a FREE service that will notify you via SMS when conditions are optimal for viewing the Aurora Borealis in your area.

I have always enjoyed watching the weather and the stars. However, I’m not much of a night owl, and every attempt I’ve made to stay up late enough to see an Aurora after a solar event has failed, miserably. Time and time again I would fall asleep, and wake up to find that the conditions were great for viewing an Aurora the night before.

I wanted something that would wake me up if there was a possibility of seeing an Aurora. So, I created AuroraAlarm to do just that.

How it works

Every evening, AuroraAlarm checks to see if any solar events have occurred that could trigger a geomagnetic storm, which can produce an Aurora. If a solar event has occurred, it will send a text message notifying all users of the event, asking if they would like to receive a text message during the next few nights if conditions are optimal for viewing the Aurora.

If they indicate that they would like to be notified, AuroraAlarm will send them a text message at night if conditions are optimal for viewing the Aurora in their area.

What are optimal conditions?

  • Dark. Likely in the hours past midnight.
  • A Kp index strong enough to push the Aurora down to their geomagnetic latitude.
  • Clear skies.
  • A dark moon.

The goal

I have no idea if the aurora will be visible from where I live. Honestly, I think there may be a bit too much ambient light for me to see it. But, with the help of AuroraAlarm, at least I’ll be able to find out. And, I’m really not too far (about a 15 minute drive) from some REALLY dark areas. I certainly wouldn’t rule out making this short trip if it allowed me to see one of nature’s most fantastic displays.

The Beauty of Redis

If I had to name a single piece of software that has impressed me lately, it would undoubtedly be Redis. Not only is this key/value store on steroids blazing fast, but it is also very simple, and incredibly powerful.


How simple, you ask?

redis> SET mykey "Hello"
OK
redis> GET mykey
"Hello"

That simple.

It’s also a breeze to install and get up and running. The suggested way of installing Redis isn’t to fetch some pre-compiled package for your Linux distribution. It is to download the source code (a tiny 655K tarball) and build it yourself! This can be a real crap shoot for most software, but since Redis only depends on a working GCC compiler and libc, it is not an issue at all. It just works.

After it is installed, you can start it by simply running

redis-server

at the command line. The quickstart guide also has some easy-to-follow instructions on how to start Redis automatically at boot as a daemon.


Redis is a very powerful piece of software. This power, I believe, is a direct result of its simplicity.

Redis is so much more than your run of the mill key/value store. In fact, calling it a key/value store would be like calling the Lamborghini Aventador a car. It would be far more accurate to call Redis a key/data structure store, because Redis natively supports hashes, lists, sets, and sorted sets as well. These data structures are all first class citizens in the Redis world. Redis provides a host of commands for directly manipulating the data in these data structures, covering pretty much any operation you would want to perform on a hash, list, set, or sorted set. Therefore, it is super simple to perform tasks like incrementing the value of a key in a hash by 1, push multiple values onto the end of a list, trim a list to the specified range, perform a union between two sets, or even return a range of members in a sorted set, by score, with scores ordered from high to low.
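Each of the operations listed above maps to a single command. Key names here are illustrative:

```
redis> HINCRBY stats views 1
redis> RPUSH queue "a" "b" "c"
redis> LTRIM queue 0 99
redis> SUNION tags:post1 tags:post2
redis> ZREVRANGEBYSCORE leaderboard +inf -inf WITHSCORES
```

There is no query language and no schema to set up; you just call the command for the structure you are working with.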

This native support for data structures, combined with Redis’ incredible performance, makes it an excellent complement to a relational database. Every once in a while we’ll run into an issue where, despite our best efforts, our relational database simply isn’t cutting the mustard, performance-wise, for a certain task. Time and time again, we’ve successfully been able to delegate these tasks to Redis.

Here are some examples of what we are currently using Redis for at Signal:

  • Distributed locking. The SETNX command (set value of key if key does not exist) can be used as a locking primitive, and we use it to ensure that certain tasks are executed sequentially in our distributed system.
  • Persistent counters. Redis, unlike memcache, can persist data to disk. This is important when dealing with counters or other values that can’t easily be pulled from another source, like the relational database.
  • Reducing load on the relational database. Creative use of Redis and its data structures can help with operations that may be expensive for a relational database to handle on its own.
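The locking primitive mentioned above works because SETNX only succeeds for the first client (the key name is illustrative):

```
redis> SETNX lock:nightly-job 1
(integer) 1    # lock acquired
redis> SETNX lock:nightly-job 1
(integer) 0    # someone else holds the lock
redis> DEL lock:nightly-job
(integer) 1    # lock released
```

In practice you also want an expiry on the lock key, so a client that crashes while holding the lock cannot block everyone else forever.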

When Not To Use Redis

Redis stores everything in RAM. That’s one of the reasons why it is so fast. However, it is something you should keep in mind before deciding to store large amounts of data in Redis.

Redis is not a relational database. While it is certainly possible to store the keys of data as the values of other data, there is nothing to ensure the integrity of this data (what a foreign key would do in a relational database). There is also no way to search for data other than by key. Again, while it is possible to build and maintain your own indexes, Redis will not do this for you. So, if you’re looking to store relational data, you should probably stick with a relational database.


It is very clear that the Redis team has put a ton of effort into making sure that Redis remains simple, and they have done an amazing job.

It’s worth pointing out that Redis has some of the best online documentation that I have ever seen. All commands are easy to find, clearly documented, with examples and common patterns of usage. AND the examples are interactive! Not sure what the result of a certain command will be? No need to install Redis and fire it up locally. You can simply try it right there in the browser.

With client libraries in virtually every programming language, there is no reason not to give it a try. You’ll be glad you did.

Want to Build a Better Web API? Build a Client Library!

A solid web API can be an important thing to have. Not only is it great to give users direct access to their data, but exposing data and operations via a web API enables your users to help themselves when it comes to building functionality that doesn’t really make sense in the application itself (or functionality that you never really thought of). It’s also a great way for users to get more familiar with your service.

However, if your API sucks, you can rest assured that nobody will touch it. We’ve all had to deal with crappy web APIs, the ones that make you jump through hoops in order to perform a task that should be dead simple to do. Web APIs should make the simple tasks easy, and the hard tasks possible. To add to the challenge, APIs are notoriously difficult to change. Even with a solid versioning scheme, it is often a real chore to get your users to stop using the deprecated API in favor of the new version. So, it’s important to do a good job the first time.

When building a web API, identifying the tasks that one might want to perform can sometimes be difficult when you’re surrounded by JSON, XML, GETs, POSTs, PUTs, DELETEs, and HTTP status codes. While it can be easy to see what single actions you would want to expose, seeing how those actions may interact with each other can be much more difficult. Sometimes you need to take a step back, away from the land of HTTP, in order to see your API as another programmer would see it.

Building a client library that wraps your web API is a great way to do this. It’s relatively easy to imagine how your requests and responses could be represented as objects. The largest benefit of this exercise is to take it a step further, and give the user of your client library the ability to determine what they should do next. Simply knowing if an API call succeeded or failed is usually not enough. Users of your client library need to be able to determine why the request failed, and understand what they can do about it. This extends well beyond the lifecycle of a single HTTP request and response.

Communicating errors

There are several different ways to communicate errors to the user. The proper use of HTTP status codes is one such way. The 4xx class of status codes are specifically intended to be used to communicate that something was wrong with the client’s request. If your API methods are simple, and specific in their purpose, you may be able to rely on HTTP status codes alone to communicate the various causes of failure to the client.

If your API method is complex, and could result in many different failure scenarios, you should first try to break it down into smaller, more specific API methods :) If that can’t be done, then another option is to return some easily parseable text in the response body (JSON or XML) that includes an error code that identifies the specific failure scenario. The response body could be as simple as:

{ "error_code" : 123 }

You could also provide a description of the error in the response as well. This helps users getting started with the API, saving them from having to constantly refer to your API’s documentation every time they get an error:

  "error_code" : 123,
  "error_message" : "A widget with that name already exists"

The important thing is that all failure cases be easily identifiable via a specific, documented code (HTTP status code or custom error code). Error messages should be seen as purely supplemental information. At no point should your users have to parse the error message to determine what happened, or what they should do next.
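On the client side, these documented codes are what make failures actionable. A minimal sketch of how a client library might translate them into exceptions (the class names and code numbers are illustrative):

```ruby
require 'json'

# Hypothetical client-side error handling: map documented error codes
# to specific exception classes that callers can rescue.
class ApiError < StandardError; end
class WidgetAlreadyExistsError < ApiError; end

ERROR_CLASSES = { 123 => WidgetAlreadyExistsError }.freeze

def handle_response(status, body)
  return JSON.parse(body) if (200..299).cover?(status)

  payload = JSON.parse(body) rescue {}
  error_class = ERROR_CLASSES.fetch(payload["error_code"], ApiError)
  # The message is supplemental; callers rescue on the class, and never
  # parse the message text to decide what to do.
  raise error_class, payload["error_message"] || "HTTP #{status}"
end
```

A caller can now `rescue WidgetAlreadyExistsError` and recover, instead of string-matching an error message that might change tomorrow.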

Isn’t this the same as “dogfooding”?

Not exactly. Dogfooding simply involves using what you have created. You could easily dogfood your web API by firing HTTP requests at it using a simple HTTP client library. It is not until you need to take different actions based on different responses that you really start to see if you are properly communicating the result of the request. Building a client helps with this because it forces you to think about the different results and error scenarios in order to decide how your client should handle them. Which failures should raise exceptions? What sort of exception should be raised? How should non-exceptional failures be communicated to the caller?

The next step in this process would be to build an application that uses your client library. That step could help identify issues with your client library, just like building the client library helps identify issues with your web API.

The client library

Oh, and don’t forget. At the end of the day, you’ll end up with a better designed web API, AND a great client library that your users can use to interact with your system. Not a bad deal!