Migrating Data – Rails Migrations or a Rake Task?

I’ve always thought that Migrations were one of Rails’ best features. In one of the very first projects I worked on as a n00b software engineer straight out of college, schema and data migrations were a very large, and painful, part of that project’s deployment process. Being able to specify how the schema should change, and being able to check in those changes along with the code that depends on those changes, was a huge step forward. The ability to specify those changes in an easy to understand DSL, and having the ability to run arbitrary code in the migration to help make those changes, was simply icing on the cake.

But, the question of how to best migrate data, not the schema, often comes up. Should data migrations be handled by Rails migrations as well? Should they be handled by a script, or a rake task instead? There are pros and cons to both approaches, and it depends on the situation.

Using Rails Migrations

One of the features of Rails Migrations is that the app will fail to start up if you have a migration that has not been run. This is good, because running your app with an out-of-date schema could lead to all sorts of problems. Because of this, most deployment processes will run the migrations after the code has been deployed, but before the server is started. If your data migration needs to be run before the app starts up, then you can use this to your advantage by using a migration to migrate your data. In addition, if your data migration can be reversed, then that code can be placed in the migration’s down method, fitting nicely into the “migration way” of doing things.

However, there are some pitfalls to this approach. It is bad practice to use code that exists elsewhere in your application inside of a Rails Migration. Application code evolves over time. Classes come and go, and their interfaces can change at any time. Migrations on the other hand are intended to be written once, and never touched again. You should not have to update a migration you wrote three months ago to account for the fact that one of your models no longer exists. So, if migrating your data requires the use of your models, or any other application code, it’s probably best that you not use a migration. But, if you can migrate your data by simply using SQL statements, then this is a perfectly valid approach.

Using a Rake Task

Another way to migrate production data is to write a rake task to perform the migration. Using a rake task to perform the migration provides several clear advantages over using a Rails Migration.

First, you are free to use application code to help with the data migration. Since the rake task is essentially “throw away”, it can easily be deleted after it has been run in production. There is no need to change the rake task in response to application code changing. Should you ever need to view the rake task after it has been deleted, it is always available via your source control system. If you’d like to keep it around, that’s fine too. Since the rake task won’t be run again once it has been run in production, it can continue to reference classes that no longer exist, or use APIs that have changed.

Second, it is easier to perform test runs of the rake task. We usually wrap the rake task code within an ActiveRecord transaction, to ensure that if something bad happens, any changes will be rolled back. We can take advantage of this design by conditionally raising an exception at the end of the rake task, rolling back all of the changes, if we are in “dry run” mode (usually determined by an environment variable we pass to the task). This allows us to perform dry runs of the rake task, and use logging to see exactly what it will do, before allowing it to modify any data. With Rails Migrations, this is more difficult, as you need to roll back the migration as a separate step, and this is only possible for migrations that are reversible.
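
The dry run pattern described above can be sketched in plain Ruby. This is a minimal illustration rather than our actual task: with_dry_run, DryRunRollback, and the DRY_RUN variable name are all hypothetical, and the transaction is passed in as a callable so the sketch runs without Rails. In a Rails rake task you would pass ActiveRecord::Base.method(:transaction), which rolls back when an exception escapes the block.

```ruby
# Hypothetical error used only to force a rollback at the end of a dry run.
class DryRunRollback < StandardError; end

# Runs the given block inside a transaction. When dry_run is true, an
# exception is raised after the work completes so the transaction rolls back.
# `transaction` is any callable that takes a block, e.g.
# ActiveRecord::Base.method(:transaction) in a Rails app.
def with_dry_run(dry_run:, transaction:)
  transaction.call do
    yield
    raise DryRunRollback if dry_run
  end
  :committed
rescue DryRunRollback
  :rolled_back
end

# Inside a rake task (assumes Rails is loaded via the :environment task):
#
#   dry_run = ENV["DRY_RUN"] == "true"
#   with_dry_run(dry_run: dry_run, transaction: ActiveRecord::Base.method(:transaction)) do
#     # ... modify data, logging each change ...
#   end
```

Because the exception is raised only after all of the work has run, the log output of a dry run shows exactly what a real run would do.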

Finally, you can easily run the rake task whenever you want. It does not need to happen as a part of the deployment, and you do not need to alter your existing deployment process to push the code without running the migrations or restarting the server. This gives you some flexibility, and lets you pick the best time to perform the migration.

Our Approach

Generally, we use Rails Migrations to migrate our application’s schema, and Rake tasks to migrate our production data. There have only been a few cases where we have used Rails Migrations to ensure that a data migration took place as a part of the deployment. In all other cases, using a Rake task provides us with more flexibility, and less maintenance.

Another Approach?

Do you have another approach for migrating production data? If so, we’d love to hear it. Feel free to drop it in the comments.

Note: This article has been cross posted on the UrbanBound product blog.

Running a Private Gem Server

There are a few different ways to share code between Ruby applications, but perhaps the best known is by creating a Ruby gem. However, RubyGems.org (where gems are published by default) is a public system, and all gems published there are publicly available. This is a problem if you’re dealing with proprietary code that you need to keep private.

Thankfully, it is very easy to get up and running with your own private gem server.

Why not use Bundler to pull code from a private Git repository?

Another popular way of sharing code between applications is to use Bundler to pull code directly from a git repository. Pulling in code from a private git repository allows you to share code between applications while keeping that code private. However, this approach has some drawbacks.

Transitive dependencies don’t work well

Gems are able to specify their dependencies in the .gemspec file.

Gem::Specification.new do |spec|
  spec.add_dependency "activesupport"
end

But, what happens when you need to specify a dependency on a private gem? The gem specification does not allow you to point to a GitHub repository. Since we cannot specify the location of the dependency in the .gemspec, the only option is to do that in the application’s Gemfile.

In the gem’s .gemspec

# Some gem has a dependency on a private library
Gem::Specification.new do |spec|
  spec.name = "some-private-library"
  spec.add_dependency "another-private-library"
end

In the application’s Gemfile

# The application's Gemfile must tell bundler where to find that library
source 'https://rubygems.org'

gem 'some-private-library', git: 'git@github.com:jwood/some-private-library.git'

# Required by some-private-library
gem 'another-private-library', git: 'git@github.com:jwood/another-private-library.git'

The application should only need to care about the libraries that it directly depends on. It should not need to specify the location of transitive dependencies. This tightly couples the application to the implementation of the some-private-library gem, for no good reason.

It is less clear which version of the library you are using.

Looking at an example from a project’s Gemfile.lock, compare this:

  remote: git@github.com:redis/redis-rb.git
  revision: 9bb4156dcd43105b68e24e5efc579dddb12cf260
    redis (3.0.6)

with this:

    redis (3.0.6)

Which version of the library are you working with in the first example? How about the second? If you said 3.0.6 for the first example, then you may be incorrect. You’re actually using the version identified by 9bb4156dcd43105b68e24e5efc579dddb12cf260. That may in fact be the same version of the code with the 3.0.6 label. Or, possibly, it is pointing at a version of the code that is several commits past the 3.0.6 label.

In the second example, it is abundantly clear that you are using version 3.0.6 of the gem.
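
If you want to find out which gems in an existing project are pinned to a git revision rather than a released version, the lockfile can be scanned for GIT sections. The sketch below is a plain-text scan under the assumption that the lockfile uses Bundler's usual layout (a GIT header followed by indented remote, revision and specs entries); git_pinned_gems is a hypothetical helper, not part of Bundler.

```ruby
# Returns one hash per gem that Gemfile.lock resolves from a git repository.
# The version label is reported alongside the revision that is actually used.
def git_pinned_gems(lockfile_text)
  pins = []
  lockfile_text.split(/\n{2,}/).each do |section|
    header = section.lstrip.lines.first.to_s.chomp
    next unless header == "GIT"

    remote   = section[/^  remote: (.+)$/, 1]
    revision = section[/^  revision: (.+)$/, 1]

    # Gems provided by the repository are indented four spaces under "specs:".
    section.scan(/^    (\S+) \(([^)]+)\)$/) do |name, version|
      pins << { name: name, version: version, remote: remote, revision: revision }
    end
  end
  pins
end

# Usage: git_pinned_gems(File.read("Gemfile.lock")).each { |pin| puts pin.inspect }
```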

Harder to stick to released versions of the library.

Like any other codebase, libraries are often under continuous development. Commits are constantly being made. Branches are constantly being merged. While the contents of master should always work, it may not always represent what the library author/maintainer would consider releasable code. At some point, the maintainer of the library will deem the version of code in master releasable, tag the code, and cut a new version of the library. If you use Bundler to point to a private git repository, you could end up using code that is not considered releasable.

One way around this issue is to tag releases in git, and then use Bundler to pull in that version of the code.

gem 'rails', git: 'git://github.com/rails/rails.git', tag: 'v2.3.5'

While this will in fact lock you to the tagged version of the code, it makes updating the gem more difficult. You can no longer use bundle update rails to update to the newest released version of the gem. You will remain locked on the v2.3.5 tag until you remove the tag directive from the gem in your Gemfile.

How do I set up a private gem server?

Thankfully, setting up your own gem server is very easy. The fantastic Gem in a Box project is a self-contained, full-featured gem server. It is even recommended as the “way to go” by RubyGems.org if you want to run your own gem server.

After your gem server has been set up, you can continue to pull public gems from RubyGems.org, while pulling private gems from your new private gem server.
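
For reference, Gem in a Box runs as a Rack application, so a server can be as small as a config.ru file. The sketch below follows the pattern shown in the project's README as I remember it; the data directory is a placeholder, and you should check the current README before relying on it.

```ruby
# config.ru
require "rubygems"
require "geminabox"

# Directory where uploaded gems and the index are stored (placeholder path).
Geminabox.data = "/var/geminabox-data"

run Geminabox::Server
```

Any Rack-compatible server can then serve this file. You will almost certainly want to put authentication in front of it, since anyone who can reach the server can otherwise download your private gems.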

How can I publish gems to my private gem server?

It doesn’t take much work to get a set of rake tasks that you can use to publish gems to your private gem server.

First, in your .gemspec file, make sure you specify rake, bundler and geminabox as development dependencies. Make sure you use version 1.3.0 or greater of bundler, as 1.3.0 added a feature which we will take advantage of later.

  spec.add_development_dependency "rake"
  spec.add_development_dependency "bundler", ">= 1.3.0"
  spec.add_development_dependency "geminabox"

Next, require bundler’s gem tasks in your project’s Rakefile.

require "bundler/gem_tasks"

bundler/gem_tasks gives you three rake tasks that come in handy for publishing gems.

rake build    # Build some-private-library-0.0.1.gem into the pkg directory.
rake install  # Build and install some-private-library-0.0.1.gem into system gems.
rake release  # Create tag v0.0.1 and build and push some-private-library-0.0.1.gem to Rubygems

By default, the release task will publish your gem to Rubygems.org. But, that’s not what we want. With a few lines of code in our Rakefile, we can tell bundler to push the gem to our internal gem server instead.

# Don't push the gem to rubygems
ENV["gem_push"] = "false" # Utilizes feature in bundler 1.3.0

# Let bundler's release task do its job, minus the push to Rubygems,
# and after it completes, use "gem inabox" to publish the gem to our
# internal gem server.
Rake::Task["release"].enhance do
  spec = Gem::Specification.load(Dir.glob("*.gemspec").first)
  sh "gem inabox pkg/#{spec.name}-#{spec.version}.gem"
end

Now, to release your gem, you simply type rake release

% rake release
some-private-library 0.0.1 built to pkg/some-private-library-0.0.1.gem.
Tagged v0.0.1.
Pushed git commits and tags.
gem inabox pkg/some-private-library-0.0.1.gem
Pushing some-private-library-0.0.1.gem to http://gems.mydomain.com/...
Gem some-private-library-0.0.1.gem received and indexed.

The first time you do this, geminabox will prompt you for the location of your gem server. After that, it will remember.

How do I use the gems on my private gem server?

To use the gems on your private gem server, you simply need to tell bundler where to find them in your application’s Gemfile. This is done using the source command:

# The application's Gemfile

source 'https://rubygems.org'
source 'https://gems.mydomain.com'

gem 'some-private-library'  # Will be found on the private gem server
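
One caveat with listing two top-level sources is that Bundler has to decide which server to ask for each gem. Newer versions of Bundler (1.7 and later, as far as I know) let you scope a source to specific gems with a block, which makes the intent explicit. A sketch of the same Gemfile written that way, using the same placeholder server URL:

```ruby
# The application's Gemfile

source 'https://rubygems.org'

# Only the gems listed in this block are fetched from the private gem server.
source 'https://gems.mydomain.com' do
  gem 'some-private-library'
end
```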

Managing Development Data for a Service Oriented Architecture

A service oriented architecture (SOA) provides many benefits. It allows for better separation of responsibilities. It simplifies deployment by letting you only deploy the services that have changed. It also allows for better scalability, as you can scale out only the services that are being hit the hardest.

However, a SOA does come with some challenges. This blog post addresses one of those challenges: managing a common dataset for a SOA in a development environment.

The Problem

With most SOAs there tends to be some sharing of data between applications. It is common for one application to store a key to data which is owned by another application, so it can fetch more detailed information about that data when necessary. It is also possible that one application may store some data owned by another application locally, to avoid calling the remote service in certain scenarios. Either way, the point is that in most SOAs the applications are interconnected to some degree.

Problems arise with this type of architecture when you attempt to run an application in isolation with an interconnected dataset. At some point, Application A will need to communicate with Application B to perform some task. Unfortunately, simply starting up Application B so Application A can talk to it doesn’t necessarily solve your problem. If Application A is trying to fetch information from Application B by key, and Application B does not have that data (the application datasets are not “in sync”), then the call will obviously fail.

Stubbing Service Calls

Stubbing service calls is one way to approach this issue. If Application A stubs out all calls to Application B, and instead returns some canned response that fulfills Application A’s expectations, then there is no need to worry about making sure Application B’s data is in sync. In fact, Application B doesn’t even need to be running. This greatly simplifies your development environment.

Stubbing service calls, however, is very difficult to implement successfully.

First, the stubbed data must meet the expectations of the calling application. In some cases, Application A will be requesting very specific data. Application A, for example, may very well expect the data coming back to contain certain elements or property values. So any stubbing mechanism must be smart enough to know what Application A is asking for, and know how to construct a response that will satisfy those expectations. In other words, the response simply can’t contain random data. This means the stubbing mechanism needs to be much more sophisticated (complex).

Calls that mutate data on the remote service are especially difficult to handle. What happens when the application tries to fetch information that it just changed via another service call? If the requests are stubbed, it may appear that the call to mutate the data had no effect. This could possibly lead to buggy behavior in the application.
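
To make the mutation problem concrete, here is a deliberately naive stub. StubbedCampaignService and its methods are hypothetical names invented for this illustration; they do not correspond to any real service client.

```ruby
# A naive stub for a remote campaign service: every call returns canned data.
class StubbedCampaignService
  CANNED_CAMPAIGN = { "id" => 1, "name" => "Test Campaign" }.freeze

  # Always returns the canned campaign, whatever has happened before.
  def fetch(id)
    CANNED_CAMPAIGN.merge("id" => id)
  end

  # Pretends the rename succeeded, but nothing is stored anywhere.
  def rename(_id, _new_name)
    true
  end
end

service = StubbedCampaignService.new
service.rename(1, "Renamed Campaign")
service.fetch(1)["name"] # still "Test Campaign", so the rename appears to have been lost
```

Fixing this means giving the stub its own state, at which point it starts to become a small reimplementation of the remote service.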

Also, if you stub out everything, you’re not really testing the inter-application communication process. Since you’re never actually calling the service, stubs will completely hide any changes made to the API your application uses to communicate with the remote service. This could lead to surprises when actually running your code in an environment that makes a real service call.

Using Production Data

In order for service calls to work properly in a development environment, the services must be running with a common dataset. Most people I’ve spoken with accomplish this by downloading and installing each application’s production dataset for use in development. While this is by far the easiest way to get up and running with a common dataset, it comes with a very large risk.

Production datasets typically contain sensitive information. A lost laptop containing production data could easily turn into a public relations disaster for your company, and more importantly it could lead to severe problems for your customers. If you’ve ever had personal information lost by a third party then you know what this feels like. Even if your hard drive is encrypted, there is still a chance that a thief could gain access to the data (unless some sort of smart card or biometric authentication system is used). The best way to prevent sensitive information from being stolen is to keep it locked up and secure, on a production server.

Using Scrubbed Production Data

Using a production dataset that has been scrubbed of sensitive information is also an option. This approach will get you a standardized dataset, without the risk of potentially losing sensitive information (assuming your data scrubbing process is free of errors).

However, if your dataset is very large, this may not be a feasible option. My MacBook Pro has a 256GB SSD drive. I know of datasets that are considerably larger than 256GB. In addition, you have less control over what your test data looks like, which could make it harder to test certain scenarios.

Creating a Standardized Dataset

The approach we’ve taken at Centro to address this issue is to create a common dataset that individual applications can use to populate their database. The common dataset consists of a series of YAML files, and is stored in a format that is not specific to any particular application. The YAML files are all stored together, with the thought that conflicting data is less likely to be introduced if all of the data lives in the same place.

The YAML files may also contain ERB snippets. We are currently using ERB snippets to specify dates.

- id: 1
  name: Test Campaign
  start_date: <%= 4.months.ago.to_date %>
  end_date: <%= 3.months.ago.to_date %>

Specifying relative dates using ERB, instead of hard coding them, gives us a dataset that will not grow stale with time. Simply re-seeding your database with the common dataset will give you a current dataset.
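
Reading a file like this takes two steps: render the ERB, then parse the result as YAML. The sketch below uses only the Ruby standard library, so it replaces the ActiveSupport helpers from the example above (4.months.ago) with Date#<<, which subtracts whole months. load_seed_yaml is a hypothetical helper name, not part of our library.

```ruby
require "erb"
require "yaml"
require "date"

# Renders any ERB snippets in the text, then parses the result as YAML.
# Date is permitted explicitly because YAML turns ISO dates into Date objects.
def load_seed_yaml(text)
  rendered = ERB.new(text).result
  YAML.safe_load(rendered, permitted_classes: [Date])
end

campaigns = load_seed_yaml(<<~SEED)
  - id: 1
    name: Test Campaign
    start_date: <%= Date.today << 4 %>
    end_date: <%= Date.today << 3 %>
SEED

campaigns.first["start_date"] # a Date four months before today
```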

Manually creating the standardized dataset also enables us to construct the dataset in such a way that edge cases that are not often seen in production data are exposed, so we can better test how the application will handle that data.

Importing the Standardized Data into the Application’s Database

A collection of YAML files by itself is useless to the application. We need some way of getting that data out of the YAML files and into the application’s database.

Each of our applications has a Rake task that reads the YAML files that contain the data it cares about, and imports that data into the database by creating an instance of the model object that represents the data.

This process can be fairly cumbersome. Since the data in the YAML files is stored in a format that is not specific to any particular application, attribute names will often need to be modified in order to match the application’s data model. It is also possible that attributes in the standardized dataset will need to be dropped, since they are unused by this particular application.

We solved this, and related issues, by building a small library that is responsible for reading the YAML files, and providing the Rake tasks with an easy to use API for building model objects from the data contained in the YAML file. The library provides methods to iterate over the standardized data, map attribute names, remove unused attributes, or find related data (perhaps in another data file). This API greatly simplifies the application’s Rake task.
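
The renaming and dropping of attributes is simple to picture with plain hashes. The sketch below is not the library's API; adapt_attributes and the attribute names are made up to show the idea.

```ruby
# Returns a copy of the attributes with unused keys removed and the
# remaining keys renamed to match the application's data model.
def adapt_attributes(attributes, rename: {}, drop: [])
  attributes
    .reject { |key, _value| drop.include?(key) }
    .transform_keys { |key| rename.fetch(key, key) }
end

shared = { "name" => "Test Campaign", "start_date" => "2013-01-01", "budget_code" => "X1" }

adapt_attributes(shared, rename: { "start_date" => "starts_on" }, drop: ["budget_code"])
# => { "name" => "Test Campaign", "starts_on" => "2013-01-01" }
```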

In the code below, we are iterating over all of the data in the campaigns.yml file, and creating an instance of our Campaign object with that data.

require 'application_seeds'

namespace :application_seeds do
  desc 'Dump the development database and load it with standardized application seed data'
  task :load, [:dataset] => ['db:drop', 'db:create', 'db:migrate', :environment] do |t, args|
    ApplicationSeeds.dataset = args[:dataset]

    seed_campaigns
  end

  # Create a Campaign for each entry in campaigns.yml
  def seed_campaigns
    ApplicationSeeds.campaigns.each do |id, attributes|
      ApplicationSeeds.create_object!(Campaign, id, attributes)
    end
  end
end

With the Rake task in place, getting all of the applications up and running with a standardized dataset is as simple as requiring the seed data library (and the data itself) in the application’s Gemfile, and running the Rake task to import the data.

The application_seeds library

The library we created to work with our standardized YAML files, called application_seeds, has been open sourced, and is available on GitHub at https://github.com/centro/application_seeds.

Drawbacks to this Approach

Making it easy to perform real service calls in development can be a double edged sword. On one hand, it greatly simplifies working with a SOA. On the other hand, it makes it much easier to perform service calls, and ignore the potential issues that come with calling out to a remote service. Service calls should be limited, and all calling code should be capable of handling all of the issues that may result from a call to a remote service (the service is unavailable, high latency, etc).
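
As a small illustration of defensive calling code, the sketch below wraps a service call with a time limit and a fallback value. call_with_fallback is a hypothetical helper, and the error classes it rescues are only the ones the standard library raises for timeouts and connection problems; a real HTTP client will have its own error classes to add.

```ruby
require "timeout"
require "socket"

# Runs the block (the remote call) with a time limit. If the call times out
# or the connection fails, the fallback value is returned instead.
def call_with_fallback(timeout_seconds:, fallback:)
  Timeout.timeout(timeout_seconds) { yield }
rescue Timeout::Error, SystemCallError, SocketError
  fallback
end

# Usage, with a hypothetical client object:
#
#   campaign = call_with_fallback(timeout_seconds: 2, fallback: nil) do
#     campaign_service.fetch(42)
#   end
```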

Another drawback is that test data is no substitute for real data. No matter how carefully it is constructed, the test dataset will never contain all of the possible combinations that a production dataset will have. So, it is still a good idea to test with a production dataset. However, that testing should be done in a secure environment, where there is no risk of the data being lost (like a staging environment).

Choosing the Right Host for Your Web Application

There is no shortage of options when it comes to web application hosting. However, the capabilities of these hosts can vary widely from one to the next. Therefore, it is important to understand what all of your options are, and which one is the best fit for your application.

Shared Hosting

Shared hosting (HostMonster and 1and1) typically gives you a user account on a shared server. The tools provided can vary quite a bit from host to host, but most provide a few ways to upload your site to the server (FTP, SSH, etc), limited control over the web server via .htaccess files, email accounts, the ability to create a specified number of databases (typically MySQL or Postgres), and support for a few programming languages (typically PHP, Python, Ruby).

Shared hosting is great for static web sites. It is very inexpensive, usually includes a free domain name, and most plans offer more disk space and bandwidth than you will ever need. It is also fairly easy to get up and running. Simply upload your website into a certain directory, and you’re good to go.

However, the restricted environment can be a hassle if you’re trying to deploy a more complex web application. You may not have the ability to install certain software. And, the fact that you are using a shared server means that resources such as memory and CPU will be shared by all of the users on that machine. The memory footprint of most web applications would likely exceed the per-account memory quota specified by a shared hosting provider.

Shared hosting is good for static sites and very basic dynamic sites (CGI scripts, embedded PHP, etc), but not much else.

Traditional VPS

Virtual Private Server hosting (Linode) gives you access to your own virtual server on shared hardware. After you select the operating system, amount of memory, and disk space you would like for your server, the VPS host will build your VPS from a default disk image it has created. When your VPS comes online, you will be able to access it via SSH.

You have complete control over the software you run on the server. This is an absolute must for the hosting of any non-trivial web application. VPS solutions are also reasonably priced for the amount of flexibility they provide. And while your VPS runs on the same physical machine as others, you retain exclusive control over your environment.

But, don’t forget, you’re running on shared hardware. While you have dedicated slices of certain resources, like memory and disk space, other resources like CPU time are up for grabs. If your VPS happens to reside on the same physical machine as another VPS that is using a large chunk of the CPU, your application may appear sluggish as a result. You are also responsible for securing your VPS. Even though your server is “virtual”, it is still a server on the internet, and can certainly be hacked. Therefore, it’s necessary that you have some system administration knowledge so that you can properly secure and administer your server. Scalability can also be a challenge with VPS hosting. If it turns out you need more memory or disk space for your server, you need to “resize” your server. Most providers can do this automatically, making sure that all of your data and services remain untouched. However, your server needs to be taken offline during part of the resizing process.

Traditional VPS hosting is good for web applications with low / predictable traffic.

Dedicated Server Hosting

Dedicated server hosting (razorservers) gives you just that: your own dedicated, internet connected server. Dedicated server hosting shares many of the same pros and cons as VPS hosting, with a few exceptions.

With dedicated hosting, there is nothing to share, so you don’t have to worry about a greedy neighbor sucking up all of the CPU. And, since you are the only resident on this machine, it is more secure than the multi-tenant options.

However, this control and security doesn’t come cheap. Dedicated servers can be quite expensive. And since you’re dealing with a physical machine, upgrading it (to increase the amount of RAM or disk space for example) is not as simple as it is with the VPS options. The hosting company actually needs to take the machine offline, and physically install the upgrade.

Dedicated server hosting is good for web applications with high / predictable traffic, or for services that need dedicated resources (like a database server).

Application Platform Hosting

Application platform hosting (Heroku, Google App Engine) doesn’t provide a specific machine to run your web application on. In fact, you don’t have access to the machine at all. Instead, you are provided with a set of tools that can be used to deploy and manage your application. You are also given a set of services that your application can use for data storage, email delivery, etc.

Application hosting platforms are a great way to deploy a new application. They take care of all of the deployment details, letting you focus on building the application. In addition, they usually provide a free plan that works well for low traffic applications. I am currently using one (Heroku) for AuroraAlarm. Scalability is also a big feature of these platforms (this is typically when you break out of the free plan). I can only speak for Heroku, since that is the only application platform host I’ve used, but extra instances of the application can be spun up with a simple command or action on the application’s control panel UI. These extra instances can be taken offline just as easily, letting you deal with spikes in traffic, only paying for the extra capacity when you need it.

While easy to get up and running, and capable of handling large spikes in traffic, application platform hosting is not for everyone. These hosts typically have certain restrictions on their environment in order to support quick scaling. Heroku, for example, doesn’t let you write to the file system. You also don’t have access to the machine, and cannot install any software that may be needed for your application. Heroku addresses this issue with a large selection of add-ons, but heavy add-on usage can quickly get expensive. In addition, heavy reliance on add-ons or other services only provided by the host can help lock you in to that particular vendor, making it difficult to switch hosts down the road.

Application platform hosting is good for getting web applications up and running fast with little effort, and good for web applications that have low average traffic but need to handle occasional bursts.

Cloud Hosting

Cloud hosting (Rackspace, Amazon EC2) is very similar to VPS hosting…but on steroids. With cloud hosting, you can easily set up and/or tear down individual VPS instances, letting you scale out or in as necessary. Cloud hosting companies usually provide a few different mechanisms for doing this, such as an account management dashboard or an API.

Cloud hosting provides all of the same benefits as VPS hosting, but with better scalability. If your application is designed to be deployed on more than one machine, then it is not too difficult to set it up so that you can easily scale out the parts of the application that get the most traffic, and have all of them point to the same set of shared services (the database, memcache server, etc). With cloud hosting providers, you only pay for what you use.

The biggest downside to cloud hosting is that it complicates your deployment. If you’re not careful, managing your production environment can quickly become a pain. Care must be taken to ensure that the same versions of the same software are running on all “identical” cloud boxes. Tools like capistrano can deploy your application to multiple servers at the same time. You must also ensure that everything is configured identically on all of the machines. Tools like puppet and chef can help manage this.

Cloud hosting is good for large applications that may consist of several sub applications (components) or may need to scale out to handle large amounts of traffic.

Hybrid Hosting

Hybrid Hosting (Rackspace RackConnect) is a mix of dedicated server hosting and cloud hosting, combining the two to offer the reliability of dedicated servers with the flexibility of cloud servers.

With hybrid hosting, you can keep your resource intensive services (like your database) on dedicated hardware to ensure they have all of the resources they need. Then, you can deploy the parts of your application that need to scale on demand to the hybrid hosting company’s cloud. All machines can reside in the same data center, allowing for low latency between the dedicated servers and the cloud servers.

However, with hybrid hosting, you get the downsides of dedicated hosting and cloud hosting as well. Like dedicated server hosting, hybrid hosting is not cheap. And, as with cloud hosting, deployment of your application becomes much more complicated. Even more so with hybrid hosting, because you need to determine which services get deployed to the dedicated servers, and which services get deployed to the cloud servers.

Hybrid hosting is good for large applications with services that require dedicated hardware AND the ability to scale out other services on demand to deal with large amounts of traffic.


There are a ton of options out there when it comes to web application hosting. Each type brings its own strengths and weaknesses to the table, so choosing the right one really depends on your application. Once you determine the right type of hosting, the next step is identifying a hosting company to work with. There is no shortage of options there as well, but we’ll save that discussion for another day…

Introducing AuroraAlarm

I just finished up work on my latest side project, AuroraAlarm.

AuroraAlarm is a FREE service that will notify you via SMS when conditions are optimal for viewing the Aurora Borealis in your area.

I have always enjoyed watching the weather and the stars. However, I’m not much of a night owl, and every attempt I’ve made to stay up late enough to see an Aurora after a solar event has failed, miserably. Time and time again I would fall asleep, and wake up to find that the conditions were great for viewing an Aurora the night before.

I wanted something that would wake me up if there was a possibility of seeing an Aurora. So, I created AuroraAlarm to do just that.

How it works

Every evening, AuroraAlarm checks to see if any solar events have occurred that could trigger a geomagnetic storm, which can produce an Aurora. If a solar event has occurred, it will send a text message notifying all users of the event, asking if they would like to receive a text message during the next few nights if conditions are optimal for viewing the Aurora.

If they indicate that they would like to be notified, AuroraAlarm will send them a text message at night if conditions are optimal for viewing the Aurora in their area.

What are optimal conditions?

  • Dark. Likely in the hours past midnight.
  • A Kp index strong enough to push the Aurora down to their geomagnetic latitude.
  • Clear skies.
  • A dark moon.

The goal

I have no idea if the aurora will be visible from where I live. Honestly, I think there may be a bit too much ambient light for me to see it. But, with the help of AuroraAlarm, at least I’ll be able to find out. And, I’m really not too far (about a 15 minute drive) from some REALLY dark areas. I certainly wouldn’t rule out making this short trip if it allowed me to see one of nature’s most fantastic displays.