Migrating Data – Rails Migrations or a Rake Task?

I’ve always thought that Migrations were one of Rails’ best features. In one of the very first projects I worked on as a n00b software engineer straight out of college, schema and data migrations were a very large, and painful, part of that project’s deployment process. Being able to specify how the schema should change, and being able to check in those changes along with the code that depends on those changes, was a huge step forward. The ability to specify those changes in an easy to understand DSL, and having the ability to run arbitrary code in the migration to help make those changes, was simply icing on the cake.

But, the question of how to best migrate data, not the schema, often comes up. Should data migrations be handled by Rails migrations as well? Should they be handled by a script, or a rake task instead? There are pros and cons to both approaches, and it depends on the situation.

Using Rails Migrations

One of the features of Rails Migrations is that the app will fail to start up if you have a migration that has not been run. This is good, because running your app with an out-of-date schema could lead to all sorts of problems. Because of this, most deployment processes will run the migrations after the code has been deployed, but before the server is started. If your data migration needs to be run before the app starts up, then you can use this to your advantage by using a migration to migrate your data. In addition, if your data migration can be reversed, then that code can be placed in the migration’s down method, fitting nicely into the “migration way” of doing things.

However, there are some pitfalls to this approach. It is bad practice to use code that exists elsewhere in your application inside of a Rails Migration. Application code evolves over time. Classes come and go, and their interfaces can change at any time. Migrations on the other hand are intended to be written once, and never touched again. You should not have to update a migration you wrote three months ago to account for the fact that one of your models no longer exists. So, if migrating your data requires the use of your models, or any other application code, it’s probably best that you not use a migration. But, if you can migrate your data by simply using SQL statements, then this is a perfectly valid approach.
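
For example, a data migration that can be performed with plain SQL, and reversed the same way, might look something like this sketch (the table and column names are made up for illustration):

class BackfillContactDisplayNames < ActiveRecord::Migration
  def up
    # Populate the new display_name column from existing data using plain SQL,
    # so the migration does not depend on any model classes.
    execute <<-SQL
      UPDATE contacts
      SET display_name = first_name || ' ' || last_name
      WHERE display_name IS NULL
    SQL
  end

  def down
    # Reversing the data migration is just another SQL statement.
    execute "UPDATE contacts SET display_name = NULL"
  end
end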

Using a Rake Task

Another way to migrate production data is to write a rake task to perform the migration. Using a rake task to perform the migration provides several clear advantages over using a Rails Migration.

First, you are free to use application code to help with the data migration. Since the rake task is essentially “throw away”, it can easily be deleted after it has been run in production. There is no need to change the rake task in response to application code changing. Should you ever need to view the rake task after it has been deleted, it is always available via your source control system. If you’d like to keep it around, that’s fine too. Since the rake task won’t be run again after it has been run in production, it can continue to reference classes that no longer exist, or use APIs that have changed.

Second, it is easier to perform test runs of the rake task. We usually wrap the rake task code within an ActiveRecord transaction, to ensure that if something bad happens, any changes will be rolled back. We can take advantage of this design by conditionally raising an exception at the end of the rake task, rolling back all of the changes, if we are in “dry run” mode (usually determined by an environment variable we pass to the task). This allows us to perform dry runs of the rake task, and use logging to see exactly what it will do, before allowing it to modify any data. With Rails Migrations, this is more difficult, as you need to roll back the migration as a separate step, and this is only possible for migrations that are reversible.
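
A stripped-down sketch of that pattern might look like the following (the task name, model, and columns are illustrative, not our actual task):

namespace :data do
  desc 'Backfill display names on contacts (set DRY_RUN=1 to preview)'
  task :backfill_display_names => :environment do
    dry_run = ENV['DRY_RUN'] == '1'

    ActiveRecord::Base.transaction do
      Contact.where(:display_name => nil).find_each do |contact|
        contact.update_attributes!(:display_name => "#{contact.first_name} #{contact.last_name}")
        puts "Updated contact ##{contact.id}"
      end

      # In dry run mode, raise to roll back everything the task just did.
      raise ActiveRecord::Rollback if dry_run
    end
  end
end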

Finally, you can easily run the rake task whenever you want. It does not need to happen as a part of the deployment, nor does it require you to alter your existing deployment process so that the code can be pushed without running the migrations or restarting the server. This gives you some flexibility, and lets you pick the best time to perform the migration.

Our Approach

Generally, we use Rails Migrations to migrate our application’s schema, and Rake tasks to migrate our production data. There have only been a few cases where we have used Rails Migrations to ensure that a data migration took place as a part of the deployment. In all other cases, using a Rake task provides us with more flexibility, and less maintenance.

Another Approach?

Do you have another approach for migrating production data? If so, we’d love to hear it. Feel free to drop it in the comments.

Note: This article has been cross-posted on the UrbanBound product blog.

Managing Development Data for a Service Oriented Architecture

A service oriented architecture (SOA) provides many benefits. It allows for better separation of responsibilities. It simplifies deployment by letting you only deploy the services that have changed. It also allows for better scalability, as you can scale out only the services that are being hit the hardest.

However, a SOA does come with some challenges. This blog post addresses one of those challenges: managing a common dataset for a SOA in a development environment.

The Problem

With most SOAs there tends to be some sharing of data between applications. It is common for one application to store a key to data which is owned by another application, so it can fetch more detailed information about that data when necessary. It is also possible that one application may store some data owned by another application locally, to avoid calling the remote service in certain scenarios. Either way, the point is that in most SOAs the applications are interconnected to some degree.
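
As a contrived sketch of that first case, Application A might store only a key to a record owned by Application B, and call out to Application B when it needs the details (the class and client below are hypothetical):

class Assignment < ActiveRecord::Base
  # remote_employee_id is a key into data owned by Application B.
  # EmployeeServiceClient is a hypothetical HTTP client for that service.
  def employee
    EmployeeServiceClient.fetch(remote_employee_id)
  end
end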

Problems arise with this type of architecture when you attempt to run an application in isolation with an interconnected dataset. At some point, Application A will need to communicate with Application B to perform some task. Unfortunately, simply starting up Application B so Application A can talk to it doesn’t necessarily solve your problem. If Application A is trying to fetch information from Application B by key, and Application B does not have that data (the application datasets are not “in sync”), then the call will obviously fail.

Stubbing Service Calls

Stubbing service calls is one way to approach this issue. If Application A stubs out all calls to Application B, and instead returns some canned response that fulfills Application A’s expectations, then there is no need to worry about making sure Application B’s data is in sync. In fact, Application B doesn’t even need to be running. This greatly simplifies your development environment.
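
As a minimal illustration, a test could stub the remote call with a canned response using something like the WebMock gem (the URL and payload here are invented):

require 'json'
require 'webmock'
include WebMock::API
WebMock.enable!

# Any GET for this employee returns canned JSON, so Application B
# never needs to be running (or have matching data) in development.
stub_request(:get, "http://app-b.example.com/api/employees/42").
  to_return(
    :status  => 200,
    :headers => { 'Content-Type' => 'application/json' },
    :body    => { :id => 42, :name => 'Test Employee' }.to_json
  )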

Stubbing service calls, however, is very difficult to implement successfully.

First, the stubbed data must meet the expectations of the calling application. In some cases, Application A will be requesting very specific data. Application A, for example, may very well expect the data coming back to contain certain elements or property values. So any stubbing mechanism must be smart enough to know what Application A is asking for, and know how to construct a response that will satisfy those expectations. In other words, the response simply can’t contain random data. This means the stubbing mechanism needs to be much more sophisticated (complex).

Calls that mutate data on the remote service are especially difficult to handle. What happens when the application tries to fetch information that it just changed via another service call? If the requests are stubbed, it may appear that the call to mutate the data had no effect. This could possibly lead to buggy behavior in the application.

Also, if you stub out everything, you’re not really testing the inter-application communication process. Since you’re never actually calling the service, stubs will completely hide any changes made to the API your application uses to communicate with the remote service. This could lead to surprises when actually running your code in an environment that makes a real service call.

Using Production Data

In order for service calls to work properly in a development environment, the services must be running with a common dataset. Most people I’ve spoken with accomplish this by downloading and installing each application’s production dataset for use in development. While this is by far the easiest way to get up and running with a common dataset, it comes with a very large risk.

Production datasets typically contain sensitive information. A lost laptop containing production data could easily turn into a public relations disaster for your company, and more importantly it could lead to severe problems for your customers. If you’ve ever had personal information lost by a third party then you know what this feels like. Even if your hard drive is encrypted, there is still a chance that a thief could gain access to the data (unless some sort of smart card or biometric authentication system is used). The best way to prevent sensitive information from being stolen is to keep it locked up and secure, on a production server.

Using Scrubbed Production Data

Using a production dataset that has been scrubbed of sensitive information is also an option. This approach will get you a standardized dataset, without the risk of potentially losing sensitive information (assuming your data scrubbing process is free of errors).

However, if your dataset is very large, this may not be a feasible option. My MacBook Pro has a 256GB SSD drive. I know of datasets that are considerably larger than 256GB. In addition, you have less control over what your test data looks like, which could make it harder to test certain scenarios.
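
When the dataset is small enough for this to work, the scrubbing step itself is usually just a task that overwrites the sensitive columns in a copy of the database before it is handed out (the model and column names below are assumptions):

namespace :db do
  desc 'Overwrite sensitive customer data with harmless values'
  task :scrub => :environment do
    # Intended to be run against a copy of the production database,
    # never against production itself.
    User.find_each do |user|
      user.update_attributes!(
        :email => "user#{user.id}@example.com",
        :name  => "Test User #{user.id}",
        :phone => '555-0100'
      )
    end
  end
end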

Creating a Standardized Dataset

The approach we’ve taken at Centro to address this issue is to create a common dataset that individual applications can use to populate their database. The common dataset consists of a series of YAML files, and is stored in a format that is not specific to any particular application. The YAML files are all stored together, with the thought that conflicting data is less likely to be introduced if all of the data lives in the same place.

The YAML files may also contain ERB snippets. We are currently using ERB snippets to specify dates.

- id: 1
  name: Test Campaign
  start_date: <%= 4.months.ago.to_date %>
  end_date: <%= 3.months.ago.to_date %>

Specifying relative dates using ERB, instead of hard coding them, gives us a dataset that will not grow stale with time. Simply re-seeding your database with the common dataset will give you a current dataset.

Manually creating the standardized dataset also enables us to construct the dataset in such a way that edge cases that are not often seen in production data are exposed, so we can better test how the application will handle that data.

Importing the Standardized Data into the Application’s Database

A collection of YAML files by itself is useless to the application. We need some way of getting that data out of the YAML files and into the application’s database.

Each of our applications has a Rake task that reads the YAML files that contain the data it cares about, and imports that data into the database by creating an instance of the model object that represents the data.

This process can be fairly cumbersome. Since the data in the YAML files is stored in a format that is not specific to any particular application, attribute names will often need to be modified in order to match the application’s data model. It is also possible that attributes in the standardized dataset will need to be dropped, since they are unused by this particular application.

We solved this, and related issues, by building a small library that is responsible for reading the YAML files, and providing the Rake tasks with an easy-to-use API for building model objects from the data contained in the YAML file. The library provides methods to iterate over the standardized data, map attribute names, remove unused attributes, or find related data (perhaps in another data file). This API greatly simplifies the application’s Rake task.

In the code below, we are iterating over all of the data in the campaigns.yml file, and creating an instance of our Campaign object with that data.

require 'application_seeds'

namespace :application_seeds do
  desc 'Dump the development database and load it with standardized application seed data'
  task :load, [:dataset] => ['db:drop', 'db:create', 'db:migrate', :environment] do |t, args|
    ApplicationSeeds.dataset = args[:dataset]
    seed_campaigns
  end

  def seed_campaigns
    ApplicationSeeds.campaigns.each do |id, attributes|
      ApplicationSeeds.create_object!(Campaign, id, attributes)
    end
  end
end

With the Rake task in place, getting all of the applications up and running with a standardized dataset is as simple as requiring the seed data library (and the data itself) in the application’s Gemfile, and running the Rake task to import the data.
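
With the task above, that boils down to a single command (the dataset name here is hypothetical):

rake application_seeds:load[shared_development_data]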

The application_seeds library

The library we created to work with our standardized YAML files, called application_seeds, has been open sourced, and is available on GitHub at https://github.com/centro/application_seeds.

Drawbacks to this Approach

Making it easy to perform real service calls in development can be a double edged sword. On one hand, it greatly simplifies working with a SOA. On the other hand, it makes it much easier to perform service calls, and ignore the potential issues that come with calling out to a remote service. Service calls should be limited, and all calling code should be capable of handling all of the issues that may result from a call to a remote service (the service is unavailable, high latency, etc).

Another drawback is that test data is no substitute for real data. No matter how carefully it is constructed, the test dataset will never contain all of the possible combinations that a production dataset will have. So, it is still a good idea to test with a production dataset. However, that testing should be done in a secure environment, where there is no risk of the data being lost (like a staging environment).

Introducing AuroraAlarm

I just finished up work on my latest side project, AuroraAlarm.

AuroraAlarm is a FREE service that will notify you via SMS when conditions are optimal for viewing the Aurora Borealis in your area.

I have always enjoyed watching the weather and the stars. However, I’m not much of a night owl, and every attempt I’ve made to stay up late enough to see an Aurora after a solar event has failed, miserably. Time and time again I would fall asleep, and wake up to find that the conditions were great for viewing an Aurora the night before.

I wanted something that would wake me up if there was a possibility of seeing an Aurora. So, I created AuroraAlarm to do just that.

How it works

Every evening, AuroraAlarm checks to see if any solar events have occurred that could trigger a geomagnetic storm, which can produce an Aurora. If a solar event has occurred, it will send a text message notifying all users of the event, asking if they would like to receive a text message during the next few nights if conditions are optimal for viewing the Aurora.

If they indicate that they would like to be notified, AuroraAlarm will send them a text message at night if conditions are optimal for viewing the Aurora in their area.

What are optimal conditions?

  • Dark. Likely in the hours past midnight.
  • A Kp index strong enough to push the Aurora down to their geomagnetic latitude.
  • Clear skies.
  • A dark moon.
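
Roughly speaking, the nightly decision boils down to a check like this (the helper methods and thresholds are hypothetical, not AuroraAlarm’s actual code):

# Should we send this user a text message tonight?
def notify?(user, forecast)
  dark_outside?(user.location) &&                    # hours past midnight
    forecast.kp_index >= user.required_kp_index &&   # storm reaches their geomagnetic latitude
    forecast.cloud_cover_percent < 25 &&             # clear skies
    forecast.moon_illumination_percent < 25          # a dark moon
end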

The goal

I have no idea if the aurora will be visible from where I live. Honestly, I think there may be a bit too much ambient light for me to see it. But, with the help of AuroraAlarm, at least I’ll be able to find out. And, I’m really not too far (about a 15 minute drive) from some REALLY dark areas. I certainly wouldn’t rule out making this short trip if it allowed me to see one of nature’s most fantastic displays.

ReloadablePath in Rails 3

A core feature of the Signal application is support for custom promotion web forms. Custom promotion web forms allow our customers to create custom web pages that will allow their customers to interact with their promotions via the web. Our customers currently use these web forms for sweepstakes entry, email/SMS subscription list opt-ins, online polls, and more.

One of the best things about custom promotion web forms is that they allow our customers to completely control what the web page looks like. The Signal application allows for the creation of a web form theme that can be used as a template for a given promotion. The specific promotion can then customize the web page further by specifying the copy that appears on the web page, the data attributes that should be collected, and more.

The web form themes are managed by the Signal application, and are saved to disk as a view (an ERB template) when created or updated. Our customers can edit these themes at any time. When a theme is updated, we need to tell Rails to clear the cache for these specific views, so our customer will see their changes the next time they visit a web page that uses the updated theme.

In Rails 2, this was done using ReloadablePath.

class SomeController < ApplicationController
  prepend_view_path ActionView::ReloadableTemplate::ReloadablePath.new(
    "/path/to/my/reloadable/views")
end

However, ReloadablePath is no more in Rails 3. So, we needed to find a new solution to this problem.

Rails 3 introduced the concept of a Resolver, which is responsible for finding, loading, and caching views. Rails 3 also comes with a FileSystemResolver that the framework uses to find and load view templates that are stored on the file system.

FileSystemResolver is very close to what we want. However, we need the ability to clear the view cache whenever one of the web form themes has been updated. Thankfully, this was fairly easy to do by creating a new Resolver that extends FileSystemResolver, which is capable of clearing the view cache if it determines that it needs to be cleared.

Looking at the code for the Resolver class, you can see that it checks the view cache in the find_all method. If it does not have the particular view cached, it will proceed to load it using the proper Resolver. So, we simply have to override find_all to clear the cache if necessary before delegating the work to the super class to find, load, and cache the view.

class ReloadablePathResolver < ActionView::FileSystemResolver

  def initialize
    super("/path/to/my/reloadable/views")
  end

  def find_all(*args)
    clear_cache_if_necessary
    super
  end

  def self.cache_key
    ActiveSupport::Cache.expand_cache_key("updated_at",
      "reloadable_templates")
  end

  private

  def clear_cache_if_necessary
    last_updated = Rails.cache.fetch(ReloadablePathResolver.cache_key) { Time.now }

    if @cache_last_updated.nil? || 
        @cache_last_updated < last_updated
      Rails.logger.info "Reloading reloadable templates"
      clear_cache
      @cache_last_updated = last_updated
    end
  end

end

Since we're running multiple processes in production, we need a way to signal all processes that their view caches should be cleared. So, we're using memcache to store the time that the web form themes were last updated. Each process then checks that timestamp against the time that particular process last updated its cache. If the timestamp in memcache is more recent, then the ReloadablePathResolver will clear the cache using the clear_cache method it inherited from Resolver.

Next, we need to add some code that will update memcache any time a web form theme has been updated and saved to disk.

class WebFormTheme < ActiveRecord::Base
  after_save :update_cache_timestamp

  private

  def update_cache_timestamp
    Rails.cache.write(ReloadablePathResolver.cache_key, Time.now)
  end
end

The final step is to simply prepend the view path with the new ReloadablePathResolver.

class SomeController < ApplicationController
  prepend_view_path ReloadablePathResolver.new
end

Getting ‘rake test’ Running with Rails 3 and MongoDB

I’ve recently started a re-write of my Addressbook application. Addressbook was my first Rails application, as well as my first full-fledged Ruby application…and boy does it show! So, trust me, the re-write is justified :)

I saw the re-write as a good opportunity to get more familiar with Rails 3 and MongoDB. After getting Rails 3 set up to use MongoDB via MongoMapper (Ben Scofield’s setup script gets you most of the way there), I quickly realized that the tests would not run.

jwood-mbp:addressbook jwood$ rake test
(in /Users/jwood/dev/personal/addressbook)
Errors running test:units, test:functionals, test:integration!

jwood-mbp:addressbook jwood$ rake test:units
(in /Users/jwood/dev/personal/addressbook)
rake aborted!
Don't know how to build task 'db:test:prepare'

Some Googling revealed this post by Yehuda Katz, in which he says:

Additionally, users can completely remove the ActiveRecord gem, and be rid of the generators, rake tasks and other peripheral elements. An example: if DataMapper implements a db:test:prepare rake task, a Rails developer can replace ActiveRecord with DataMapper, and the test:units rake task will use DataMapper’s new tasks.

We’re not using DataMapper here, but the same thing applies to MongoMapper. So, all that is needed is to implement a db:test:prepare task, which I did in lib/tasks/mongo.rake

namespace :db do
  namespace :test do
    task :prepare do
      # Stub out for MongoDB
    end
  end
end

As of right now, the task does nothing. I may end up adding some functionality to it as I work on the project. We’ll see.

Running the tests again, I can see that was all that was needed to get going.

jwood-mbp:addressbook jwood$ rake test:units
(in /Users/jwood/dev/personal/addressbook)
Loaded suite /Users/jwood/.rvm/gems/ruby-1.9.1-p378@rails3/gems/
rake-0.8.7/lib/rake/rake_test_loader
Started
...
Finished in 0.062854 seconds.

3 tests, 3 assertions, 0 failures, 0 errors, 0 skips