Rails migrations for schema vs. data


#1

Hi everyone!

I gave a talk at the #vanruby meetup last month where I shared some considerations for handling database migrations for a production Ruby on Rails stack (without bringing everything down in the process). The following sequence was something I talked about. It’s nothing terribly novel, but it’s part of the techniques we use at Guestfolio:

  1. Add new empty columns or tables first. Deploy that, so the storage is available before any new code that depends on it. (See the sketch after this list.)
  2. Deploy new code that writes to your new fields, so new records start creating correctly.
  3. Backfill data into the new fields, so older records are consistent too.
  4. Release new code that exposes/consumes/reads from the new fields.
  5. Clean up: remove any temporary triggers/duplicate writes, old tables/columns, etc.
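As an example of step 1, the schema part is just an ordinary migration that ships ahead of any code using the new storage. A minimal sketch, reusing the reservations/cancelled_at example from my slides below:

class AddCancelledAtToReservations < ActiveRecord::Migration
  def change
    # New, nullable, empty column: nothing reads or writes it yet,
    # so this deploys safely on its own.
    add_column :reservations, :cancelled_at, :datetime
  end
end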

I received some great questions about this, especially about how we implement item 3 (see the example from my slides below). There’s some common wisdom against performing data changes within the sequence of migrations: that migrations are only for schema changes, not data. I’ve read the same advice elsewhere online too, but after some back-and-forth between different approaches in our app over the years, we determined that changes to the representation of data need to be handled in conjunction with schema changes, so using the same mechanism for both makes the most sense.

I didn’t get a chance to unpack this discussion further at the meetup, but I’m curious to hear more, since everyone’s experience is different. Here is the list of “pros” I would give for using migrations to manage data migrations:

Guaranteed to be applied exactly once

ActiveRecord migrations keep a log of what has been applied in the schema_migrations table. So, doing systemic/structural data changes using an out-of-band method (rake task etc.) means it needs to be tracked separately: who will remember to run this after deployments, or who will know where it’s already been run?

Rails will tell you when you have pending migrations, but if the data change lives in a separate rake task, nothing will tell you that the data isn’t yet consistent with the code that expects to read it.
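For example, rake db:migrate:status lists every migration, schema and data steps alike, as up or down. The output looks roughly like this (hypothetical migrations continuing the example below):

$ rake db:migrate:status

 Status   Migration ID    Migration Name
--------------------------------------------------
   up     20150301120000  Add cancelled at to reservations
  down    20150310090000  Backfill cancelled at column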

Guaranteed to be applied in sequence

Data changes rely on prior schema changes being applied, and using numbered migrations means the prerequisites get applied first. Later migration steps could change or remove the older storage locations, and this must not happen before the data has been converted/moved.
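Concretely, with timestamped filenames (hypothetical names continuing the same example), rake db:migrate can never run the backfill before the column exists, nor the cleanup before the backfill:

db/migrate/20150301120000_add_cancelled_at_to_reservations.rb
db/migrate/20150310090000_backfill_cancelled_at_column.rb
db/migrate/20150420150000_remove_legacy_cancelled_flag.rb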

Uses your regular database migration process

A standard command like rake db:migrate can be launched from a CI/CD tool or whatever you normally use (Capistrano, etc.). Creating arbitrary one-off commands for data changes probably means opening production console access or taking extra manual steps to complete them. (Less secure, more opportunities for human error.)
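For instance, if you deploy with Capistrano, the capistrano-rails gem hooks migrations into every deploy; one require is enough (a sketch, assuming Capistrano 3):

# Capfile
require "capistrano/rails/migrations"  # runs `rake db:migrate` during deploy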

Can easily use migration-specific model versions

One thing I absolutely agree doesn’t belong in migrations, and would strongly advise against, is accessing application models. Your application code gets updated regularly, but migrations are fixed in time against a version of the database. (e.g. a validation could later be added for a column that doesn’t exist yet at an older migration’s point in history, which would prevent that migration from ever running from scratch.) The database schema also changes before/after each migration step, but models don’t necessarily get reloaded, and the new schema might require different behaviour on the model class, so the current schema version and the current model version can’t be relied on to agree.

However, every Rails migration is its own Ruby class, and thus its own namespace, and we can take advantage of that by defining temporary case-specific models under it, like this:

class BackfillCancelledAtColumn < ActiveRecord::Migration

  class Reservation < ActiveRecord::Base
    scope :old_status, -> { where(status: "cancelled", cancelled_at: nil) }
  end

  def up
    Reservation.old_status.find_each do |r|
      r.cancelled_at = r.updated_at
      # ...
      r.save!  # safe to call here: this minimal model has no validations or callbacks
    end
  end

end

This version of the model is not the same class as the full ::Reservation from app/models, so it doesn’t have all of its unwanted baggage (associations, validations, callbacks, etc.), and is freshly defined from the state of the database at this point in time. This version also doesn’t know about similar model versions from other migrations.
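A quick console check makes the distinction visible (hypothetical session, assuming the migration file is loaded):

BackfillCancelledAtColumn::Reservation == ::Reservation  # => false
BackfillCancelledAtColumn::Reservation.table_name        # => "reservations"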

Defining only the subset of models we need like this gives us all the helpful ActiveRecord features like find_each / find_in_batches for traversing big tables, which matters when the data change can’t just be done with a basic connection.execute SQL statement.
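For example, when the new value can be computed in SQL, a batched variant of the backfill above (hypothetical, inside the same migration) issues one UPDATE per batch instead of one per record:

Reservation.old_status.find_in_batches(batch_size: 1000) do |batch|
  # Column-to-column assignment done in SQL for the whole batch:
  Reservation.where(id: batch.map(&:id)).update_all("cancelled_at = updated_at")
end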

Writing separate rake tasks doesn’t provide the same version-specific isolation either, so with this approach we no longer run into model conflicts when handling data in migrations. Was conflicting model functionality the main reason why independent rake tasks came to be preferred over migrations? I’m still curious to hear what there is to recommend about other methods of migrating data independently from schema changes, and how you manage them.

Let me know what you think!


#2

Thanks for writing this up, andrew!

I agree 100% with you. Things are way easier when data migrations are part of the Rails migrations. Devs just have to run bundle && rake db:migrate to keep their environment up to date. If you have separate rake tasks or scripts, then you have to tell everyone that the next time they pull changes from your branch / master they’ll have to run an extra command to get things in order… and you can be certain that half the team will ask why their environment is broken at some point. :slight_smile:

Now, an alternative to redefining scopes is to just use the Rails API - well, basically copy the scope’s code over to your migration. And once everybody has run the migration, you can just delete all that code to ensure that you can still run all migrations from scratch.
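If I read you right, that looks something like this (the same hypothetical backfill as above, with the scope’s query inlined):

class BackfillCancelledAtColumn < ActiveRecord::Migration
  class Reservation < ActiveRecord::Base; end

  def up
    # The scope's body, copied straight into the migration:
    Reservation.where(status: "cancelled", cancelled_at: nil).find_each do |r|
      r.update_column(:cancelled_at, r.updated_at)
    end
  end
end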

At Kickstarter, some tables have millions of entries, so running data migrations as part of the Rails migrations is not really an option: they can take hours to days to run, and they can impact performance dramatically. We have a simple script that wraps a data migration, makes it run slowly (e.g. limiting it to 10 updates per second), tracks progress, and enables us to restart the migration where it stopped. It’s not fancy, we just run it in a screen session and hope that nobody restarts the server. :wink: But it does the job!
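It’s roughly in the spirit of this sketch (not our actual script; the checkpoint file, batch size and rate limit are just illustrative):

# Run with `rails runner` (e.g. inside that screen session).
CHECKPOINT = "tmp/backfill_cancelled_at.last_id"

last_id = File.exist?(CHECKPOINT) ? File.read(CHECKPOINT).to_i : 0

loop do
  batch = Reservation.where("id > ?", last_id)
                     .where(status: "cancelled", cancelled_at: nil)
                     .order(:id).limit(10).to_a
  break if batch.empty?

  batch.each { |r| r.update_column(:cancelled_at, r.updated_at) }

  last_id = batch.last.id
  File.write(CHECKPOINT, last_id.to_s)  # track progress so a restart can resume
  sleep 1                               # throttle: at most ~10 updates per second
end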