Vancouver Tech Community

Rails migrations for schema vs. data


Hi everyone!

I gave a talk at the #vanruby meetup last month where I shared some considerations for handling database migrations on a production Ruby on Rails stack (without bringing everything down in the process). One of the techniques I covered was the following deployment sequence. It's nothing terribly novel, but it's part of the approach we use at Guestfolio:

  1. Add new empty columns or tables first. Deploy that, so the storage is available before any new code that depends on it.
  2. Deploy new code that writes to your new fields, so new records are created correctly.
  3. Backfill data into the new fields, so older records are consistent too.
  4. Release new code that exposes/consumes/reads from the new fields.
  5. Clean up: remove any temporary triggers/duplicate writes, old tables/columns, etc.

I received some great questions about this, especially about how we implement step 3 (see the example from my slides below). There's some common wisdom against performing data changes within the sequence of migrations: that migrations are only for schema changes, not data. I've read this elsewhere online too, but after some back-and-forth between different approaches in our app over the years, we determined that changes to the representation of data need to be handled in conjunction with schema changes, so using the same mechanism for both makes the most sense.

I didn’t get a chance to unpack this discussion further at the meetup, but I’m curious to hear more, since everyone’s experience is different. Here is the list of “pros” I would give for using migrations to manage data migrations:

Guaranteed to be applied exactly once

ActiveRecord migrations keep a log of what has been applied in the schema_migrations table. Doing systemic/structural data changes through an out-of-band method (a rake task, etc.) means they need to be tracked separately: who will remember to run this after deployment, and who will know where it has already been run?

Rails will tell you that you have pending migrations, but if you use a separate rake task, nothing will tell you that you haven't finished making the data consistent with the code that expects to read it.

Guaranteed to be applied in sequence

Data changes rely on prior schema changes being applied, and using numbered migrations means the prerequisites get applied first. Later migration steps could change or remove the older storage locations, and this must not happen before the data has been converted/moved.
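
The bookkeeping behind both of these guarantees is simple: compare the migration versions that exist against the versions already recorded, and apply the difference in sorted order. A rough plain-Ruby sketch of that idea (hypothetical version numbers, no database involved):

```ruby
# Sketch of the mechanism behind "exactly once" and "in sequence":
# versions on disk minus versions already recorded, applied in sorted order.
all_versions     = %w[20240103150000 20240101120000 20240102090000] # migration files
applied_versions = %w[20240101120000]                               # schema_migrations rows

pending = (all_versions - applied_versions).sort # prerequisites always come first
pending.each do |version|
  # ... run the migration for this version (schema change, then dependent data change) ...
  applied_versions << version # recorded, so it never runs twice
end
```

Because the data-change migration sorts after the schema migration that created its columns, it can never run first, and because its version gets recorded, it can never run again.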

Uses your regular database migration process

A standard command like rake db:migrate can be launched from a CI/CD tool or whatever you normally use (Capistrano, etc.). Creating arbitrary commands for data changes probably means opening production console access or taking other one-off steps to complete the change. (Less secure, more opportunities for human error.)

Can easily use migration-specific model versions

One thing I absolutely agree doesn't belong in migrations, and would strongly advise against, is accessing application models. Your application code gets updated regularly, but migrations are fixed in time against a version of the database. (For example, a validation could later be added for a column that didn't exist at the time of an earlier migration, which would now prevent that migration from running.) The database schema also changes before/after each migration step, but models don't necessarily get reloaded. The new schema might need to redefine some behaviour on the model class, so the current schema version and the current model version will never reliably agree.

However, every Rails migration is its own ruby module, and we can take advantage of that by defining temporary case-specific models under its namespace, like this:

class BackfillCancelledAtColumn < ActiveRecord::Migration
  # Migration-local model: defined inside the migration's namespace,
  # so it carries none of the app model's associations/validations/callbacks
  class Reservation < ActiveRecord::Base
    scope :old_status, -> { where(status: "cancelled", cancelled_at: nil) }
  end

  def up
    Reservation.old_status.find_each do |r|
      r.cancelled_at = r.updated_at
      # ...
      r.save!
    end
  end
end

This version of the model is not the same class as the full ::Reservation from app/models, so it doesn’t have all of its unwanted baggage (associations, validations, callbacks, etc.), and is freshly defined from the state of the database at this point in time. This version also doesn’t know about similar model versions from other migrations.

Defining only the subset of models we need like this gives us all the helpful ActiveRecord features, like find_each / find_in_batches for traversing big tables. This is helpful when the data change can't just be done with a basic connection.execute SQL statement.
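
As an aside, the reason find_each is safe on big tables is that it pages through rows by primary key rather than loading everything at once. A rough plain-Ruby sketch of that keyset-pagination idea, with an in-memory array standing in for the table (hypothetical data, no database):

```ruby
# Sketch of keyset pagination, the idea behind ActiveRecord's find_each:
# fetch rows in id order, batch_size at a time, remembering the last id seen.
ROWS = (1..10).map { |i| { id: i, status: "cancelled" } } # stand-in for a big table

def each_batch(rows, batch_size:)
  last_id = 0
  loop do
    # the equivalent of: WHERE id > last_id ORDER BY id LIMIT batch_size
    batch = rows.select { |r| r[:id] > last_id }.first(batch_size)
    break if batch.empty?
    yield batch
    last_id = batch.last[:id]
  end
end

batches = []
each_batch(ROWS, batch_size: 4) { |b| batches << b.map { |r| r[:id] } }
# batches => [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```

Only one batch is ever held in memory, which is why this pattern works even when the table is far too big to load at once.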

Writing separate rake tasks doesn't provide the same version-specific isolation either, so with this approach we no longer run into model conflicts when handling data in migrations. Was conflicting model functionality the main reason independent rake tasks came to be preferred over migrations? I'm still curious to hear what's to recommend about other methods for migrating data independently from schema changes, and how you manage it.

Let me know what you think!


Thanks for writing this up, Andrew!

I agree 100% with you. Things are way easier when data migrations are part of the Rails migrations. Devs just have to run bundle && rake db:migrate to keep their environment up to date. If you have separate rake tasks or scripts, then you have to tell everyone that the next time they pull changes from your branch / master they'll have to run an extra command to get things in order… and you can be certain that half of the team will ask why their environment is broken at some point. :slight_smile:

Now, an alternative to redefining scopes is to just use the Rails API directly: basically, copy the scope's code over into your migration. And once everybody has run the migration, you can just delete all that code to ensure that you can still run all migrations from scratch.

At Kickstarter, some tables have millions of entries, so running data migrations as part of the Rails migrations is not really an option: they can take hours to days to run and can impact performance dramatically. We have a simple script that wraps data migrations, makes them run slowly (e.g. limited to 10 updates per second), tracks progress, and lets us restart the migration where it stopped. It's not fancy, we just run it in a screen session and hope that nobody restarts the server. :wink: But it does the job!
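
To make the shape of that concrete, here is a rough plain-Ruby sketch of a throttled, restartable backfill loop. Everything here is hypothetical (the checkpoint file, the rate, the ids); the real per-row update would go where the block is called:

```ruby
require "tmpdir"

# Sketch of a throttled, restartable backfill (no database involved):
# skip everything at or below the checkpoint, update in small batches,
# record progress after each batch, and sleep to cap the update rate.
def backfill(ids, checkpoint_path, per_second: 10)
  last_done = File.exist?(checkpoint_path) ? File.read(checkpoint_path).to_i : 0
  ids.select { |id| id > last_done }.each_slice(per_second) do |batch|
    batch.each { |id| yield id }                 # the actual row update goes here
    File.write(checkpoint_path, batch.last.to_s) # progress survives a restart
    sleep 1 if batch.size == per_second          # throttle: ~per_second updates/sec
  end
end

# Simulate resuming a run that had already processed ids 1..3:
path = File.join(Dir.tmpdir, "backfill_checkpoint_demo")
File.write(path, "3")
processed = []
backfill((1..6).to_a, path, per_second: 2) { |id| processed << id }
# processed => [4, 5, 6]; the checkpoint file now contains "6"
```

If the process dies mid-run, rerunning it picks up from the last recorded id instead of starting over, which is what makes the "hope nobody restarts the server" failure mode survivable.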