Let constraints drive consistency

TL;DR #

About me #

I've been working with a large production database containing hundreds of millions of rows for more than 3 years now, and the biggest pain point has always been an inconsistency in the data shape:

So how do we prevent this? What should you do if you already have this? How can you gradually enforce data consistency in a large production database? That's what we'll cover in this post.

When adding a new column (when it's not too late): #

Always add non-null db constraints (unless completely infeasible) #

Non-null constraints ensure that critical fields are never left empty, improving data integrity and preventing unexpected errors. For example, a nullable boolean introduces a third state, so always enforce non-null constraints at the database level when possible.

Think of ON DELETE policies #

Make sure to make a conscious decision about when and how your records should be deleted if the associated records are deleted. For example, if you have a user, and a user has many posts, what should happen to the posts when the user is deleted? Should they be deleted as well? Should they be orphaned? Should they be reassigned to another user?

Make sure time is always UTC #

Always store time in UTC. When fetching the timestamp for comparison, it will be fetched in the server's timezone, not UTC, so be aware of that.

Make sure dates and times are always of the same data type #

Pick the most suitable type and use it everywhere. And by everywhere, I mean everywhere. I personally prefer using timestamp for everything.

Do not prematurely create indexes (unless absolutely necessary) #

Due to performance, storage, and complexity overheads, adding an index should always be a deliberate and well-considered decision. I prefer adding an index only when querying by that column becomes a bottleneck.

Use timestamps instead of (or alongside) status flags #

Imagine a post that can be in 3 states: draft, published, archived. Instead of using a status flag, use timestamps for each state, and an XOR constraint to ensure only one of them is set at a time.

id, title, content, draft_at, published_at, archived_at

You can also add a status flag for convenience, with the necessary integrity constraints added to it as well.

Avoid using Ecto's embedded schemas for data managed by the application (opinionated) #

Use embedded schemas only when storing raw data whose schema isn’t controlled by your application. For example, a correctly used embedded schema might represent a response from a 3rd party API. Embedded schemas are not typed by Postgres, and default values are stored in a schema struct, so when filtering on the DB level, they're not applied. This can lead to hard-to-catch inconsistencies, so it’s better to omit embeds when possible.

Additionally, it's hard to create constraints on a JSONB column, so if possible, don't use it.

Mirror validations in database #

When using Ecto's unique_constraint or validate_***** in changesets, ensure that the same constraints exist in the database. For bulk updates or inserts, changeset-level validations aren't applied, so always begin with database-level validations and either handle errors gracefully with unique_constraint or duplicate them using validate_**** inside the application.

When creating a new table: #

Consider using soft deletes by adding deleted_at column #

Use soft deletes (marking records as deleted without actually removing them) for audit trails and to maintain data consistency.

When updating an existing table: #

So here's the hardest part: you already have an inconsistent table. What should you do? The good news is you've accepted the inconsistency and want to fight it — that's the first step! The rest is easier:

  1. Identify all inconsistent records
  2. Find out why they're inconsistent and FIX all places (either code or processes) that create them
  3. Fix the inconsistency by backfilling the data
  4. Add a constraint that would prevent this inconsistency from happening again. (See the previous section for ideas)

Since you've made it this far, sharing this article on your favorite social media network would be highly appreciated 💖! For feedback, please ping me on Twitter.

Published