TL;DR: What’s not to like?
Orthogonal database schemas prevent hard-to-find bugs by:
- Simplifying your backend code, and
- Making it mathematically impossible to screw up your data!
The gist of the principle of orthogonality is that each piece of information should occur in the database exactly once. Let’s use an example to see why this is a good thing.
Players and teams
For this hypothetical example, let’s say we have a Players table and a Teams table. A player can belong to multiple teams, and each team has multiple players. We can express this using a join table. Each record in the join table simply links a player record to a team record.
Next we decide to tighten up data integrity by setting a unique constraint on (player_id, team_id)
to make sure the same relationship can’t be entered twice. Additionally, we make both player_id
and team_id
not nullable.
Our data is orthogonal. Each player-team relationship is expressed only once: by a record in the PlayerTeams table. Because of the constraints we have imposed, it is impossible for any data duplication to occur.
How it goes wrong
Now we refine our requirements a little. Although a player can belong to multiple teams, we also want to record which of these teams is the primary team for that player. In other words, which team do they spend most of their time playing on?
Perhaps we decide that an easy way to do this is to create an extra column on the Players table called primary_team_id
. This column has a foreign key constraint which ensures it links to a record in the Teams table. We implement this quickly and life goes on.
Some time later, we get reports from some users that their dashboards don’t make sense. After some investigation, we realise that they have left the team that was their primary team. Usually, we have cleanup code in our backend application that resets their primary team when this happens, but these users presented a particular edge case that we hadn’t foreseen and didn’t make test coverage for. Our cleanup code hasn’t run, and now we have inconsistent data.
Now we not only need to fix the bug; we also have to trawl through the database fixing all the inconsistent data!
Prevention is better than cure
How could we have prevented this? Well, we could have spent more time in the development phase, thinking of all the edge cases that might occur and ensuring they are all handled and included in our test coverage.
But we are human. We aren’t going to think of everything. What if we had a database schema that made inconsistent data not just unlikely, but actually impossible?
When we introduced primary teams, we made the data non-orthogonal. There were now two separate associations linking the Players table to the Teams table. The bug happened because one of these associations had been updated while the other was forgotten.
Let’s remove that primary_team_id
column from the Players table. Instead, since we know all a player’s teams from the records in the PlayerTeams join table, we just need a way of marking one of those records as the primary one for that player.
Let’s add a new column to that join table. As I’m using PostGreSQL, I can make this a boolean column. Let’s call it primary
. If it’s TRUE
, then the team specified in that record is the primary team for that player. Otherwise, it’s just one of the teams that the player is a member of.
We’re not done yet. This schema doesn’t prevent a player from having multiple teams, with all of them marked as primary. One way of fixing this is to use a unique constraint on (player_id, primary)
. Nulls are counted as distinct so, if we leave this column blank unless the record is a primary team, then the constraint will allow a player to have any number of teams, but only one primary. As a final touch, we could use a check constraint to disallow a value of FALSE
in this column.
Our data is orthogonal once again. The primary team is expressed in the same record as the player-team relationship it refers to.
How is this better?
Now that primary
is just a flag attached to a PlayerTeams record, it is impossible for the primary team to point to a team that the player doesn’t belong to.
If any weird edge cases cause a bug in our application code, we will get an immediate and obvious failure as the database rejects the update. We only need to fix the bug; not months of bad data that have crept in unnoticed.
Also, remember that cleanup code I mentioned earlier?
You can delete it now.