Earlier this year, we took Notion down for five minutes of scheduled maintenance. While our announcement gestured at “increased stability and performance,” behind the scenes was the culmination of months of focused, urgent teamwork: sharding Notion’s PostgreSQL monolith into a horizontally-partitioned database fleet.
If you’ve never sharded a database before, here’s the idea: instead of vertically scaling a database with progressively heftier instances, horizontally scale by partitioning data across multiple databases. Now you can easily spin up additional hosts to accommodate growth. Unfortunately, now your data is in multiple places, so you need to design a system that maximizes performance and consistency in a distributed setting.
We decided to implement our own partitioning scheme and route queries from application logic, an approach known as application-level sharding. During our initial research, we also considered packaged sharding/clustering solutions such as Citus for Postgres or Vitess for MySQL.
Difficult lessons learned
While much of the sharding project captured Notion’s engineering team at its best, there were many decisions we would reconsider in hindsight. Here are a few examples: