Referential Integrity With Sharded Databases

April 18, 2022

One of the database providers I use for my projects is PlanetScale, and unfortunately they're able to support ACID-compliance and referential integrity, both of which are important features of relational databases. In this post, we'll be diving into the architecture of sharded databases and trying to understand why they can't support these features of relational databases.

Background

My initial question was why it says that it's impossible to scale a PlanetScale database with referential integrity as in their documentation it states that:

The way FOREIGN KEY constraints are implemented in MySQL (or, rather, in the InnoDB storage engine) interferes with Online DDL operations. Learn more in this Vitess blog post.

Limited to single MySQL server scope, FOREIGN KEY constraints are impossible to maintain once your data grows and is split over multiple database servers. This typically happens when you introduce functional partitioning/sharding and/or horizontal sharding.

This led me to think: do FOREIGN KEY constraints affect scalability in general? and if so, how?

I do think it's important to realize that SQL table joins are pretty costly, but to my knowledge it wasn't affected much by referential integrity? Now, if we're doing something like data analysis, obviously we don't have a need for referential integrity as we'd just want to dump our data into a single table, but PlanetScale and Vitess boast about being used by big web applications such as YouTube.

This led me to be confused as to why they'd drop the FOREIGN KEY constraint as databases such as CockroachDB and Spanner still maintain referential integrity along with being scalable.

What is Referential Integrity?

Let's start with the basics, in case you're new. I'm guessing most people reading this post have a fair idea of what they're talking about, but I'll explain as a formality. In simple words, a FOREIGN KEY constraint is a database key which we can use to create relations between two different tables by referencing a column, or a set of columns. Referential integrity simply refers to the state of the database in which all values of all keys are valid.

Why is it Important?

Now that we have a bit of an idea on what they are, let's skip to the second part: why are they important?

Referential integrity is important as it keeps you from introducing new errors into your database. It is a feature often provided by relational databases preventing users or applications from entering inconsistent data into the database. This leads to improved data quality, faster development, much fewer bugs, and consistency across your application.

Why Doesn't Vitess Have it?

So, to understand why Vitess is unable to support referential integrity, we have to take a dive into the architecture of the database. Vitess is a sharded non-ACID SQL database, not a distributed ACID SQL database like the examples given above. Is this a bad thing? Not exactly. They each come with their own upsides and downsides, however a downside of this sharded structure is that they're unable to support referential integrity.

Now, you must be wondering what those terms are. Let me break them down for you: ACID is an acronym of Atomicity, Consistency, Isolation and Durability.

Here, atomicity refers to an action either completing or failing entirely — no partial completion of a transaction. Consistency refers to the transaction leaving the database in a valid state. Isolation simply means that two transactions are executed without any interference with each other, and durability means the changes of the transaction are saved. For distributed databases, something like this is important because of it's architecture revolving around globally distributed nodes, leading to them providing you with the highest level of isolation called serialisable isolation.

At a particular scale, this isolation might have some problems that create hot-spots in your database where rows/pages are heavily contended, however these databases have been used in scale at places such as DoorDash and they're able to handle an enormous amount of scale. These hot spots will create problems at any isolation level, but it's more pronounced at higher isolation levels like serialisable.

A shard is a horizontal partition of data in a database, and each shard is held on a separate database server instance to spread the load. So when we refer to a database which is sharded, we're talking about something like this. Now as I said earlier, Vitess is a sharded non-ACID SQL database, which basically means that it DOES NOT guarantee ACID properties of transactions.

Why Drop It?

Well, the problem starts when you have a MySQL database with a well-defined schema, and your service becomes popular with the problem of too many reads hitting the database. What most people do here is they start caching frequently executed queries, but the reads are no longer ACID-ic.

Along with too many reads, having an excess amount of writes to your database is a serious problem which many might face. Let's say we're ready to set fire to our pockets — we can vertically scale, adding more RAM, a 16-core processor and loads of really fast solid-state drives.

We of course still have the problem of SQL table joins increasing in complexity, so you start denormalising to avoid joins between tables.

I gave a talk at the Prisma Meetup a while ago, where I explained the fundamentals of designing a relational database. A topic I covered here was denormalization, if you're interested, be sure to check this out.

But denormalization is basically the process where you add redundant data to tables in your database, which improves performance on the cost of disk space as you're no longer using CPU power for joins. While denormalization improves the speed of reads, it's important to realize that it does make writes slower.

Nevertheless, despite all of this, our database is still slow, so we move database computations onto the client, for instance generating a UUID or assigning a date.

Even after all of this, queries will still be slow — so we keep the result of the most queried data ready in a process known as database materialization. Now reads might be faster, but writes are getting slower by the day. The only logical situation now is to drop secondary indexes.

So at this point, our database has

No ACID properties because of caching
No normalized schema
No triggers
No database computations
No secondary indexes

This paved the way for Vitess and NoSQL databases, as companies were having issues with scaling their database. The way that it was designed, they weren't able to maintain data consistency, an ACID property, when transactions spanned several different shards. Referential integrity is all about consistency when data spans across multiple shards, therefore it makes sense that they're unable to support it well.

We can go deep into the structure of NoSQL databases with no FOREIGN KEY constraint and issues that we'll face adopting that model, but that's the topic for another post.

It's not just Vitess, it's been a standard practice for sharded databases to avoid referential integrity as there is simply no other choice. In terms of the ACID model, their documentation states that they guarantee atomicity but not isolation, and even go as far as to say:

Guaranteeing ACID Isolation is very contentious and has high costs. Providing it by default would have made Vitess impractical for the most common use cases.

Let's briefly talk about what ACID isolation is. There are four levels to it (according to the SQL-92 standards), including serialisability, read committed, read uncommitted, and repeatable reads. With that being said, there are more levels of isolation, such as Snapshot isolation which isn't a SQL standard although used by several databases such as Firebase or MongoDB. If you're further interested in this, I recommend reading this post. To keep it brief, I'm not going to go over what every level of isolation does/means, but if you'd like to read more about that do check out this page from the MySQL Documentation.

ACID isolation refers to the database transactions being ACIDic, which is important as they guarantee that operations behave the way developers expect them to. I'm unsure on what they mean when they say "Guaranteeing ACID Isolation is very contentious and has high costs", but if they mean that guaranteeing ACID Isolation has high costs for any product, they're wrong. Several distributed ACID-compliant databases have the highest level of isolation (serialisable transactions) whilst still being performant with fast read/write speeds. In the context of Vitess though, they're not wrong as across multiple shards transactions cannot meet any level of isolation.

Conclusion

Concluding everything I've said in this article, technology is all about sacrifices. Some technologies do some things better than others. Even though Vitess doesn't have ACID transactions, it's far more battle tested than most other distributed databases as companies have simply been using it for a longer time. Distributed databases on the other hand have the advantage of providing you with the best of both words: scalability and features of normal relational databases. In the end, it's your choice to go with whatever best suits you and your application.

That's all I have for today. Have a wonderful day 👋