Database performance: If you're not first, you're last!
In a world where time is one of our most important commodities, we want to build apps that value people's time. Our users' time. Our time. If you've ever built and scaled an app powered by a relational database engine like MySQL, PostgreSQL or Microsoft SQL Server, you'll know that things are usually cool...until they're not! A page that used to load instantly doesn't anymore. You start looking into it and trace it down to a particular query hitting your database. You scratch your head and wonder why this was fine yesterday. What's changed?
You might be here to learn more about database indexes, but more than likely, you're here because you have a database performance problem and want to learn more about the levers you can pull to make things better. You've probably come across articles or Stack Overflow questions referencing some of the key terms below and feel overwhelmed or confused about what these things actually mean.
Even if you’ve been working with relational databases for some time and have used some of these features/concepts, you might not have a good understanding of what’s happening under the covers. If you really want to maximise the performance of your database, you'll need to lift the hood and invest in learning about the mechanics.
Even the best are on a path of discovery. Recently I saw this Tweet by Jean-Michel Lemieux (CTO of Shopify).
Hopefully, by the end of this post, you should have a decent handle on some of these concepts and understand why the change made by the Shopify team had such a big impact!
Just a quick disclaimer: Initially, my goal was to write up a document to help new engineers at Codeo (we build cool stuff like SendBoard), but it quickly turned into a long post that I thought others might get value from. My recent experience has been scaling apps on MySQL, so most of the deep dives below relate specifically to MySQL, but many of the concepts translate decently to other relational database engines. This post is by no means the deepest of dives on the topic, but more of a beginner/intermediate guide for those wanting to know more about the underlying mechanics. In writing this post, I discovered a few things I had misconceptions about, so if you’re in the know, and spot any inaccuracies, please let me know! We’re all on a journey of growth and discovery 😅
To this end, I’m reminded of this quote:
I don’t explain - I explore - Marshall McLuhan.
Buckle up 🙂
Storage 📀
Before we kick off, you need a little bit of insight into some of the constraints database engines must overcome when storing and retrieving data from your hard drive.
Why do we need storage?
Most relational databases are ACID compliant. The D in ACID stands for durability. And this is where storage (your hard drive) comes in. In order to support durability, data needs to be stored permanently on disk. If data were stored exclusively in memory (RAM), a system crash or restart would mean losing your data 😱. After changing some data in your database, you want some guarantee that your changes will be saved and survive a database/system crash. This contract is known as durability.
To be clear, this means that after inserting/updating/deleting data using your language or client of choice, you are guaranteed that the change is persisted (unless you get an error or are using transactions and haven’t committed yet).
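As a small illustration of that caveat, here's a sketch of a transaction in MySQL (using the users table we'll define a little later in the post). Until commit is issued, the durability guarantee doesn't apply to the change:

start transaction;
update users set age = 30 where id = 1;
-- if the connection drops or the server crashes here, the update is lost
commit;
-- once commit returns successfully, the change is durable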
Constraints
When you read and write data from your hard drive, it doesn't happen instantaneously. If you're using an SSD, it may feel fast, but writing data to your hard drive is roughly a factor of 1000 slower than writing data to memory. The time an operation takes depends on a number of factors, including what else is happening at the time (who else is reading/writing to disk), the nature of the operation (read vs. write) and the amount (or size) of data involved. As with older spinning disks, SSDs have different performance characteristics when reading/writing data randomly vs. sequentially. In general terms, we look at IOPS (input/output operations per second) when reviewing disk performance. Different storage devices provide different characteristics insofar as IOPS are concerned.
IOPS as a metric on its own is not meaningful. It simply reports on how many disk operations (read/write) can be executed per second. Larger writes (say 1MB) might result in fewer IOPS but probably more throughput (in raw MB/s). Small writes (1KB) might report higher IOPS but lower throughput. When you benchmark your storage, you will want to ensure that you are consistent with the workload you test.
Something important to acknowledge is that it's much faster to read/write data sequentially than randomly. In other words, reading data stored sequentially on your hard drive is much faster than reading blocks of data scattered all over the place. Likewise for writes. You will often see benchmarks depicting these differences when reviewing the performance of hard drives.
Reading and Writing Data
Reading and writing data from your file system is not performed one bit at a time. Typically, data is written and retrieved from the file system in what is known as a block. This is the smallest unit that can be used when accessing data. The block size is the base unit of work. If anything in a block changes and needs to be flushed to disk, the whole block is written. A typical block size is 512 bytes.
You may have heard of data being accessed in pages. This has different meanings based on context, as you’ll see below.
For data mapped from a hard drive into memory, a page can be thought of as a virtual block made up of a number of physical blocks. In memory, each page occupies a fixed-size slot known as a page frame. This abstraction means the operating system doesn't have to think about block sizes at the physical level, because different storage devices might have different block sizes. Pages have a fixed size; for example, 4KB is a common page size.
Database engines also have this concept of a page which is a higher level abstraction. Typically, these pages are larger. For MySQL, the default page size is 16KB. When you insert a row into a relational database, you are inserting it into a page. When a change happens to one or more rows in a page, the database marks the page as dirty and the entire page is then flushed to disk (which means one or more OS level pages get flushed).
You might be wondering how database engines like MySQL achieve atomic changes (the A in ACID) if the database page size is larger than the file system's page size. For example, if the database page size is 16KB and the operating system page size is 4KB, how do database engines like MySQL ensure a page is not corrupted if the database engine or system crashes in the middle of writing a database page to disk?
MySQL solves this problem by making use of a doublewrite buffer which essentially means data is written twice. This sounds inefficient but it isn’t as expensive as you would think.
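If you're curious about these values on your own server, both the page size and the doublewrite buffer are exposed as MySQL system variables, so a quick sketch like this will show them:

show variables like 'innodb_page_size';   -- 16384 (16KB) by default
show variables like 'innodb_doublewrite'; -- ON by default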
What exactly happens when a row is inserted?
Above we discovered that MySQL stores rows in a 16KB page (the size of a page is configurable). How does this work?
Well, let's assume that you create a table called “users” with the following fields:
id - primary key (all you need to know for now is that data is sorted by this value)
first_name
last_name
age
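To make this concrete, here is a minimal sketch of what such a table might look like in MySQL. The post only names the fields, so the column types are my own assumptions:

create table users (
  id int unsigned not null auto_increment, -- assumed type; data is clustered by this value
  first_name varchar(100) not null,        -- assumed type
  last_name varchar(100) not null,         -- assumed type
  age tinyint unsigned null,               -- assumed type
  primary key (id)
);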
A database engine like MySQL will allocate some space to store data for this table. It doesn’t do this one page at a time. Sensibly, it creates a number of pages in 1MB increments (known as an extent).
What’s in a MySQL page? In general, a page contains a few headers, your data (rows) and then some metadata at the end of the page. For more in-depth info on the structure, see here.
So assuming we insert two rows with id: 1 and 3, we could have a page that looks like this:
The diagram shows you the extent and how the pages are structured within it. An important aspect is that, sequentially, the pages have pointers to each other in what is known as a doubly linked list. If another extent were added (let's assume all our pages were full and we need to insert another row), the last page in this extent would have a pointer to the first page in the next extent and vice versa. This is important from a database perspective because it enables the database engine to easily traverse through adjacent pages when reading from any specific page.
Another aspect to notice is how the data is stored one row after another in a page. This structure is known as a row store (as opposed to a column store). This means that data is organised by row. If it were organised by column, you would have all the id fields, then all the name fields and so on. Note that in the row store, each row is stored sequentially. This is important to understand in the context of relational databases.
So where does a new row, say with id → 2 go?
Whether or not we have a primary key, assuming there is space in the page, the record will simply be appended at the end. There is no physical ordering of rows within a page on disk. Since the whole page is read/written to memory, there would be very little benefit to maintaining one. In the headers, MySQL keeps track of the min and max identifier for rows in the relevant page.
So what happens if we keep on inserting rows and run out of space in a page?
MySQL will keep writing data to a page until the page is ±94% full. Or in other words, it will keep writing data until 1/16th of the space in the page is left. At this point, it will start writing data to a new page. If the data we are inserting is sequentially ordered by its primary key value, the database engine will continue to write rows into the new page until that page is 94% full. Rinse. Repeat.
If we were writing data with a random primary key value, then it’s possible that we might end up having to write data to an older (or previous) page which only has 6% (1/16) capacity left. In this situation, MySQL will insert the value if there is space in the page, however, if there is no space, the page will need to be split in two, and a new page, possibly in another extent, will need to be used.
When a page is split, half the records in the original page are left there, and the other half will need to be moved to the new page. This results in fragmentation on disk as data is not ordered one page after another (although the doubly linked list is maintained). Assuming this continues to happen, we would end up with many pages which are between 50-100% full, resulting in inefficient reads (and writes!) as the database engine may have to read/write two pages instead of one when fetching/writing what should fit into one page.
Take a note here. This is one piece of the puzzle in understanding the change by the Shopify team and why it had such a big impact 🙂
OK, so now that we understand a little more about the data structures, let’s dig into some of the other aspects.
Primary Keys 🔑
The word key in primary key might be confusing to some. In the context of databases, think of a key as a field or column.
A primary key is a special type of index which is usually made up of a single field/column in a table and which uniquely defines each row. It cannot be null. It can be made up of multiple keys/fields and when this is the case, it is known as a composite primary key.
To be clear, a primary key is a unique identity for the row, much like your identity or social security number as an individual.
In our previous examples, id was the primary key because it uniquely identified each row. For example, we couldn't have two or more rows with id 1. If we chose first_name as the primary key, we couldn't have two rows with the same first name, e.g. two Rohlands or two Marks.
I outlined above that a primary key is a special type of index. The diagram below presents a simplified version of how the data is organised as a tree. In this simplified example each node has a maximum of 2 child nodes, which makes it look like a binary tree, but the structures database engines actually use are B-trees and B+trees: balanced trees in which each node can hold many keys and have many children. If you'd like, you can pause here and watch a general introduction to B-trees:
Welcome back! Let’s take a look at the diagram below.
This diagram is showing how 8 rows with an integer primary key might be stored in a tree structure. The root node tells us that the index contains values between 1 and 8. On the left side of the root node, we have values 1 → 4 and on the right 5 → 8. The child nodes further split up into child-nodes until we get to the leaf nodes containing the actual data.
The above diagram is actually a B+tree because the leaf nodes contain a copy of the value being indexed (our primary key) as well as the data itself. This is the structure InnoDB uses for MySQL. So to be clear, the leaf nodes of our primary key index contain the data itself. This is very different to other indexes, and it is what makes the primary key index special.
You might have seen the word “clustered index” in the context of primary keys, i.e. that a primary key is a clustered key or clustered index. This simply means that the data is clustered, or co-located with the index as we see above. In the context of a B+tree, the data is physically organised on disk such that the leaf nodes contain the actual data in the order of the key with pointers between each page allowing sequential traversal.
It’s easy to see why we can’t have multiple clustered indexes now, since it would be impossible to organise the data with multiple sort orders. Think about a Microsoft Excel document with multiple columns. Sorting by one column affects the order of other columns and vice versa. In addition, considering that the data is co-located with the index, having two clustered indexes means the data has to be duplicated, doubling the storage requirements and increasing the complexity for changes (inserts, deletes etc.).
Just a note here that with Microsoft SQL Server, you can separate the concept of a primary key from a clustered key, i.e. you can choose one field to be the primary key (unique) and another field to be the clustered key (the order of the data on disk). For MySQL, this is not possible and they are one and the same.
Previously we looked at an example where we inserted a value between rows with ids 1 and 3. This didn’t really impact the storage mechanics as the data was simply appended in the page, but now that we understand that there is a higher order structure (the B+tree), there is an impact. The insert may result in the B+tree having to be rebalanced (watch the linked video above if you don’t know what this means!).
Besides causing page fragmentation, random primary keys also create challenges for managing the B+tree. This is why it's generally a bad idea to choose a primary (and clustered) key with random values like a UUID or GUID: each row insert gets a random primary key value, resulting in a high probability of reorganisation of the data on disk (page splits) and reorganisation of the B+tree.
You may have come across keywords or phrases like “auto increment” or “identity”. These refer to a capability of relational database engines to generate a unique, sequentially incrementing primary key value for you. This is super useful because you can leave it up to the database engine to generate the primary key value, however, there are situations where you might want to define this yourself. Just remember to consider the ordering of your key and the impact on storage.
Take another note here for later when we unpack the Shopify situation I mentioned in the introduction.
Performance considerations ⚡️
In the absence of any defined structure, and assuming we simply inserted data onto disk, searching for any particular primary key value would require a comparison against each and every primary key value in each row. In the diagram above, that might not be a big issue since searching for Id → 8 would only require 8 comparisons, but scaled up to larger datasets, this becomes a major problem. The B-tree structure enables O(log(n)) search complexity since we simply traverse the tree. In the example above, finding Id → 8 would only require 3 comparisons (the root node, then [5..8], then [7..8] to find the page with the data in it). This might not sound like a big win until you expand the example to 1 million rows. Searching for the row with id → 1,000,000 would only require ±20 comparisons vs. 1 million comparisons without our primary key index. That's a huge difference 😅
There are other benefits to this data structure too. If we were looking for a range of values, say values 6 to 8, the values would be colocated, either in the same page or in an adjacent page linked in our doubly linked list.
So, to summarise:
Advantages:
Queries which look up a specific value/key are very fast.
Range queries are very efficient.
Disadvantages:
Since the primary key index's leaf nodes contain the data itself (the data is co-located), any changes to the primary key structure (for example moving from a 32-bit integer to a 64-bit integer) mean mass re-organisation of all the data and the secondary indexes (more on this just now). This is very expensive. To expand on this, think of pages as cupboards, and each row as a box you have stashed away. Imagine a thousand cupboards all packed neatly with boxes that you have ordered sequentially. Imagine if the size of each box increased by 10%. You would need to re-organise everything!
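To put the 32-bit to 64-bit example into SQL, the statement involved would be a sketch along these lines. On a large table, expect it to rebuild the clustered index and every secondary index (since they all store copies of the primary key):

alter table users modify id bigint unsigned not null auto_increment;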
Secondary Indexes
So we've covered quite a bit of ground and learnt a little about indexes in the context of primary keys. We’ve unpacked how primary keys help us efficiently find data by a primary key value. But what if you want to improve performance when searching for data other than the primary key? This is where secondary indexes come in.
Assuming we had 10 million rows in the table we defined above, and we wanted to find everyone named “Mark”, we would have no choice but to iterate over every page and every row, comparing the first_name field to the value “Mark”. This requires reading all 10 million rows into memory. As you can imagine, this is a very expensive and slow process, especially if the table is so big that it cannot fit into memory (RAM).
The solution is to introduce a secondary index. We all have experience using these types of indexes in everyday life. At the back of most books, you will find an index of key words where the value of each entry in the index is a list of pages the word appears on. Consider the page number to be the primary key, and the index at the back of the book to be the secondary index. If we wanted to find all pages where some word appears, we only have to look through a few pages at the back of the book, find the word we are looking for, and then open up the pages that are referenced in the index. The process of looking up a specific word is much faster using this approach than paging through every page in the book.
All secondary indexes are non-clustered indexes, meaning they are data structures which are not co-located with the data, and so you can have many of them. We could have one for first_name, another for last_name and a third for age. As with our book example, instead of the leaf nodes being the data itself, the leaf nodes are copies of the primary key values. It's important to understand this: the leaf nodes contain a copy of the primary key value, so in our original example, the id value.
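As a quick sketch of what this looks like in practice (the index name is my own invention):

create index ix_users_first_name on users (first_name);

-- this query can now use the index instead of scanning every row
select * from users where first_name = 'Mark';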
Storage
Non-clustered indexes in MySQL are stored in a B+tree and are sorted by the field (or fields) on which the index was created. The leaf nodes contain a copy of the primary key value of the row each entry corresponds to. This means they are much smaller data structures which provide the same O(log(n)) performance characteristics when searching for specific values. Any time a row is inserted, updated or deleted, all secondary indexes need to be updated.
Choosing fields
Note how a secondary index can be made up of more than one field. When more than one field is included in an index, the index is known as a composite index.
For example, one could define a secondary composite index on (last_name, first_name). This type of index helps us search for anyone with a specific last name, or anyone with a specific last name and first name, but cannot help us efficiently find people with a specific first name.
To understand this, consider a classic phone book. A phone book is essentially an index where the leaf node is someone’s telephone number and address. It is typically sorted by last name and then first name. It helps tremendously when one wants to:
find everyone with the last name “Smith”
find everyone with the last name “Smith” and first name “Joe”
However, the classic phone book cannot help us find everyone with the first name “Joe”.
To find everyone with the first name “Joe”, you would have to page through each page in the book. We would need a different phone book, sorted by first name, to help us with this problem. This is why the order of fields in a composite index is important. The fields you choose, and their order, determine which search queries the index can help with, so you need to be careful how you order the fields in a composite index.
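Translating the phone book analogy back into SQL, a sketch (again, the index name is my own):

create index ix_users_last_first on users (last_name, first_name);

-- can use the index (the leading field, last_name, is filtered on):
select * from users where last_name = 'Smith';
select * from users where last_name = 'Smith' and first_name = 'Joe';

-- cannot use the index efficiently (first_name is not the leading field):
select * from users where first_name = 'Joe';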
Multiple Secondary Indexes vs. Composite Indexes
In the example above with first name and last name, you might consider a solution where you have two separate indexes. One for first name and another for last name. You might think this would help with all three use cases of:
finding everyone with last name “Smith”
finding everyone with first name “Joe”
finding everyone with first name “Joe” and last name “Smith”
However, that's not the case! When satisfying a query, database engines will generally only use a single index per table (MySQL does have an index merge optimisation that can combine several indexes, but it's the exception rather than the rule).
Database engines typically maintain statistics on the data stored to help infer which index is the most appropriate to answer a specific query. So, assuming you have separate indexes for first name and last name, and search for someone with the first name “Joe” and last name “Smith”, the database engine will need to decide which index to use. It might choose first name, or it might choose last name, depending on the statistics it has on the data in the database. For example, if there are many Joes and very few Smiths, it will most probably choose the last name index because it results in fewer rows to compare (so it is more efficient).
Note, it’s possible to force a database to use a specific index. Sometimes, the query optimiser chooses the wrong index. Knowing better, you can override this. For example, in MySQL you can use an index hint to force or ignore a specific index.
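As a sketch of what an index hint looks like in MySQL, reusing the hypothetical ix_users_last_first index from the earlier sketch:

-- ask MySQL to prefer a specific index
select * from users use index (ix_users_last_first) where last_name = 'Smith' and first_name = 'Joe';

-- or insist on it / explicitly exclude it
select * from users force index (ix_users_last_first) where last_name = 'Smith';
select * from users ignore index (ix_users_last_first) where last_name = 'Smith';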
Covering Indexes
You might have heard of a covering index. This should not be confused with a composite index, although they are related. A covering index usually refers to an index which can satisfy a specific query because all the data the query needs is contained in the index itself.
For example, assuming a composite index on (last_name, first_name), we know immediately that for every leaf node, the database engine has immediate access to three pieces of information:
The last name
The first name
The primary key (in our case, id)
So if we have a query like this:
select id, first_name, last_name from users where first_name = "Joe" and last_name = "Smith"
There is no need for the database engine to de-reference the primary key back to the data to retrieve any more information. It's all in the index 😅. This is very efficient and the primary key index is not involved at all. This is very different to the following query:
select id, first_name, last_name, age from users where first_name = "Joe" and last_name = "Smith"
In this query, we are looking for age in addition to the other fields, and age is not in our secondary index. So the database engine has to take any results (copies of the primary key) and de-reference them back to the primary key to fetch the age field. This is not a covering index. Having the (last_name, first_name) index is still useful and efficient, but not as efficient as a covering index.
If we knew in advance that age is always going to be retrieved in these sorts of queries, one might create a composite index over the fields like this: (last_name, first_name, age). Even though we will never filter by age, we include it in the index to ensure we never have to dereference back to the primary key.
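A sketch of that covering index in MySQL (the index name is my own), along with the query it covers:

create index ix_users_last_first_age on users (last_name, first_name, age);

-- every column referenced below lives in the index, so the primary key index is never touched
select id, first_name, last_name, age from users where first_name = 'Joe' and last_name = 'Smith';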
Performance considerations ⚡️
OK, so this is awesome. We now know that a secondary index helps us tremendously in optimising our searches/queries. So why not slap an index on every field and every combination that we will encounter?
With DML operations (insert/update/delete queries), each secondary index has to be updated so that the right data is contained in the index and points to the correct primary key. This can mean that inserts, updates and deletes can be very expensive, especially if the fields indexed are wide (in size) or the primary key is large (for example a GUID, an identity number (string) or a composite key).
A balance needs to be struck between these considerations and your DML performance. If you are changing data infrequently, or in batches, this might give you a license to have more secondary indexes. If you are inserting or changing data frequently, and performance is important, you need to carefully consider the impact that secondary indexes will have at run time.
Finally, space needs to be considered. The more secondary indexes you have, the more storage they consume on disk, and the more RAM is needed to keep their pages cached so that they are actually useful.
I'd like to pause here and take a look at the Shopify situation introduced at the beginning of this post. To recap:
Graphs 📈 are boring even if one of mine made the Wall Street Journal. As an engineer, much prefer 📉.
— Jean-Michel Lemieux (@jmwind) May 12, 2020
Let's talk performance at Shopify. Composite primary keys aren't the default in AR/Rails. We added and now seeing 700% speed-up in top order queries. /1 pic.twitter.com/4qvVlKnueI
Originally, Shopify had relied on a standard auto-incrementing primary key on their orders table, and that got them quite far, but in the end, they had to make a different choice. Here's a high level breakdown by Jeremy Cole (who has written many interesting deep dive posts on MySQL):
With an auto-incremented id, each time a new row is added it will go at the "end" of the PK/CK index structure. Given the rate at which new data is added in a system like @Shopify, this will result in each new row for a given shop ending up on a different InnoDB 16 KiB page. 😱
— Jeremy Cole (@jeremycole) May 12, 2020
So, although Shopify had a secondary index on shop_id on their orders table, new orders (across shops) were being inserted so quickly that sequential orders for a specific store ended up in different pages on disk! This in itself is not a problem, however, when displaying orders for a specific store (for example in a report for a store owner), the database had to essentially read an entire page per order returned from the secondary index (since each order for each shop is most likely found in a different page).
Their solution (after a lot of experimentation and testing, it seems) was to introduce a composite primary key made up of shop_id and then id. This will ultimately result in fragmentation and split pages (as folks shop randomly across shops), but the performance benefit of reading from a fraction of the pages has obviously outweighed this downside.
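To make that concrete, here's a hypothetical, heavily simplified sketch of what such a table definition could look like in MySQL. Everything beyond shop_id and id is my own assumption, not Shopify's actual schema:

create table orders (
  shop_id bigint unsigned not null,
  id bigint unsigned not null auto_increment,
  total decimal(10, 2) not null,   -- assumed column
  primary key (shop_id, id),       -- orders for the same shop are now clustered together on disk
  key (id)                         -- InnoDB requires the auto_increment column to lead at least one index
);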
Other notes on Performance
Not every query can be efficiently satisfied with an index. Sad, but true.
Consider a scenario where we add an additional field to our table for employment status. Assume there are only two possible values, employed and unemployed. Let’s assume we add an index on this field and then search for all employed people. One would think that our index would be used, but generally speaking, we’ll find it isn’t. Why?
There is a property called selectivity which database engines use to determine whether an index should be used or not.
For a primary key, selectivity is at its highest possible value (one) since the number of unique values is equal to the number of rows, because each value in the primary key is unique. For our employment status, we only have two unique values - employed and unemployed. This means if we had 1000 rows, our selectivity value would be: 2 / 1000 → 0.002. The more rows we have, the lower the selectivity in this scenario.
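You can get a rough feel for selectivity yourself with a query like this sketch (assuming the employment status field is the employed column used in the partial index example further down):

select count(distinct employed) / count(*) as selectivity from users;
-- 2 distinct values / 1000 rows → 0.002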
When someone says a field has high selectivity, you know that it contains a high number of unique values. If a field has low selectivity, it contains a low number of unique values.
In general, database engines only use an index if the selectivity is high. Why is this?
Well, let’s go back to our example of an index at the back of a book. Assume we have a book containing 10 different words, randomly organised into 100 pages and we wanted to count how many times one of the words was present in the book. If we flipped to the back of the book, the index would be very small. Each value in the index would point to a large number of pages. It would be counter-productive to look at the index, and then flip to each individual page the index references, going back and forth between the index and each page. Remember, random access doesn’t perform nearly as well as sequential access on disk, so in this scenario, it would actually be more efficient to simply sequentially scan each page for the word from the beginning to the end. So, in our example above, you might find that the database engine opts to simply use the primary key, iterating over all our rows looking for the relevant field.
Unique Constraints
A unique constraint can be applied to any secondary index, as long as it’s possible given the data already in the table. Remember that a primary key, by definition, is already unique, so this only applies to secondary indexes.
Simply put, a unique constraint ensures that each value in the index is unique, i.e. that no two rows (and therefore no two leaf nodes) share the same indexed value. We probably couldn't do this for our (last_name, first_name) index since it's possible to have two Joe Smiths, but we could use it if we had a field for identity number, since no two people can have the same identity number.
Unique constraints are tremendously helpful in maintaining the integrity of our data and for idempotent operations. Database engines like MySQL have keywords for ignoring unique constraint violations when inserting and updating data. For example this query:
insert ignore into users (first_name, last_name, identity_number) values ('Joe', 'Smith', '12345')
will do nothing if a row with the identity number 12345 already exists (assuming a unique constraint on identity_number). Alternatively, we could drop the ignore keyword and simply rely on the error our database client would throw if someone with that identity number already existed.
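For completeness, a sketch of how that unique constraint on identity_number might be defined in MySQL (the index name is my own, and it assumes the identity_number column has been added to the users table):

create unique index ux_users_identity_number on users (identity_number);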
Partial Indexes
A few database engines support the concept of a partial index (sometimes referred to as a filtered index). In essence, a partial index is an index with a where clause. It doesn’t index all the data, but rather a subset of the data based on the defined condition.
For example, we might want to create an index on first_name and last_name where the person is employed. If we were using Microsoft SQL Server, this might look like:
create nonclustered index ix_names_and_employed on users (last_name, first_name) where employed = 1
This results in a much smaller index (memory wise) and helps in situations where we have queries which match the condition in the index, for example:
select * from users where last_name = 'Smith' and first_name = 'Joe' and employed = 1
The partial index would not apply in situations where we search for users that are unemployed.
Understanding what index is used 👓
Previously, I mentioned that a database engine will generally only choose one index (per table) to use when executing a query. How can one discover what index was chosen?
When using MySQL, popping the explain keyword in front of a query will provide insight into what indexes were considered and what index was chosen for a specific query. An example:
explain select * from users where last_name = 'Smith'
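The columns worth focusing on in EXPLAIN's output are possible_keys (the indexes MySQL considered), key (the index it actually chose), type (the access method, where ref indicates an index lookup and ALL a full table scan) and rows (an estimate of how many rows MySQL expects to examine). As a purely illustrative sketch, assuming the hypothetical ix_users_last_first index from earlier exists:

explain select * from users where last_name = 'Smith' and first_name = 'Joe';
-- illustrative result, not real output: type: ref, key: ix_users_last_first, rows: 1

With no usable index, type would show ALL and rows would approach the total row count of the table.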
Summary
I started this post with a story about how database performance often grinds to a halt without much warning. One might think that there is a big difference between executing 10 and 10,000 comparisons when filtering data, however, if data is in memory, the impact is rather small. So, assuming our table fits into memory (RAM), we tend to get away without indexes. The moment our dataset expands beyond what can fit in memory and our database engine needs to start paging to disk, we want to limit our reads as much as possible, and that’s where a well thought out indexing strategy pays dividends 💪.
So, if you’ve made it here, thanks for sticking around! Indexes are an incredibly powerful tool when designing scalable relational databases. Hopefully, you have a better understanding of the various types of indexes and how they can help you scale your database and provide a robust and performant app.
A few questions/areas you could start to explore to further your understanding are:
Are all column types indexable?
Can you create, update or remove primary and secondary indexes while you have live queries running?
Assuming you have a date field called birth_date and you have a secondary index defined on the field, why does the following where clause not use your index?
where year(birth_date) < 2000
What impact does sorting have on index performance?