SQL Performance Crash Course - Indexes
An index makes the query fast
– Markus Winand
Database Management Systems are complex, somewhat arcane, beasts.
This complexity has led many developers into believing that performance issues are unavoidable and a deficiency intrinsic to relational databases; something to be fixed by scaling vertically, horizontally, or both.
In this series of posts we’ll explore the one thing we developers SHOULD know to improve our SQL queries’ performance: how to use indexes.
What’s an index, anyways?
An index holds a copy of the data from one or more columns of a table, stored separately from the table itself, in a sort order defined by the application developer.
Indexes are implemented by combining the logarithmic scaling power of B-trees (balanced trees) with the runtime flexibility of doubly linked lists, allowing extremely quick lookups even when querying tables with millions of rows.
They’re composed of two parts:
- The Leaf Nodes, where each node holds a few entries; each entry holds the indexed values and a reference to its related table row. Entries are sorted as defined by the index definition, and each node holds a reference to the previous and next node.
- The Search Tree, which has a root node (as all trees do) and several levels of branch nodes. At some level, depending on the number of records in the table, branch nodes connect with the leaf nodes.
The purpose of an index is to provide an optimized access path to the table data required by the application for a specific use case.
B-Tree traversal
Each branch node has multiple entries, and each entry holds a reference to another branch or leaf node, and the biggest value contained in that node. Leaf node entries hold a reference to a table row.
Traversing the tree is performed by following an algorithm like this one:
search_results = query(value)

// Search the tree, then retrieve the physical rows
def query(value):
    row_ids = search(root_node, value)
    rows = []
    for each id in row_ids:
        rows += get_row_by_row_id(id)
    return rows

// Tree traversal
def search(node, value):
    for each entry in node:
        if entry.value >= value:
            if entry.node is a branch node:
                return search(entry.node, value)
            if entry.node is a leaf node:
                return follow_the_leaves(entry.node, value)
    // We found nothing
    return []

// Follow the leaf node chain
def follow_the_leaves(node, value):
    results = []
    for each entry in node:
        if entry.value == value:
            results += entry.row_id
        if entry.value > value:
            // Entries are sorted, so nothing further can match
            return results
    // Every entry in this node matched; the next leaf may hold more matches
    return results + follow_the_leaves(node.next_node, value)
Picture a run of this algorithm where there's a single matching value in a leaf node, say we're looking for the value "42": the traversal descends from the root, at each branch node following the first entry whose value is greater than or equal to 42, until it reaches the leaf node containing the match.
The leaf nodes being a doubly linked list matters when there are multiple matches and not all of them sit in the same leaf node: after the first match, the database simply keeps following the chain to the next leaf until it hits an entry greater than the search value.
Basically, there are three steps to using an index to retrieve rows:
- Traversing the tree.
- Following the leaf node chain.
- Retrieving each matching row from the table.
Step 2 can be omitted if there's a UNIQUE constraint, and step 3 can be omitted with the use of clustered indexes, which we'll cover in a later post.
The power of logarithmic scalability, O(log n)
In math, the logarithm of a number to a given base is the power or exponent to which the base must be raised in order to produce the number.
In a search tree, the base corresponds to the number of entries per branch node, and the exponent to the tree depth. For example, with a tree holding 4 entries per node, the number of levels needed is simply log₄(number of entries).
In other words, the number of entries a tree can hold grows exponentially each time you add a new layer; the biggest factor in keeping the depth low is the number of entries per node.
For example, with 4 entries per node as in our previous examples:
Tree Depth | Index Entries |
---|---|
3 | 64 |
4 | 256 |
5 | 1,024 |
6 | 4,096 |
7 | 16,384 |
8 | 65,536 |
9 | 262,144 |
10 | 1,048,576 |
With 256 entries per node, as could be the case for a filesystem with a 4KB page size:
Tree Depth | Index Entries |
---|---|
3 | 16,777,216 |
4 | 4,294,967,296 |
5 | 1,099,511,627,776 |
Scanning a table with 4 billion records would take a long, long time. With a unique index, it's at most 1,024 comparisons (4 levels × 256 entries per node).
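If you want to double-check these depth figures, the arithmetic is just the logarithm described above; as a quick sketch, Postgres can even compute it for us (the numbers below are the ones from the tables, nothing index-specific):

-- How many levels does a tree with 256 entries per node need
-- to cover 4 billion entries? ceil(log base 256 of 4,000,000,000)
SELECT CEIL(LOG(256.0, 4000000000.0)) AS tree_depth;
-- tree_depth
-- ------------
--          4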
Index basics
Fortunately, we can get 80% of the benefit of indexing by learning about the topics we’ll cover in the following sections.
Full Table Scans
Recognizable as TABLE ACCESS FULL in Oracle's execution plans, and Seq Scan in Postgres' execution plans, these operations scan the entire table as stored on disk to evaluate a predicate.
There are rare scenarios where this is preferable, but most of the time it just means we’re missing an index, or that our previous indexes are no longer appropriate for our query.
Understanding how to get and read execution plans is vital for performance tuning.
EXPLAIN SELECT * FROM manga WHERE author_id < 500;
QUERY PLAN
------------------------------------------------------------
Seq Scan on manga (cost=0.00..470.00 rows=500 width=244)
Filter: (author_id < 500)
Ref.: https://www.postgresql.org/docs/current/using-explain.html
CREATE INDEX
We can create a new index using the CREATE INDEX
statement.
CREATE INDEX manga_author_id_idx ON manga(author_id);
Queries referencing the author_id column in the manga table may be able to use the index if the query planner estimates that it may help improve performance.
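For example, re-running an EXPLAIN with a selective predicate on author_id should now show the index being used. The plan below is illustrative (costs and row counts are made up), and the planner will still pick a sequential scan whenever it estimates that to be cheaper:

EXPLAIN SELECT * FROM manga WHERE author_id = 42;
-- Index Scan using manga_author_id_idx on manga  (cost=0.29..8.31 rows=1 width=244)
--   Index Cond: (author_id = 42)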
An index is created automatically for each table’s primary key.
Concatenated Indexes
Indexes may be composed of multiple columns. Entries will be ordered first by the first column, and then the second column, and so on.
If we had a genre_manga table with a (genre_id, manga_id) PK, we would already have a concatenated index for those two columns; if for whatever reason we had a synthetic PK instead, we could create an index manually as follows:
CREATE INDEX genre_manga_idx ON genre_manga(genre_id, manga_id);
This index could be used both for queries using genre_id, and queries using genre_id AND manga_id, but it cannot be used for queries using solely manga_id, as entries in the index are sorted by genre, and then by manga ID.
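To make the distinction concrete (the IDs here are made up):

-- Can traverse the index: the leading column is constrained.
SELECT * FROM genre_manga WHERE genre_id = 7;
SELECT * FROM genre_manga WHERE genre_id = 7 AND manga_id = 42;

-- Cannot traverse it: manga_id alone gives no starting point
-- in an index sorted by genre_id first.
SELECT * FROM genre_manga WHERE manga_id = 42;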
Basically, we would need a new index for manga_id queries:
CREATE INDEX genre_manga_manga_id_idx ON genre_manga(manga_id);
Further Optimizations…
‘Tis a bit of a spoiler towards posts in the future, but:
- If a query uses the complete primary key, the database can skip following the leaf node chain, as there's a guarantee of only one possible result.
- If all the columns required by the query are available in the index, the database can perform an Index-Only Scan (sketched below), which skips the need to retrieve rows from the table (AKA the "Heap").
There are two use cases regarding manga and genres: searching for the manga of a given genre (e.g. "Science fiction" manga), and loading the genres of a given manga, for example in a details page.
Our current indexes support both use cases, and since genre_manga is just a connecting table it's not particularly relevant for now. But remember it for later!
Function-based Indexes
What if we want to do case insensitive text search?
CREATE INDEX manga_name_idx ON manga(name);
SELECT id, name
FROM manga
WHERE LOWER(name) = 'shimeji simulation';
-- Seq Scan on manga (cost=0.00..1.01 rows=1 width=36)
-- Filter: (lower((name)::text) = 'shimeji simulation'::text)
The issue here is that name and LOWER(name) aren't the same thing, therefore the query planner can't use the index in this case; we're back to full table scans.
Fortunately, indexing the results of a function (function-based index, AKA FBI) is done almost exactly like with column indexes:
CREATE INDEX manga_name_lower_idx ON manga(LOWER(name));
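With that index in place the same lookup can use it; on a table of any meaningful size, the plan turns into something along these lines (costs are illustrative):

EXPLAIN SELECT id, name
FROM manga
WHERE LOWER(name) = 'shimeji simulation';
-- Index Scan using manga_name_lower_idx on manga  (cost=0.29..8.31 rows=1 width=36)
--   Index Cond: (lower((name)::text) = 'shimeji simulation'::text)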
We can index expressions like A+B, functions like UPPER or LOWER, and deterministic user-defined functions; in Postgres you're required to use the IMMUTABLE keyword to declare a function as deterministic before you can use it in an index.
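As a sketch, a hypothetical helper that normalizes titles has to be declared IMMUTABLE before Postgres will accept it in an index definition:

-- Hypothetical helper: lowercases a title and strips spaces.
CREATE FUNCTION normalize_title(title text) RETURNS text AS $$
  SELECT LOWER(REPLACE(title, ' ', ''))
$$ LANGUAGE SQL IMMUTABLE;

CREATE INDEX manga_normalized_name_idx ON manga (normalize_title(name));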
Parameterized Queries
The security benefits of parameterized queries outweigh any performance argument when it comes to user-provided input; still, it's worth knowing their implications for scenarios where we control the input.
- Bind parameters are not visible to the optimizer, which can cause scenarios where the current execution plan is optimal for some, but not all, runtime values.
- An execution plan cache can only be reused if the SQL statement is exactly the same. Any performance benefits we may get from using ad-hoc SQL statements must be weighed against the cost of building a new execution plan.
As a rule of thumb, we should always use bind parameters except for values that shall influence the execution plan.
The book's example involves unevenly distributed status codes like "todo" and "done", where there are many more "done" entries than "todo" entries; using the index for "todo" and doing a full table scan for "done" is probably a good idea.
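A sketch of that trade-off, using a hypothetical tasks table with a status column:

-- With literal values, the planner sees the actual value and its statistics:
SELECT * FROM tasks WHERE status = 'todo';  -- few rows: an index on status helps
SELECT * FROM tasks WHERE status = 'done';  -- most of the table: a full scan is fine

-- With a bind parameter, one generic plan has to serve both values:
SELECT * FROM tasks WHERE status = $1;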
Strings and LIKE filters
LIKE filters can only use the characters before the first wildcard during tree traversal.
That means any LIKE filter that starts with a wildcard can't use the indexes we've seen so far; a trigram index may be needed (we may cover that one later).
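In other words (the titles are just placeholders):

-- The prefix before the wildcard can drive the tree traversal:
SELECT id, name FROM manga WHERE name LIKE 'Shimeji%';

-- A leading wildcard leaves nothing to traverse with, so the
-- B-tree indexes we've covered so far can't help here:
SELECT id, name FROM manga WHERE name LIKE '%Simulation';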
When using a bind parameter in conjunction with a LIKE expression, most databases assume there's no leading wildcard.
PostgreSQL and LIKE filters
In PostgreSQL we need to specify an operator class like varchar_pattern_ops in order to use LIKE expressions together with our indexes.
In addition, Postgres always assumes there's a leading wildcard when using bind parameters with a LIKE expression, but it has the @@ operator to implement text searches.
Indexing and JOIN
You don’t have to do anything special to handle joins. You can’t create an index spanning multiple tables, so as long as you create the appropriate indexes on each table for your given query, the DBMS will do the rest for you.
Just be wary of ORM tools and their tendency to hide performance bugs like SELECT N+1, or querying too many columns (which affects the performance of some JOIN algorithms the database uses, like Hash Joins).
However, there’s an exception to this.
Hash Joins
Hash Joins avoid accessing a table many times by fully scanning one side of the join (or rather, the columns relevant to the query) and loading it into a hash table, which is then probed while scanning the other side.
Indexing the join columns doesn't improve performance in hash join scenarios; instead, you can improve hash join performance by:
- Indexing the columns used in the independent WHERE predicates (see the sketch after this list).
- Selecting fewer columns so the entire relation can be loaded into the hash table at once.
- Adding more constraints so fewer rows are loaded into the hash table.
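As a rough sketch, joining manga against a hypothetical author table might produce a plan shaped like the one below (the costs and row counts are made up). Note that the predicate worth indexing is the independent filter on author, not the join condition:

EXPLAIN SELECT m.name, a.name
FROM manga m
JOIN author a ON a.id = m.author_id
WHERE a.country = 'JP';
-- Hash Join  (cost=15.05..485.25 rows=120 width=64)
--   Hash Cond: (m.author_id = a.id)
--   ->  Seq Scan on manga m
--   ->  Hash
--         ->  Seq Scan on author a  -- an index on author(country) could replace this scan
--               Filter: (country = 'JP'::text)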
Performance is OUR responsibility
We developers know what data is required to satisfy each use case, so we’re particularly well-suited to write appropriate indexes, and to update them when requirements change (and they always change).
Database administrators can help us write indexes and deal with the finer details of our specific DBMS, but it's on us to be on the lookout for changes that may leave old indexes unable to support our queries, which can cause severe performance degradation.
Be careful, and remember your training. And write integration tests, too >:)