Optimizing SQL queries performance

Posted 7. July 2020. 7 min read.

Nobody likes waiting SQL queries, but the good news that we can make it faster, here i will describe my journey with learning how to optimizing SQL queries for a good performance.

I will describe:

How a SQL database executes a query?
Operations in an execution plan
Indexes
- What is an index?
- What are the types of index?
  - B-tree
  - Hash
  - Bitmap

So let's start our journey

How a SQL database executes a query?

A query on a large table can be done without scanning the full table, also a join a cross multiple tables can be done without comparing each row in both tables to find matches.

Depending on the details of table and the indexes we have, Database optimizer considers many techniques to improve lookups involved in a SQL query.

This set of operations optimizer chooses to perform the most effective query is called Execution plan also known as the EXPLAIN plan.

Execution plan helping us see what is happening when running a specific query, tells us if we have a well optimized query, and showing us how the optimizer run this query.

Operations in execution plan

We have a multiple operations execution plans can do, the first and simplest one is scan operation, here it's going to access each row in the table, it also can scan indexes which are a spacial kind of tables.

Scanning is a linear operation O(N), it moves from one row to the next one and applying some filters like where clause.

Scan operation can be like:

Scanning looks at each row in a table.
Apply filtering, join, condition in a where clause or any other operation.

but now when scanning is efficient?

As it's a linear operation so the cost of the queries based on the number of rows-we will know how to optimize this next-.

Scanning is efficient in these cases:

When we are scanning small tables.
Scanning large table can be efficient if the queries against the table operate on most of the rows in the table or its just a few queries-scanning large table repeatedly is not efficient-.

Scanning the full table is called Full table scan. If we see a full table scan in a query execution plan it's worth looking why a full table scan was chosen and seeing if there isn't a more efficient to retrieve the data.

Creating indexes is a good way to avoid full table scans and make execution plan more efficient. So what are indexes? let's see.

Indexes

Practically indexes are also a type of tables, but they are just used to speed up the queries.

What is an index?

Indexes are ordered subset of data in a table, They are ordered in such a way that makes it efficient to look up a row by a particular value. an entry in an index contains the data value that is the basis for the index and a pointer to the location of the corresponding row,

also indexes are smaller tables which make it efficient and faster in queering data as it can fit into memory which is really great as reading data from memory is much faster than reading data from SSDs or HDDs.

it also help enforce constraints for example unique constraint on a column will tell us when we are going to a duplicate value. The big advantage of indexes that they reduce the need for full table scans.

So if our where clause reference the index value, we can use the index to find the rows that meet the criteria rather than scanning the entire table.

We should know that indexes uses storage space, but their contribution to effect query processing outweighs the cost of additional storage.

Also this is not always the case. for example if most queries requires full scan on the table , then the index may not be used.

Index data does have a structure, let's see what are the types of index.

What are the types of index?

Databases differ on the kind of indexes they support, Some commonly use indexes are:

Balance trees or B-tree.
Hash indexes.
Bitmap indexes.

B-tree (Balance tree) index

As the name implies, B-tree index is a tree data structure with a root and nodes, the tree is balanced because the root node is the index value that splits the rang of values found in the index column.

It will be something like this.

                                       50                                             
                            ____________|________________________                       
                            |                                   |                       
                            25                                  75                      
                    _______|___________            _____________|___________            
                    |                 |            |                        |           
                    13                37           63                       87          
                ____|____                       ___|____                                
                |       |                       |       |                               
                3       5                       54      66

Each node in the left side will be less than the part node, and each one in the right side will be greater than the parent node. B-trees make efficient lookups because we can determine where in a tree a node is located by looking at the node value and branching to the left or to the right until we find the value in the tree.

B-trees is the default index for most storage engines in Mysql, It's used when there is a large number of distinct values in a column which is called high cardinality, It's also rebalanced as needed to keep the depth of the tree about the same for all paths which prevents a lopsided tree that would be fast to search on one side and slower on the other side, a lookup in the B-tree index depends on the depth of the tree it's logarithmic time.

Let's take an example: This is the schema we have.

CREATE TABLE people
(
    last_name     varchar(50)             not null,
    first_name    varchar(50)             not null,
    email         varchar(50)             not null,
    date_of_birth date                    not null,
    gender        enum ('male', 'female') not null
);

Creating an index will be like:

create index idx_people_search on people (last_name, first_name, gender)

Now if you tried to explain select query using where condition contain all these fields

EXPLAIN SELECT *
        FROM people
        WHERE last_name = 'Abohassan'
          AND first_name = 'Mohamed'
          AND gender = 'female'

You will find it using the the index we have created.

Hash index

The basis of this kind of index is a function called the hash function. Hash function takes an arbitrary length and map it to a fixed size value. Hash functions are designed so that different inputs produce different outputs. Occasionally two different inputs will produce the same output, but there's highly unlikely. in General even a slight change in input produce a different hash value. The hash value that is created can vary in size depending on which hash algorithm is used. There is no order preserving operations with hash functions

Here are a few things that we need to keep in mind about hash indexes.

They are used in only equality comparisons, They won't help if you won't to filter on a range of values
They are smaller than B-tree indexes but still as fast as B-tree indexes, which will be an advantage if we keep the entire hash index in memory.

But as we say before Hashes are not unique, Collisions can happen and they will slow down the lookup as Database will have to check more than a single row per index lookup

Let's take an example:

Let's create an index on people table on the emails of the users.

Creating an index will be like:

CREATE INDEX idx_people_email ON people (email) USING HASH

Now if you tried to explain select query using where condition contain email field

EXPLAIN SELECT *
        FROM people
        WHERE email = 'mohamed.abohassan.18@gmail.com';

You will find it using the the index we have created.

Bitmap index

Bitmap index is a special index that is well suited for the low cardinality columns. It stores a series of bits for indexed values, the number of bits used is the same as the number of distinct values in a column.

For example a column like gender in people table, it contains male or female, this will require two bits, one is corresponding to the male and one is corresponding to the female. As you see it's not restricted to boolean,

So Bitmap is used in low cardinality columns, or for performing boolean operations, such AND, OR, and NOT.

While Bitmap operations are fast, but updating Bitmap indexes can be more time consuming than other indexes, they tend to be used in read-intensive use cases like data warehouses.

But Bitmap indexes is not supported in MySQL(MySQL 8.0) till today, Postgres can calculate it on the fly, and Oracle supported it.

Summary

We have seen what is an execution plan and how it can help us achieving the best performance. Then we talked about the what is a full table scan and how it affect our query performance, Then we discussed how indexes and help us solve full table scan, and types of indexes we have, and in which scenario each type can give us the most efficient query.

That's it for today. Hope you enjoyed reading this. Thanks.

What is next?

I will take about:

Algorithms used in joining tables
Partitioning data
Materialized view

I'm a Senior Backend Engineer at Kaufland-Ecommerce. I work with PHP, GoLang, NodeJS, etc..

Linkedin GitHub