Deep dive into table indexing in Azure Dedicated SQL Pools (Azure SQL DW)
It is important to understand the concept of indexing as designing efficient indexes is paramount to achieving good database and application performance. Cloud databases allow both column store and row store tables.
Having covered table distributions in the previous article, we now turn to indexing. Database or table indexing is one of the most important concepts in database design because it has a huge impact on query performance and data storage.
Column stores physically store each column's data close together, so if you want to count or aggregate a column, it is easy to go to that column in storage and operate on it. Row stores, by contrast, are great for singleton lookups, or when you want all the information for a particular filter or query.
Column stores also allow heavy compression (up to around 10x) and bring other benefits that we are going to learn about.
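To make the difference concrete, here is a minimal Python sketch of the two layouts. The table, column names, and values are invented for illustration, and plain dictionaries and lists stand in for the real storage engine.

```python
# Hypothetical three-row sales table; names and values are illustrative only.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120},
    {"order_id": 2, "region": "US", "amount": 80},
    {"order_id": 3, "region": "EU", "amount": 200},
]

# Row store: each record is kept together, so a singleton lookup
# returns every column of one row in a single access.
row_store = {r["order_id"]: r for r in rows}

# Column store: each column's values are kept together, so an
# aggregate only has to touch the one column it needs.
column_store = {name: [r[name] for r in rows] for name in rows[0]}

single_row = row_store[2]                    # whole row for order 2
total_amount = sum(column_store["amount"])   # reads only the amount column
```

The lookup touches one record and the aggregate touches one list, which is the access pattern each layout is built for.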
An index is an on-disk structure associated with a table or view that speeds up the retrieval of rows. It contains keys built from one or more columns of the table or view, stored in a specialized structure, typically a B-tree, that lets SQL Server quickly find the row or rows associated with the key values.
This mechanism significantly improves the speed and efficiency of data retrieval. Indexes are automatically maintained for a table or view whenever the data is modified.
In Azure Dedicated SQL Pool, there are several indexing options that can be deployed depending on the workload:
- Clustered columnstore index
- Clustered index
- Non-clustered index
Without further ado, let’s discuss each in detail.
Clustered columnstore index
By default, dedicated SQL pool creates a clustered columnstore index when no index options are specified on a table.
But what is it exactly?
It could be best described visually
When data is written into a table, the first step is to divide the rows into manageable chunks called rowgroups, so that loading large datasets does not overflow memory.
A rowgroup typically holds the maximum allowable number of rows, which is 1,048,576. The columnstore index slices the table into rowgroups, and then compresses each rowgroup in a column-wise fashion.
The compressed columns within each rowgroup are called column segments. A column segment is the set of values for one column within one rowgroup, so each rowgroup has one column segment for every column in the table. The segments are compressed and stored together on physical media. Each segment also carries metadata, which allows segments to be eliminated quickly without reading them.
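The slicing and segment elimination described above can be sketched in a few lines of Python. This is a toy model, not the engine's implementation: the function names are invented, the rowgroup limit is shrunk to 4 rows so the example stays readable, and the segment metadata is assumed to be a per-segment minimum and maximum.

```python
def build_rowgroups(rows, max_rows):
    """Slice rows into rowgroups, then pivot each rowgroup into column segments."""
    rowgroups = []
    for start in range(0, len(rows), max_rows):
        chunk = rows[start:start + max_rows]
        segments = {}
        for col in chunk[0]:
            values = [r[col] for r in chunk]
            # Assumed metadata: the min and max value held by the segment.
            segments[col] = {"values": values, "min": min(values), "max": max(values)}
        rowgroups.append(segments)
    return rowgroups


def count_matching(rowgroups, col, low, high):
    """Count rows with low <= col <= high, skipping segments the metadata rules out."""
    matched, scanned = 0, 0
    for segments in rowgroups:
        seg = segments[col]
        if seg["max"] < low or seg["min"] > high:
            continue  # segment eliminated without reading its values
        scanned += 1
        matched += sum(low <= v <= high for v in seg["values"])
    return matched, scanned


rows = [{"id": i, "amount": i * 10} for i in range(1, 11)]  # ids 1..10
rowgroups = build_rowgroups(rows, max_rows=4)  # toy limit; the real one is 1,048,576
matched, scanned = count_matching(rowgroups, "id", 9, 10)
```

With ten rows and a limit of four, the table becomes three rowgroups, and the filter on ids 9 to 10 only has to read the last one.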
Delta rowgroups hold newly written rows. The columnstore stores this data temporarily in the delta store; once a delta rowgroup reaches the maximum number of rows, it transitions from the OPEN to the CLOSED state and a compressed rowgroup is built from it.
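A rough sketch of that OPEN to CLOSED transition is shown below. The class name and the tiny row limit are assumptions made for the example, and the later compression step is left out.

```python
class ToyDeltaStore:
    """Toy model of delta rowgroups; not the engine's actual implementation."""

    def __init__(self, max_rows):
        self.max_rows = max_rows
        self.rowgroups = [{"state": "OPEN", "rows": []}]

    def insert(self, row):
        current = self.rowgroups[-1]  # the last rowgroup is always the OPEN one
        current["rows"].append(row)
        if len(current["rows"]) == self.max_rows:
            # A full rowgroup is closed and a fresh OPEN one accepts new writes.
            current["state"] = "CLOSED"
            self.rowgroups.append({"state": "OPEN", "rows": []})


store = ToyDeltaStore(max_rows=3)  # toy limit; the real one is 1,048,576
for i in range(7):
    store.insert({"id": i})
states = [(g["state"], len(g["rows"])) for g in store.rowgroups]
```

Seven inserts with a limit of three leave two CLOSED rowgroups and one OPEN rowgroup holding the remaining row.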
A clustered columnstore index covers every column of the table and has no key columns, unlike a clustered rowstore index, which is defined on one or more key columns.
Why should you use a columnstore index?
- Values within columns typically belong to the same domain and often share similarities, leading to efficient compression rates. This minimizes or eliminates I/O bottlenecks in your system and substantially reduces the memory footprint.
- Enhancing compression rates contributes to better query performance by reducing the in-memory footprint. Consequently, SQL Server gains the ability to execute a greater number of query and data operations in memory, leading to improved overall query performance.
- Batch execution enhances query performance, usually achieving a two to fourfold improvement, as it processes multiple rows concurrently.
- Queries typically retrieve a limited set of columns from a table, thereby minimizing the overall input/output (I/O) operations on the physical storage media.
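The first point, that similar values in a column compress well, can be illustrated with run-length encoding. This is only a simple stand-in chosen for the example; the column contents are invented, and a real columnstore combines several encoding techniques.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Hypothetical region column: few distinct values, many repeats.
region_column = ["EU"] * 5 + ["US"] * 3 + ["APAC"] * 4
encoded = run_length_encode(region_column)
```

Twelve stored values shrink to three pairs. A row store that interleaves the region with other columns of each row offers no such runs to exploit.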
When should you use a columnstore index?
- Use a clustered columnstore index to store fact tables and large dimension tables in data warehousing workloads. It can significantly improve query performance and achieve up to 10x better data compression.
- Use a nonclustered columnstore index to run real-time analysis on an OLTP (Online Transaction Processing) workload.
- Internet of Things (IoT) workloads that insert large volumes of data with minimal updates and deletes are also a good fit.
Clustered index
Clustered indexes sort and store the data rows of a table or view according to their key values, that is, the columns the user specifies in the index definition. The crucial aspect of a clustered index is that it determines the physical order in which the data is stored. Unlike non-clustered indexes, there can only be one clustered index per table since the actual data rows are organized in a singular sequence.
When a clustered index is defined on a table, the data rows are not only logically but also physically ordered based on the specified key values. This not only facilitates efficient search and retrieval operations but also influences the overall storage structure.
Imagine a scenario where a table is indexed based on a timestamp column. In this case, the rows would be physically arranged in chronological order, simplifying queries that contain date-based filtering. However, it’s crucial to carefully consider the selection of columns for the clustered index, as it directly impacts how the data is stored on disk.
Rowstore indexes follow a B+ tree structure, where each page is referred to as an index node. At the top of the hierarchy is the root node, with leaf nodes constituting the bottom layer. Any levels between the root and leaves are considered intermediate.
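As a rough illustration of the root-to-leaf navigation, the sketch below models a two-level tree with one root node and three leaf pages. The keys, page contents, and the seek function are invented for the example; a real B+ tree also has intermediate levels and handles page splits.

```python
from bisect import bisect_right

# Leaf level: pages of (key, row) entries, kept in key order.
leaf_pages = [
    [(10, "row-10"), (20, "row-20")],
    [(30, "row-30"), (40, "row-40")],
    [(50, "row-50"), (60, "row-60")],
]
# Root node: separators[i] is the smallest key on leaf_pages[i + 1].
separators = [30, 50]


def seek(key):
    """Pick the leaf page from the root, then search only that page."""
    page_no = bisect_right(separators, key)
    for k, row in leaf_pages[page_no]:
        if k == key:
            return page_no, row
    return page_no, None  # key not present


found = seek(40)
missing = seek(55)
```

Each seek reads the root and a single leaf page instead of scanning all six entries.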
Furthermore, the limitation of having only one clustered index per table underscores the importance of strategic decision-making in index design. Database administrators need to weigh the trade-offs and prioritize the most critical or most frequently queried columns for the clustered index to ensure optimal performance and efficiency in data retrieval. Understanding clustered indexes empowers database architects and developers to make informed choices when designing and optimizing database structures.
When should you use a clustered index?
- Query patterns that return a range of values by using operators such as BETWEEN, >, >=, <, and <=. When the row with the first matched value is found by using the clustered index, rows with subsequent indexed values are guaranteed to be physically adjacent.
- Queries that return large result sets.
- Queries that use JOIN clauses (typically on foreign key columns) or ORDER BY or GROUP BY clauses.
- Columns that are unique or contain many distinct values. In SQL Server, creating a PRIMARY KEY constraint creates a clustered index to support it unless one already exists or a nonclustered index is requested; a dedicated SQL pool, by contrast, only accepts PRIMARY KEY when it is declared NONCLUSTERED and NOT ENFORCED.
- When a column is accessed frequently to sort the data retrieved from a table. It can be a good idea to cluster (physically sort) the table on that column, to save the cost of a sort operation every time the column is queried.
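The range-query case in the list above can be sketched with a sorted Python list standing in for a table clustered on a date column. The order_date column, the data, and the range_scan helper are all hypothetical.

```python
from bisect import bisect_left, bisect_right

# Rows kept sorted by the clustered key (a hypothetical order_date column).
table = [
    ("2024-01-03", "A"),
    ("2024-01-10", "B"),
    ("2024-02-01", "C"),
    ("2024-02-14", "D"),
    ("2024-03-05", "E"),
]
keys = [order_date for order_date, _ in table]


def range_scan(low, high):
    """Return rows with low <= key <= high as one contiguous slice."""
    start = bisect_left(keys, low)    # position of the first matching row
    end = bisect_right(keys, high)    # position just past the last matching row
    return table[start:end]


february = range_scan("2024-02-01", "2024-02-29")
```

Once the first matching row is located, the remaining matches sit right next to it, so the scan is one slice and needs no extra sort. ISO-formatted date strings are used so that string order matches date order.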
When should you not use a clustered index?
- Columns that undergo frequent changes. Changing a key value forces the whole row to move, because the database must keep the rows in physical key order. This is an important consideration in high-volume transaction processing systems in which data is typically volatile.
- Wide keys, meaning keys composed of several columns or several large columns. The clustered key is carried in every nonclustered index on the table, so a wide key makes all of them larger.
Non-clustered index
The primary purpose of nonclustered indexes is to enhance the performance of commonly executed queries, particularly those that aren’t addressed by the clustered index.
Much like how an index in a book facilitates efficient information retrieval, the query optimizer in database systems employs a similar strategy. It navigates through a nonclustered index to pinpoint the location of a specific data value within a table, allowing for direct retrieval of the desired information from that particular location.
Their B+ tree structure closely resembles that of clustered indexes, but with notable distinctions such as:
- Unlike clustered indexes, the actual data rows in the associated table are not sorted and stored in accordance with their nonclustered keys.
- In the case of nonclustered indexes, the leaf level contains index pages rather than data pages. These index pages at the leaf level encapsulate key columns as well as included columns, offering a unique structure that enhances performance and query optimization.
By default, a nonclustered index has a single partition. In case a nonclustered index consists of multiple partitions, each partition has a B+ tree structure that contains the index rows for that specific partition. For example, if a nonclustered index has 3 partitions, there are 3 B+ tree structures, with one in each partition.
The key advantage lies in the ability to tailor nonclustered indexes to specific query patterns, optimizing the retrieval process for frequently executed queries.
When should you use a nonclustered index?
- For databases or tables characterized by infrequent updates but substantial data volumes, leveraging multiple nonclustered indexes can significantly enhance query performance.
- Create multiple nonclustered indexes on columns involved in join and grouping operations, and a clustered index on any foreign key columns.
- Queries that don’t return large result sets.
- Performance gains can be achieved when the index contains all columns in the query. The query optimizer can locate all the column values within the index.
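The last point, an index that contains every column a query needs, can be sketched as follows. The table, the index on email with city as an included column, and the lookup helper are invented for the example, and a dictionary stands in for the B+ tree.

```python
# Base table keyed by its clustered key (customer_id). Hypothetical data.
base_table = {
    1: {"customer_id": 1, "email": "a@example.com", "city": "Oslo", "tier": "gold"},
    2: {"customer_id": 2, "email": "b@example.com", "city": "Lima", "tier": "basic"},
    3: {"customer_id": 3, "email": "c@example.com", "city": "Oslo", "tier": "basic"},
}

# Nonclustered index on email with city as an included column.
# Each entry maps the key to (row locator, included columns).
email_index = {
    row["email"]: (row["customer_id"], {"city": row["city"]})
    for row in base_table.values()
}


def lookup(email, columns):
    """Answer from the index alone when it covers the requested columns."""
    locator, included = email_index[email]
    available = {"email": email, "customer_id": locator, **included}
    if all(c in available for c in columns):
        return {c: available[c] for c in columns}, "index only"
    row = base_table[locator]  # extra hop into the base table
    return {c: row[c] for c in columns}, "index + base table lookup"


covered = lookup("b@example.com", ["city"])
not_covered = lookup("b@example.com", ["tier"])
```

The first query is answered from the index entry itself, while the second has to follow the row locator back to the base table.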
Both clustered and nonclustered rowstore indexes are stored as B+ trees, a data structure designed for fast data access and retrieval.
The index is basically a sorted list of keywords and next to each keyword is a set of page numbers pointing to the pages where each keyword can be found. This concept becomes more important as the data size increases along with increasing complexity in queries.
To further improve database performance, another technique, partitioning, can be used to filter and store data based on data access patterns. It allows subsets of data to be transferred or accessed quickly and efficiently, while maintaining the integrity of the data collection.
Thanks for reading!