Deep dive into design choices for table distributions in Azure Dedicated SQL Pool (SQL DW)
A cloud data warehouse is a contemporary approach to storing and managing extensive datasets within a public cloud environment.
One of the most appealing aspects of adopting cloud technology is the speed at which products can be shipped to market, and at much lower cost.
It facilitates rapid data accessibility and utilization, making it an ideal solution for businesses that depend on data and need a flexible, agile, and user-friendly infrastructure. There are many players in this field, dominated largely by Amazon, Microsoft, and Google.
When preparing for the Microsoft Azure Data Engineer Associate (DP-203) certification, I went deeper into the design choices for production tables in Azure Dedicated SQL Pool, formerly Azure SQL Data Warehouse, along with their pros and cons. One thing is clear: whenever a cloud storage technology is adopted, it is inherently designed for scalability and availability, and how well it delivers depends on how the data is distributed across multiple storage nodes and on the data access query patterns.
To provide scalability and efficiency, data stored in cloud platforms resides across multiple storage nodes and is processed using decoupled compute resources. This fundamental design is central to all cloud platforms, with each providing additional unique features on top of it. To understand data management in the storage layer, we need to understand its architecture.
The architecture of Dedicated SQL pool is based on nodes. When applications connect and send T-SQL commands, they interact with a Control node. The Control node houses the distributed query engine, which optimizes queries for parallel processing. Subsequently, it delegates operations to compute nodes to perform their tasks concurrently. The architecture looks something like below:
The Control node acts as the brain of the system: it serves as the interface for all applications and connections. It hosts the distributed query engine, responsible for optimizing and coordinating parallel queries. Upon receiving a T-SQL query, the Control node transforms it into parallel queries that operate on each distribution.
Compute nodes are responsible for delivering computational capabilities, with distributions being assigned to these nodes for processing. As you scale up your compute resources, distributions are dynamically reallocated to the currently available compute nodes. The number of compute nodes can range from 1 to 60, and it is dictated by the Synapse SQL service level you have chosen.
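For instance, assuming a pool named SalesDW (a hypothetical name), scaling the service level is a single T-SQL statement; a minimal sketch:
-- Run from the master database: change the service level, and with it the compute.
-- 'DW1000c' is an example target; distributions are remapped to the new node set.
ALTER DATABASE SalesDW MODIFY (SERVICE_OBJECTIVE = 'DW1000c');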
DMS, or Data Movement Service, serves as the data transport technology responsible for orchestrating the transfer of data among compute nodes. In scenarios where parallel queries demand data movement for precise results, DMS ensures the accurate and efficient relocation of the necessary data to the designated locations.
Consider that you are tasked with setting up a data warehouse on Microsoft Azure for the sales team at a large enterprise. This is a big responsibility, as your architecture choices will shape and contribute to the company’s overall tech stack. Considerable heed must be given to the amount of data produced, latency, access patterns, and other factors when building the data storage layer from scratch.
To optimize storage and query performance, you would want to assess your storage design choices. There are three techniques that could be utilized to achieve better performance and lower costs:
- Distributions
- Indexing
- Partitioning
Let’s not shy away from any of the details and discuss each of them in depth.
Distributions
This is the first technique to consider when optimizing physical data storage in a distributed environment. The right choice depends on, among other factors:
- How large is the table? Is the size volatile?
- How often is the table refreshed? Daily? Hourly?
- Do you have fact and dimension tables in the warehouse?
A distribution serves as the fundamental entity for storing and processing parallel queries on distributed data within SQL Pool or for that matter, any distributed environment. When executing a query, Synapse SQL breaks down the workload into 60 smaller queries that operate concurrently. Each of these 60 smaller queries is executed on an individual data distribution, and each compute node is responsible for managing one or more of these distributions.
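The 60-distribution layout is visible from the catalog; a small sketch using dedicated SQL pool system views:
-- A dedicated SQL pool always has 60 distributions.
SELECT COUNT(*) AS distribution_count FROM sys.pdw_distributions;
-- The distribution policy (HASH, ROUND_ROBIN, REPLICATE) of each user table.
SELECT t.name, dp.distribution_policy_desc
FROM sys.tables AS t
JOIN sys.pdw_table_distribution_properties AS dp
    ON t.object_id = dp.object_id;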
There are three flavors of table distributions in Azure SQL pool:
Hash distribution
A hash-distributed table allocates table rows among compute nodes by employing a deterministic hash function that assigns each row to a specific distribution. It is optimized for data reads, not writes: joins and aggregations on the distribution column are fast, while loads pay the cost of evaluating the hash. In the definition of the table, a specific column is identified as the distribution column, and the hash function uses the values within this column to allocate each row to a particular distribution. Hash distribution gets the most consideration here, since it involves a good understanding of the data in use and has more design choices to weigh than round robin and replicated tables (a DDL sketch follows the considerations below).
Each row is assigned to exactly one distribution, the assignment is calculated by a deterministic hash function, and the number of rows in each distribution depends on the size of your data. Hash-distributed tables are effective for managing large fact tables within a star schema, demonstrating robust performance even when dealing with a substantial volume of rows.
In a Massively Parallel Processing (MPP) database, the system resources across all nodes within the cluster — comprising memory, storage, and processors — are segmented into logical entities known as data slices. In the context of hash distribution, the database manager allocates data across these slices by employing a hashing algorithm on the values within the columns designated as the distribution key in the table rows.
Considerations for using a hash-distributed table:
- The table size on disk is more than 2 GB (The choice of distribution key has a more pronounced impact on query performance for larger tables, while its influence is relatively less significant for smaller tables.).
- The table has frequent insert, update, and delete operations (which is generally the case with fact tables under star schema, in a warehouse, that can grow to really big tables).
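Syntactically, the distribution key is declared in the table’s WITH clause. A minimal sketch, using a hypothetical dbo.OrderTable like the one in the round-robin example later in this article:
CREATE TABLE dbo.OrderTable
( OrderId int NOT NULL
, Date date NOT NULL
, Name varchar(20)
, Country varchar(20)
)
WITH
( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH(OrderId) -- all rows with the same OrderId land in the same distribution
);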
Choosing a hash distribution key (column) for a table in an MPP database:
The hash distribution key is a user-defined choice, and to select an optimal key it is essential to gather details about the table and understand its usage patterns. A well-chosen hash distribution key can optimize query performance in two key ways:
- Ensuring even distribution of table data across all data slices — This would allow optimal parallelization of query processing. Also, if the data is distributed evenly, it would make sure that the available storage space is used efficiently.
- Collecting and keeping commonly fetched rows together from different tables. The idea is to create a path of minimum movement when different tables are queried.
While co-locating related rows is beneficial, prioritizing even distribution generally offers greater advantages.
In the context of smaller tables, the choice of distribution key has a comparatively limited impact on query performance. Conversely, for larger tables, the selection of the distribution key plays a pivotal role in determining query efficiency.
Designing a system is all about balancing historical and future patterns, and hence you need to understand your data well, for example:
- Table definition — primary key, surrogate key, foreign key, number of generated columns, data types
- Table data — number of unique values, distribution of values, NULL values (a quick profiling sketch follows this list)
- Table operations — JOIN, GROUP BY, DISTINCT, OVER, and HAVING clauses
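For the table-data bullet, a quick profiling query is often enough; a minimal sketch, run here against the EMPLOYEE table defined below with HIRE_DATE as the candidate:
-- Profile a candidate distribution column: many duplicates or many NULLs both
-- cause skew (all NULL values hash to the same distribution).
SELECT
    COUNT(*)                                           AS total_rows,
    COUNT(DISTINCT HIRE_DATE)                          AS distinct_values,
    SUM(CASE WHEN HIRE_DATE IS NULL THEN 1 ELSE 0 END) AS null_rows
FROM EMPLOYEE;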
It would be nice to understand the thought process with the help of an example.
Consider a table, EMPLOYEE, with the following columns:
CREATE TABLE EMPLOYEE
(
ID INT NOT NULL,
NAME VARCHAR(250) NOT NULL,
RESUME VARBINARY(MAX), -- T-SQL has no BLOB type; VARBINARY(MAX) is the closest equivalent
HIRE_DATE DATE,
MANAGER CHAR(1),
OFFICE_NUMBER INT
)
Additional information:
- The value of the MANAGER column can be only ‘Y’ or ‘N’.
- Managers would like to query against this table regularly and another table, called SALES, to generate reports of the total sales by each employee:
... WHERE SALES.SALES_PERSON_ID = EMPLOYEE.ID ...
- The company is only a year old and 50% of employees were hired last May.
Let’s talk through it and use the following logic to choose a hash distribution key for the EMPLOYEE table:
- MANAGER can hold only ‘Y’ or ‘N’, so all rows would hash to just two of the 60 distributions, producing severe skew.
- HIRE_DATE is also skewed: the company is a year old and 50% of employees were hired last May, so half the table would land together.
- ID is unique and NOT NULL, so it spreads rows evenly, and it is the very column used in the join with SALES (SALES.SALES_PERSON_ID = EMPLOYEE.ID), so matching rows are co-located. ID is the strongest candidate.
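Putting it together, a minimal sketch of the resulting DDL; HEAP is used here because dedicated SQL pool does not allow VARBINARY(MAX) columns in a clustered columnstore index:
CREATE TABLE EMPLOYEE
(
ID INT NOT NULL,
NAME VARCHAR(250) NOT NULL,
RESUME VARBINARY(MAX),
HIRE_DATE DATE,
MANAGER CHAR(1),
OFFICE_NUMBER INT
)
WITH
( HEAP
, DISTRIBUTION = HASH(ID) -- unique join key: even spread, no movement on the SALES join
);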
How to tell if your distribution key is a good choice?
- Determine if the table suffers from data skew: once data has been loaded into a hash-distributed table, examine the spread of rows across the 60 distributions to ensure it is reasonably even. A variance of up to 10% in the number of rows per distribution is acceptable without significant impact on performance. How to check data skew depends on the specific platform; a sketch for dedicated SQL pool follows this list.
- Check query plans that result in data movement: Efficient distribution column selection is crucial for optimizing joins and aggregations by minimizing data movement. When dealing with hash-distributed tables, it’s beneficial to have one of the join columns aligned with the distribution column or columns. This ensures that, when joining two hash-distributed tables on a distribution column of the same data type, data movement is unnecessary. Moreover, the join can leverage additional columns without incurring the overhead of data movement. To minimize data movement during such operations:
- Hash distribution should be implemented on a join column in the tables involved.
- Ensure consistency in data types for the join columns across both tables.
- Utilize the equals operator when joining columns.
- Avoid using a CROSS JOIN as the join type.
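As a concrete sketch for dedicated SQL pool, assuming the EMPLOYEE table above plus a SALES table whose AMOUNT column is hypothetical here:
-- Row counts and space consumed per distribution; a wide variance signals skew.
DBCC PDW_SHOWSPACEUSED("dbo.EMPLOYEE");
-- EXPLAIN returns the distributed query plan as XML; SHUFFLE_MOVE or
-- BROADCAST_MOVE steps in it indicate data movement between compute nodes.
EXPLAIN
SELECT E.NAME, SUM(S.AMOUNT) AS TOTAL_SALES
FROM SALES AS S
JOIN EMPLOYEE AS E
    ON S.SALES_PERSON_ID = E.ID
GROUP BY E.NAME;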
Pros —
- Can handle very large amounts of data, in the terabytes to petabytes range.
- Data reads are highly efficient and fast as the location of the data is already known.
- Minimizes data movement, since row placement is a deterministic function of the user’s chosen distribution key.
- Gives control to the user as the process is not random.
Cons —
- It is not suitable for the fastest possible writes, since the hash function must be evaluated to determine each row’s location; round-robin tables load faster.
- Its benefits show mainly on complex queries and heavy workloads; for small, simple workloads the design effort may not pay off.
- It needs an extensive study of the data, its types, size, and access patterns, so hash distribution usually takes time to set up.
Round robin distribution
Instead of using a specific key or function to determine where records will be placed, the system assigns rows to distributions in round-robin fashion: a row goes to one distribution, the next row to the next, and so on, making the assignment effectively random. Unlike hash-distributed tables, rows with equal values are not guaranteed to be assigned to the same distribution.
With hash distribution, the hash function always takes some time to determine where to place a record, so writes are slower while reads are fast. Round robin is the exact opposite: since the allocation requires no key evaluation, data loading is very fast.
Use Cases:
- You could use round robin when loading initial data into a staging table in Synapse from the outside world.
- When query patterns indicate that there is no obvious joining key.
- If there is no good candidate column for hash distributing the table.
- If the table does not share a common join key with other tables.
- If the join is less significant than other joins in the query.
- When the table is a temporary staging table.
Joining a round-robin table usually requires reshuffling the rows, which is a performance hit.
You can create a table with round robin distribution as follows:
CREATE TABLE dbo.OrderTable
( OrderId int NOT NULL
, Date date NOT NULL
, Name varchar(20)
, Country varchar(20)
)
WITH
( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = ROUND_ROBIN
);
Pros —
- It is suitable for fast loading (writes), since rows are assigned without evaluating a distribution key.
- It works well on simple queries with equality predicates and simple functions.
- Works best when staging data for initial loading or for simple transformation as it loads very fast.
Cons —
- Not suitable for large datasets, unless the data is non-volatile in nature.
- Joins and aggregations are cost-heavy, as they entail data movement between compute nodes.
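A common pattern follows from these trade-offs: land data quickly in a round-robin staging table, then redistribute it with CTAS (CREATE TABLE AS SELECT) into a hash-distributed production table. A sketch, where dbo.OrderTable_Prod is a hypothetical target name:
-- Redistribute staged round-robin data into a hash-distributed table in one pass.
CREATE TABLE dbo.OrderTable_Prod
WITH
( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH(OrderId)
)
AS
SELECT * FROM dbo.OrderTable;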
Replicated distribution
As the name suggests, a full copy of the table is accessible on each compute node. As you may have guessed, this technique eliminates the need for data movement between nodes before a join or aggregation. These tables work best when they are smaller in size (something like a few GB; under 2 GB compressed is the common guidance), since it is not optimal to store copies of large data on each node.
The following diagram shows a replicated table that is accessible on each Compute node:
Replicated tables are effective as dimension tables in a star schema. Dimension tables are commonly joined to fact tables, which are distributed differently from the dimension tables. Dimension tables are usually of a manageable size, allowing for the storage and upkeep of multiple copies. These tables store descriptive information that evolves gradually, such as customer names and addresses, as well as product details. The gradual pace of data change results in reduced maintenance requirements for replicated tables.
When deciding whether to distribute or replicate a table, consider the nature of the queries you intend to execute. Opt for replicated tables when dealing with straightforward query predicates like equality or inequality. For queries involving complex predicates such as LIKE or NOT LIKE, distributed tables are more suitable.
Since any modification to a replicated table triggers a copy operation to each of the compute nodes, it may affect query performance and costs, so it becomes essential to plan out your queries.
A replicated table in an SQL pool is managed by retaining a primary version, referred to as the master version. This master version is replicated to the initial distribution database on every Compute node. In case of any modifications, the master version is updated first, followed by the reconstruction of the tables on each Compute node. The reconstruction process for a replicated table involves duplicating the table onto each Compute node and subsequently constructing the associated indexes.
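This lifecycle is observable through the catalog view sys.pdw_replicated_table_cache_state; a small sketch:
-- Cache state per replicated table: 'NotReady' after a modification,
-- 'Ready' once the copies on the Compute nodes have been rebuilt.
SELECT t.name, c.state
FROM sys.pdw_replicated_table_cache_state AS c
JOIN sys.tables AS t
    ON c.object_id = t.object_id;
-- Any simple query against the table (even SELECT TOP 1) triggers the rebuild.
SELECT TOP 1 * FROM dbo.OrderTable;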
You can create a table with replicate distribution as follows:
CREATE TABLE dbo.OrderTable
( OrderId int NOT NULL
, Date date NOT NULL
, Name varchar(20)
, Country varchar(20)
)
WITH
( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = REPLICATE
);
Pros —
- Suitable when dealing with lots of joins and aggregations, as replicated copies eliminate data movement.
- Good performance when used for dimension tables in a data warehouse.
- Suitable for tables with few data manipulation operations.
Cons —
- Not suitable for large tables, as replicating big data is inefficient and a poor storage choice.
- Rapid data changes force the replicas and their indexes to be rebuilt, so it is not suitable for change-heavy datasets.
As discussed, distribution is one of the techniques to optimize data storage and query performance. Next, I will dive deep into indexing, a very important topic because the concept applies not only to Massively Parallel Processing architectures but is a pertinent practice when designing any database.
Thanks for reading! Hope it was useful!