Amazon Redshift deep dive (Part 1): Core design considerations

Chayan Shrang Raj
10 min read · Aug 2, 2024


In this series of blog posts, we’ll incrementally peel back the layers of the Amazon Redshift architecture, revealing the mechanisms and techniques that drive its performance. Whether you’re a seasoned professional or a curious newcomer, this deep dive will equip you with the knowledge to use Redshift to its fullest potential. We will cover the whole engineering life cycle, from understanding core design ideas to optimizing query performance.

Amazon Redshift data warehouse ecosystem

Table of contents:

  1. Core architecture components
  2. Data Processing (Massively parallel processing)
  3. Data Storage (Columnar data storage)
  4. Data compression (Compression Encodings)
  5. Data Retrieval (SQL Query Processing)

I have spent a good amount of time working with SQL queries on Redshift clusters without really knowing how the engine produced my result sets. One day my brain could not handle that unknown territory anymore, and I started setting aside time after work to understand the theory behind Redshift’s design ideas and architecture. It let me skip the fluff, and understanding Redshift’s architecture changed the way I thought about writing SQL queries and designing tables.

1. Core Architecture Components

Let’s dive in. There are many moving parts under the hood, and they work together blazingly fast thanks to top-notch networking and tightly coordinated components.

High level overview of an Amazon Redshift cluster architecture

Amazon Redshift is an RDBMS and has two offerings: Serverless and Provisioned. We are going to talk about Amazon Redshift Provisioned. The image above describes a typical ecosystem surrounding the data warehouse. One key element is the separation of the compute, acceleration, and storage layers, which introduces elasticity into the whole engine.

Client applications:
Amazon Redshift seamlessly integrates with a wide array of data loading and ETL (extract, transform, and load) tools, along with business intelligence (BI) reporting, data mining, and analytics tools. Built on the foundation of open-standard PostgreSQL, it ensures that the majority of existing SQL client applications can be used with minimal modifications.
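Because Redshift speaks the PostgreSQL wire protocol, you can connect with any standard PostgreSQL-compatible client (psql, JDBC, ODBC) and run familiar SQL. As a minimal sketch, assuming nothing beyond a working connection, the following queries confirm that lineage:

-- Redshift reports a PostgreSQL-based version string.
SELECT version();

-- Familiar PostgreSQL-style session functions also work.
SELECT current_database(), current_user;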

Cluster:
The core infrastructure component of an Amazon Redshift data warehouse is a cluster. A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication. Your client application interacts directly only with the leader node. The compute nodes are transparent to external applications.

Leader Node:
The leader node is responsible for receiving queries from client applications. It parses these queries and creates execution plans. The leader node then orchestrates the parallel execution of these plans across the compute nodes. It collects the intermediate results from the compute nodes, aggregates them, and finally sends the results back to the client applications.

Compute Nodes:
These nodes run the query execution plans and transmit data among themselves to serve these queries. The intermediate results are sent to the leader node for aggregation before being sent back to the client applications.

Node slices:
A compute node is divided into slices, each of which is assigned a portion of the node’s memory and disk space to handle part of the node’s workload. The leader node is responsible for distributing data to these slices and delegating portions of queries or other database tasks to them. Each slice then operates concurrently to process the assigned workload. The number of slices within a node is based on the cluster’s node size.
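If you are curious how many slices your own cluster exposes, the STV_SLICES system view maps each slice to the node that owns it. This is a minimal sketch, assuming you have permission to read system tables:

-- Count the slices on each compute node of the current cluster.
SELECT node,
       COUNT(*) AS slices
FROM stv_slices
GROUP BY node
ORDER BY node;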

Redshift Managed Storage:
Redshift managed storage comes into play based on the type of compute node chosen. There are two compute node types: RA3 and DC2. With RA3 nodes, Amazon Redshift managed storage uses large, high-performance SSDs in each node for fast local storage and Amazon S3 for longer-term durable storage. We will discuss this in more detail in the next section.

Databases:
A cluster contains one or more databases. User data is stored on the compute nodes. Your SQL client communicates with the leader node, which in turn coordinates query run with the compute nodes.

Amazon Redshift is based on PostgreSQL. It is a relational database management system (RDBMS), so it is compatible with other RDBMS applications. Although it provides the same functionality as a typical RDBMS, including online transaction processing (OLTP) functions such as inserting and deleting data, Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets.

2. Massively Parallel Processing (MPP)

Amazon Redshift uses an MPP architecture to process data quickly, even for complex queries over vast amounts of data. Multiple compute nodes run the same query code on different portions of the data to maximize parallel processing.

High level overview of MPP Architecture

Parallelism —
Massively Parallel Processing (MPP) systems enhance processing speed by breaking down large tasks into smaller sub-tasks, which are then executed simultaneously across multiple processors. This approach leverages the power of concurrent processing to significantly boost overall performance and efficiency.

Shared-Nothing Architecture —
Each processor in an MPP system typically has its own memory and storage, reducing contention and improving performance.

Data Distribution —
Data is distributed across the compute nodes in an MPP system (in a single-node cluster, the same node acts as both leader and compute node). Distributing the data enables parallel processing, and an efficient distribution strategy is critical for performance.
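As a sketch of how distribution is controlled in practice, the hypothetical tables below (the names are illustrative, not from this article) use DISTKEY to co-locate rows that join on the same key and DISTSTYLE ALL to replicate a small dimension table to every node:

-- Hypothetical fact table: rows are hashed on customer_id,
-- so all rows for a given customer land on the same slice.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id);

-- Small, frequently joined dimension tables can instead be
-- replicated in full to every node.
CREATE TABLE region (
    region_id   INT,
    region_name VARCHAR(64)
)
DISTSTYLE ALL;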

Fault Tolerance —
MPP systems often incorporate redundancy and fault-tolerance mechanisms to ensure reliability. To avoid data loss, each node keeps multiple copies of its data according to a replication factor, avoiding the risk of a single point of failure.

3. Columnar Data Storage

In a typical relational database table, each row holds information for a single record. In row-wise storage, data is stored in blocks, with each block containing the values of all columns for one row. If the block size is smaller than the record, the record might be split across multiple blocks. If the block size is larger than the record, some space in the block might go unused, wasting disk space.

For online transaction processing (OLTP) applications, most transactions involve reading and writing all values for one or a few records at a time. Therefore, row-wise storage works best for OLTP databases.

Records from database tables are typically stored into disk blocks by row.

In columnar storage, each block of data keeps the values for just one column across many rows. When new records come in, Amazon Redshift automatically changes the data into this column-based format for each column.

Each data block can hold up to three times more column values than row-based storage. This means that to read the same amount of data, columnar storage only needs one-third of the I/O operations compared to row-based storage.

In a data warehouse, queries usually need just a few columns but many rows of data. This means that reading this data needs fewer I/O operations and less memory compared to processing data stored row by row.

With columnar storage, the values for each column are stored sequentially into disk blocks.

Block storage is a technology that breaks data, like files or database entries, into equally sized blocks. These blocks are stored on physical storage in a way that allows quick and easy access. Think of block storage as a direct route to the data. On the other hand, file storage has an extra layer, such as a file system (NFS, SMB), that must be processed before accessing the data.

Typical database blocks are usually between 2 KB and 32 KB. Amazon Redshift, however, uses 1 MB blocks. This larger size makes it more efficient and reduces the number of I/O requests needed for loading data or running queries.
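You can observe this block layout yourself: the STV_BLOCKLIST system view lists every 1 MB block per column and per slice. The sketch below assumes a populated table named event (the same table used in the EXPLAIN example later in this post) and access to system tables:

-- Count 1 MB blocks per column for the 'event' table.
-- Each row in STV_BLOCKLIST represents one 1 MB block.
SELECT bl.col,
       COUNT(*) AS block_count
FROM stv_blocklist bl
JOIN svv_table_info ti ON ti.table_id = bl.tbl
WHERE ti."table" = 'event'
GROUP BY bl.col
ORDER BY bl.col;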

4. Compression Encodings

Columnar storage not only facilitates efficient query processing but also plays a crucial role in database storage management. As discussed in the previous section, each column is stored contiguously in blocks, and it is safe to assume that the values within a single column are similar in data type, range, and distribution, which makes them well suited to compression.

Compression is a column-level operation that reduces the size of data when it is stored. Compression conserves storage space and reduces the size of data that is read from storage, which reduces the amount of disk I/O and therefore improves query performance.

Compression can be applied automatically or manually. I was deeply interested in understanding column compression, so I will share what I learned here.

A compression encoding specifies the type of compression that is applied to a column of data values as rows are added to a table. Columns are populated with values of diverse data types such as strings, numerics, integers, dates, and timestamps, which gives us the choice of multiple encodings offered in Redshift. This is how we can define column compression:

CREATE TABLE table_name (
    column_name data_type ENCODE encoding_type
    [, ...]
);

Types of encodings:

  1. RAW — Raw encoding is the default encoding for columns that are designated as sort keys and columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types. With raw encoding, data is stored in raw, uncompressed form.
  2. AZ64 — At its core, the AZ64 algorithm compresses smaller groups of data values and uses single instruction, multiple data (SIMD) instructions for parallel processing. Use AZ64 to achieve significant storage savings and high performance for numeric, date, and time data types.
  3. Byte-dictionary — A separate dictionary of unique values is created for each block of column values on disk. This encoding is very effective on low-cardinality string columns and is optimal when the data domain of a column is fewer than 256 unique values.
  4. Delta — Delta encodings are very useful for date time columns. Delta encoding compresses data by recording the difference between values that follow each other in the column. This difference is recorded in a separate dictionary for each block of column values on disk.
  5. LZO — LZO encoding provides a very high compression ratio with good performance. LZO encoding works especially well for CHAR and VARCHAR columns that store very long character strings. They are especially good for free-form text, such as product descriptions, user comments, or JSON strings.
  6. Mostly — Mostly encodings are useful when the data type for a column is larger than most of the stored values require. By specifying a mostly encoding for this type of column, you can compress the majority of the values in the column to a smaller standard storage size. The remaining values that cannot be compressed are stored in their raw form.
  7. Run length encoding — Run length encoding replaces a value that is repeated consecutively with a token that consists of the value and a count of the number of consecutive occurrences (the length of the run). A separate dictionary of unique values is created for each block of column values on disk.
  8. ZSTD — It works especially well with CHAR and VARCHAR columns that store a wide range of long and short strings, such as product descriptions, user comments, logs, and JSON strings. Where some algorithms, such as Delta encoding or Mostly encoding, can potentially use more storage space than no compression, ZSTD is very unlikely to increase disk usage.

If you choose to apply compression encodings manually, you can run the ANALYZE COMPRESSION command against an already populated table and use the results to choose compression encodings.
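As a concrete sketch, the hypothetical table below (names are illustrative) combines several of the encodings described above, and ANALYZE COMPRESSION then reports what Redshift would recommend once the table is populated:

-- Hypothetical table with explicit, per-column encodings.
CREATE TABLE user_events (
    event_id   BIGINT        ENCODE az64,      -- numeric: AZ64
    event_time TIMESTAMP     ENCODE az64,      -- date/time: AZ64
    event_type VARCHAR(32)   ENCODE bytedict,  -- low cardinality: byte-dictionary
    payload    VARCHAR(4096) ENCODE zstd       -- free-form text: ZSTD
);

-- After loading data, ask Redshift which encodings it would suggest.
ANALYZE COMPRESSION user_events;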

5. Query Processing

When you send a SQL query to Amazon Redshift, it first goes through a parser and optimizer, which create a plan for how to handle the query. Then, the execution engine turns this plan into code and sends it to the compute nodes to be carried out.

High-level view of the query planning and execution workflow

Execution steps:

  1. The query is submitted to the leader node, which parses the SQL.
  2. The parser produces an initial query tree, a logical representation of the original query. The query tree stores all the individual statements and organizes them to be passed on to the optimizer.
  3. The optimizer evaluates and, if necessary, rewrites the query to maximize its efficiency. For example, if you filter results inefficiently after a join, the optimizer can push the filter earlier. This process sometimes results in creating multiple related queries to replace a single one.
  4. The optimizer generates a query plan (or several, if the previous step resulted in multiple queries) for the execution with the best performance. The query plan specifies execution options such as join types, join order, aggregation options, and data distribution requirements. The query plan is a fundamental tool for analyzing and tuning complex queries.
explain select eventname, count(*) from event group by eventname;

QUERY PLAN
-------------------------------------------------------------------
XN HashAggregate (cost=131.97..133.41 rows=576 width=17)
  ->  XN Seq Scan on event (cost=0.00..87.98 rows=8798 width=17)

5. The execution engine translates the query plan into steps, segments, and streams:
Step
Each step is an individual operation needed during query execution. Steps can be combined to allow compute nodes to perform a query, join, or other database operation.
Segment
A combination of several steps that can be done by a single process; it is also the smallest compilation unit executable by a compute node slice. A slice is the unit of parallel processing in Amazon Redshift. The segments in a stream run in parallel.
Stream
A collection of segments to be parceled out over the available compute node slices.

The execution engine generates compiled code based on steps, segments, and streams. Compiled code runs faster than interpreted code and uses less compute capacity. This compiled code is then broadcast to the compute nodes.

6. The compute node slices run the query segments in parallel. As part of this process, Amazon Redshift takes advantage of optimized network communication, memory, and disk management to pass intermediate results from one query plan step to the next. This also helps to speed query execution.

Steps 5 and 6 happen once for each stream. The engine creates the executable segments for one stream and sends them to the compute nodes. When the segments of that stream are complete, the engine generates the segments for the next stream. In this way, the engine can analyze what happened in the prior stream (for example, whether operations were disk-based) to influence the generation of segments in the next stream.

When the compute nodes are done, they return the query results to the leader node for final processing. The leader node merges the data into a single result set and addresses any needed sorting or aggregation. The leader node then returns the results to the client.
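If you want to see the streams, segments, and steps your own queries were broken into, the SVL_QUERY_SUMMARY system view exposes them per query. This is a minimal sketch; the query ID is hypothetical and would normally come from STL_QUERY:

-- Inspect the stream (stm), segment (seg), and step breakdown
-- of a completed query, including whether any step spilled to disk.
SELECT stm, seg, step, label, rows, is_diskbased
FROM svl_query_summary
WHERE query = 12345   -- hypothetical query ID from STL_QUERY
ORDER BY stm, seg, step;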

Thank you for your attention and for staying with me until the end.

Let’s connect — LinkedIn GitHub
