SQL Window Functions and CTEs
Window functions are significantly different from other available SQL functions and keep a special place in a Data Engineer’s heart. It is also one of the most asked ‘advanced’ questions in job interviews as well as with the progression of your data engineering career, you will come across window functions more frequently since they would help answer some of the more complex business questions.
I am very excited to dive deeper into the working of window functions, and further expand into the nitty gritties of its operation and think about some of the key answers such as:
What is a window function?
What are the use cases of window functions?
How do you use a window function?
How a Common Table Expression helps simplify WF?
What are RANK(), DENSE_RANK(), ROW_NUMBER()?
Let’s go!
1. What is a window function?
Window functions, as the name suggests, work in windows, meaning the user has to create a window or in SQL context gather a set of rows and perform some operations on it. The operations could be simple SUM(), AVG(), MIN(), or complex LEAD(), LAG(), RANK(), DENSE_RANK(). Window functions are aware of their surroundings, like what the other rows are, what are the computations that can be performed with these sets of rows.
Regular aggregate functions take each row as a separate set. Use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.
There are some key building blocks when defining a window function. Let us understand the basics of these elements.
What is a window?
A window is a set of rows determined by the OVER
clause. Without OVER it’s not a window function. A window function call always contains an OVER
clause directly following the window function's name and arguments. OVER
clause contains a few keywords that are used to compose a window function:
- PARTITION BY — It determines how many window ‘slices’ there will be. It is confusing, I know, but let’s break it down. For example, let’s imagine you want to calculate the second highest salary in each department of your company. When you hear this, you should understand that when it says each department, it means aggregate salaries for each department and give me the second highest, so the Grouping method here should be department. So
PARTITION BY
groups these rows and perform operations on it. It is also possible to omitPARTITION BY
, in which case there is a single partition containing all rows. - ORDER BY — It simply orders by the designated column(s) the same way the
ORDER BY
clause would, except that it treats every partition as separate. It is used after partition by clause to order the random rows returned.ORDER BY
can be omitted if the ordering of rows is not important.
But you would ask, how do we know what to use window functions? So, actually it could be a bit intricate to understand from simple English but these functions are commonly linked with English expressions like Rolling Average/Rolling Sum, Rank, Ordinal numbers, Consecutive appearances, Deduplicate. So if you come across these mentions while thinking about a business problem, do explore window functions once.
Real life example
There is nothing better to actually learn something than applying the theoretical knowledge in the real world. We are working with NBA players dataset that describes the statistics of all time players in NBA such as age, height, college, team, points, seasons, etc.
SELECT * FROM nba_player_seasons
LIMIT 50
Our goal here is to find top 10 players who have had the most consecutive 20 points seasons.
As mentioned before, we could take a hint from the word ‘consecutive’ above and start to break the high level question into achievable small goals.
SELECT player_name
, season
, pts AS points
FROM nba_player_seasons
Step 1 — In order to compare the current season with the previous season, we need to be aware of the row before the current row, right? We can do this by using LAG
function and it takes one argument which is the number of previous rows to consider. Additionally, we want to know the players, so our window will be partitioned based on player_name.
SELECT player_name
, season
, pts AS points
, LAG(pts, 1) OVER (PARTITION BY player_name ORDER BY season) AS
points_last_season
FROM nba_player_seasons
Now, see the column points_last_season which also includes null values, if you notice, the rows in this column have moved one row lower and exhibits diagonal behavior. Null values will be present when there are no rows before that and it is super common. Similarly LEAD
will produce one row after it in the current row.
It is a good time to introduce CTEs here because we are going to wrap these queries in them. CTE can created using WITH
keyword. It is basically encapsulation and naming of a subquery that considerably increases readability and query debugging. So giving the above virtual table a name as ‘lagged’ and then querying out of it creates a more accessible way of working with SQL queries.
WITH lagged AS (
SELECT player_name
, season
, pts AS points
, LAG(pts, 1) OVER (PARTITION BY player_name ORDER BY season) AS
points_last_season
FROM nba_player_seasons
)
SELECT * FROM lagged
Step 1 — Now, we can define a 20 point streak when both previous and current season are 20 points or more. So, we need a conditional statement where this can be checked, that is, a marker that tracks the rows where the 20 points rule has changed.
WITH lagged AS (
SELECT player_name
, season
, pts AS points
, LAG(pts, 1) OVER (PARTITION BY player_name ORDER BY season) AS
points_last_season
FROM nba_player_seasons
)
, streak_identifier AS (
SELECT *
, CASE WHEN points >= 20 AND points_last_season >= 20 THEN 0 ELSE 1 END AS
points_stayed_above_20
FROM lagged
)
SELECT * FROM streak_identifier
The result shows 0 whenever the points have stayed above 20 in the previous season and continued in the current season, this gives access to our streak. We can see the clear comparison between ‘points’ and ‘points_last_season’ and build the streak tracker. We again encapsulated this query under a CTE called ‘streak_identifier’.
Step 3 — Next, if we count the number of occurrences of zeros in column ‘points_stayed_above_20’, we can formulate the number of consecutive seasons where points stayed above 20 for each player and for each season. Since, we want to cater to each row inside the group ‘player_name’, we will again use a window function to sum the streaks.
WITH lagged AS (
SELECT player_name
, season
, pts AS points
, LAG(pts, 1) OVER (PARTITION BY player_name ORDER BY season) AS
points_last_season
FROM nba_player_seasons
)
, streak_identifier AS (
SELECT *
, CASE WHEN pts >= 20 AND points_last_season >= 20 THEN 0 ELSE 1 END AS
points_stayed_above_20
FROM lagged
)
, streak_count AS (
SELECT *
, SUM(points_stayed_above_20) OVER (
PARTITION BY player_name ORDER BY season) AS streak_continue
FROM streak_identifier
)
SELECT * FROM streak_count
Step 4 — The result above includes season where points are below 20 also, so it needs to be excluded. It can be done using a CASE
statement where the points are above 20. We can also include the start season and end season where the consecutive 20 point rally was there. We also GROUP BY
‘player_name’ and ‘streak_continue’.
-- CTE 1
WITH lagged AS (
SELECT player_name
, season
, pts AS points
, LAG(pts, 1) OVER (PARTITION BY player_name ORDER BY season) AS
points_last_season
FROM nba_player_seasons
)
-- CTE 2
, streak_identifier AS (
SELECT *
, CASE WHEN pts >= 20 AND points_last_season >= 20 THEN 0 ELSE 1 END AS
points_stayed_above_20
FROM lagged
)
-- CTE 3
, streak_count AS (
SELECT *
, SUM(points_stayed_above_20) OVER (
PARTITION BY player_name ORDER BY season) AS streak_continue
FROM streak_identifier)
-- CTE 4
SELECT player_name
, COUNT(CASE WHEN pts >= 20 THEN 1 END) AS num_consecutive_seasons
, MIN(season) AS season_start
, MAX(season) AS season_end
FROM streak_count
GROUP BY player_name
ORDER BY num_consecutive_seasons DESC
Step 4 — Now we need top 10 players because currently there are 100s of players in the table. Here we can use RANK() because these ranking functions are especially useful when working with ties and we see above that having ties is so common in sports. We will use one more CTE to build an aggregate query. Also, as an experiment we will use all ranking functions such RANK(), DENSE_RANK(), ROW_NUMBER() and understand the difference between them.
-- CTE 1
WITH lagged AS (
SELECT player_name
, season
, pts AS points
, LAG(pts, 1) OVER (PARTITION BY player_name ORDER BY season) AS
points_last_season
FROM nba_player_seasons
)
-- CTE 2
, streak_identifier AS (
SELECT *
, CASE WHEN pts >= 20 AND points_last_season >= 20 THEN 0 ELSE 1 END AS
points_stayed_above_20
FROM lagged
)
-- CTE 3
, streak_count AS (
SELECT *
, SUM(points_stayed_above_20) OVER (
PARTITION BY player_name ORDER BY season) AS streak_continue
FROM streak_identifier)
-- CTE 4
, aggregated AS (
SELECT player_name
, COUNT(CASE WHEN pts >= 20 THEN 1 END) AS num_consecutive_seasons
, MIN(season) AS season_start
, MAX(season) AS season_end
FROM streak_count
GROUP BY player_name
ORDER BY num_consecutive_seasons DESC
)
-- FINAL QUERY
SELECT *
, RANK() OVER(ORDER BY num_consecutive_seasons DESC) AS rank
, DENSE_RANK() OVER(ORDER BY num_consecutive_seasons DESC) AS dense_rank
, ROW_NUMBER() OVER(ORDER BY num_consecutive_seasons DESC) AS row_nume
FROM aggregated
We see LeBron James with maximum consecutive 20 point season which is so good but we get an idea of how to use window functions to tackle a problem step-by-step.
Now, let’s discuss about ranking functions commonly used with window functions. It can observed from the result set that RANK() skips the order when there is a tie but DENSE_RANK() remembers the tie and does not order in series. ROW_NUMBER() never gives two values same and hence is often used in deduplication. Also, using RANK() may also lead to loss of information, for example, what if we wanted to see players whose rank if 3 or 8 or 9? Maybe we could analyze further but we do not have that option whereas DENSE_RANK() does not skips the order.
I am also currently learning window functions and this problem may have other ways to be tackled, I would like to learn more if you have a different technique, please do comment!