The discussion in Part 3 clearly outlines how CTEs can be powerful agents of clarity and structure, but the discussion fails to highlight the key reason why CTEs were introduced into SQL in the first place: recursion. Enabling recursion within SQL is the primary reason for the existence of CTEs. So how do recursive CTEs work and why/when would we want to use them? To answer this question effectively, it may help to first look at some documentation (we will use Postgres' documentation since it is fairly easy to understand), then some examples, and then finally some Leetcode problems where this construct proves to be quite valuable (the PostgreSQL Exercises resource mentioned previously also has helpful material on recursive queries).
1. Documentation: Reading what the official documentation has to say about WITH Queries is a good investment of time, especially the portion about recursive query evaluation (as Postgres points out, the WITH RECURSIVE process is iteration not recursion, but RECURSIVE is the terminology chosen by the SQL standards committee.):
The general form of a recursive
WITHquery is always a non-recursive term, thenUNION(orUNION ALL), then a recursive term, where only the recursive term can contain a reference to the query's own output. Such a query is executed as follows:
- Evaluate the non-recursive term. For
UNION(but notUNION ALL), discard duplicate rows. Include all remaining rows in the result of the recursive query, and also place them in a temporary working table.- So long as the working table is not empty, repeat these steps:
- Evaluate the recursive term, substituting the current contents of the working table for the recursive self-reference. For
UNION(but notUNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.- Replace the contents of the working table with the contents of the intermediate table, then empty the intermediate table.
In more simple syntactical terms, here's what the process outlined above looks like:
WITH RECURSIVE cte_name AS(
CTE_auxiliary_query_1 -- non-recursive term
UNION [ALL]
CTE_auxiliary_query_2 -- recursive term
) CTE_primary_query;WITH RECURSIVE cte_name AS (...): The RECURSIVE keyword in WITH RECURSIVE tells our database that the CTE we are building is not like the normal CTE(s) that can be built using only the WITH keyword--our CTE will be built in an iterative fashion instead. cte_name is the label or name we are assigning to the CTE built in the iterative process (this name is typically used or referred to in CTE_auxiliary_query_2 and CTE_primary_query). The AS keyword simply denotes how cte_name is to be built throughout the iterative process (i.e., the ... part inside the parentheses).CTE_auxiliary_query_1: Evaluate the non-recursive term. When UNION is used, duplicate rows will be discarded; when UNION ALL is used, duplicate rows will be kept. Include all remaining rows in the result of the recursive query (i.e., cte_name being the result of the recursive query), and also place them in a temporary working table. This working table is what will be used or referred to in CTE_auxiliary_query_2 (i.e., the so-called "recursive term") to really kick off the iterative process.CTE_auxiliary_query_2: Evaluate the recursive term, substituting the current contents of the working table for the recursive self-reference; that is, the recursive self-reference will always be cte_name, but the working table depends on where we are in the iterative process.
CTE_auxiliary_query_1, if any, constitutes the working table and we can refer to this working table as cte_name in CTE_auxiliary_query_2. Note that the final result of the recursive query, which we ultimately refer to as cte_name in CTE_primary_query once the iterative process has finished, is different from the cte_name referred to in CTE_auxiliary_query_2 during each iteration. (This will become clearer momentarily by means of some examples.)cte_name referred to in CTE_auxiliary_query_2 refers to the output of the prior CTE_auxiliary_query_2 query; that is, for every iteration except the first, the temporary working table is the query output of CTE_auxiliary_query_2 for the previous iteration. Once the recursive term CTE_auxiliary_query_2 finishes executing, the following rough sequence of operations takes place:
CTE_auxiliary_query_2 has an empty query output, then the query terminates and control is passed to CTE_primary_query.CTE_auxiliary_query_2 has a non-empty query output, then the following sequence of operations takes place:
CTE_auxiliary_query_2 is appended to the total result set for the WITH RECURSIVE query (i.e., the cte_name we refer to in CTE_primary_query once the iterative process has finished). Duplicate rows will be kept when using UNION ALL but discarded when using only UNION.CTE_auxiliary_query_2 is placed in a temporary intermediate table.CTE_auxiliary_query_2 are replaced by the contents of the intermediate table referred to above. The intermediate table is then emptied so we can repeat this whole process again (i.e., the "Not beginning" process).CTE_primary_query: This is where WITH RECURSIVE really pays off, of course. The table we built in an iterative fashion, cte_name, is now available in its entirety for us to query from however we like. This could be a simple SELECT * FROM cte_name or it could be a much more complicated query involving joins and/or whatever else we might want to throw in there.2. Examples: Let's now consider some examples (executed using MySQL) to make clear all of the abstract details outlined above:
Try executing the following query:
WITH RECURSIVE nums_consecutive AS (
SELECT 1 AS num
UNION ALL
SELECT num + 1 FROM nums_consecutive WHERE num < 10
) SELECT * FROM nums_consecutive;This query yields the following result set:
+------+
| num |
+------+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
| 10 |
+------+How does this query work exactly? Here's a detailed breakdown using the previously discussed abstract process
WITH RECURSIVE cte_name AS(
CTE_auxiliary_query_1 -- non-recursive term
UNION [ALL]
CTE_auxiliary_query_2 -- recursive term
) CTE_primary_query;but this time we will have actual data to work with instead of just vague concepts:
WITH RECURSIVE cte_name AS (...): We specify via WITH RECURSIVE that we will not just be building a CTE but one that we will build iteratively. Note that we refer to nums_consecutive in CTE_auxiliary_query_2 (i.e., the recursive term):
SELECT num + 1 FROM nums_consecutive WHERE num < 10And we also refer to nums_consecutive in CTE_primary_query once the iterative process has finished and the final CTE has been built:
SELECT * FROM nums_consecutive;CTE_auxiliary_query_1: This is the non-recursive term and its result set gives us our first working table:
SELECT 1 AS numgives us
+-----+
| num |
+-----+
| 1 |
+-----+as our first working table
CTE_auxiliary_query_2: Since the query output from the non-recursive term was non-empty, the iterative process begins and the working table seen above can now be referred to as nums_consecutive in CTE_auxiliary_query_2:
SELECT num + 1 FROM nums_consecutive WHERE num < 10Since nums_consecutive above really refers to the working table
+-----+
| num |
+-----+
| 1 |
+-----+it's not hard to see why the result set of CTE_auxiliary_query_2 (i.e., SELECT num + 1 FROM nums_consecutive WHERE num < 10) for this iteration is simply
+-----+
| num |
+-----+
| 2 |
+-----+Pay special attention to the fact that the result set here is not
+-----+
| num |
+-----+
| 1 |
| 2 |
+-----+but simply
+-----+
| num |
+-----+
| 2 |
+-----+That is, the total result set being iteratively built right now may be
+-----+
| num |
+-----+
| 1 |
| 2 |
+-----+but this is not the current working table. The current working table is
+-----+
| num |
+-----+
| 1 |
+-----+and the temporary intermediate table is
+-----+
| num |
+-----+
| 2 |
+-----+By means of UNION ALL, the total result set being iteratively built is
+-----+
| num |
+-----+
| 1 |
| 2 |
+-----+but we have a way to go before the iterative process terminates. As indicated in the descriptions directly before this example began, we now replace the contents of the working table
+-----+
| num |
+-----+
| 1 |
+-----+with the contents of the intermediate table
+-----+
| num |
+-----+
| 2 |
+-----+and then we empty the intermediate table so it can be
+-----+
| num |
+-----+
| |
+-----+for the pending iteration. For the next iteration, with
+-----+
| num |
+-----+
| 2 |
+-----+as the working table, we see that our CTE_auxiliary_query_2, SELECT num + 1 FROM nums_consecutive WHERE num < 10, will give us
+-----+
| num |
+-----+
| 3 |
+-----+as the new temporary intermediate table, which will then replace the working table
+-----+
| num |
+-----+
| 2 |
+-----+and so on and so forth. This process will continue for some time until our working table is
+-----+
| num |
+-----+
| 9 |
+-----+at which point our CTE_auxiliary_query_2, SELECT num + 1 FROM nums_consecutive WHERE num < 10, will give us
+-----+
| num |
+-----+
| 10 |
+-----+as the new temporary intermediate table. This table replaces
+-----+
| num |
+-----+
| 9 |
+-----+as the working table and when our recursive term is executed again, that is, when SELECT num + 1 FROM nums_consecutive WHERE num < 10 runs against the working table
+-----+
| num |
+-----+
| 10 |
+-----+we actually come up empty since the WHERE num < 10 condition is not satisfied. Now both the working table and temporary intermediate table are empty and there's nothing left for the recursive term to run a query against; hence, the query terminates and control is passed to CTE_primary_query (i.e., SELECT * FROM nums_consecutive; in this example).
CTE_primary_query: The table we built in an iterative fashion, nums_consecutive, is now available in its entirety for us to run queries against as we please. Since we want everything from this small table in this example, we simply execute SELECT * FROM nums_consecutive; and this gives us the result set of having all the working tables UNIONed ALL together:
+------+
| num |
+------+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
| 10 |
+------+For this example, note how the working table has just a single row in each step, and it takes on the values from 1 through 10 in successive steps. In the 10th step, there is no output because of the WHERE clause, and so the query terminates.
Consider the query
WITH RECURSIVE seq_nums AS (
SELECT
5 AS num,
0 AS iteration
UNION ALL
SELECT
(2 * num + num),
iteration + 1
FROM
seq_nums
WHERE
iteration < 5
) SELECT * FROM seq_nums;with
+------+-----------+
| num | iteration |
+------+-----------+
| 5 | 0 |
| 15 | 1 |
| 45 | 2 |
| 135 | 3 |
| 405 | 4 |
| 1215 | 5 |
+------+-----------+as its result set. Note how, just as in the previous example, the working table has just a single row in each step, and it takes on the values 5, 15, 45, 135, and 405 in successive steps. Each iteration value simply denotes which application of CTE_auxiliary_query_2 has been used to generate the next row (the first row indicates the initial application of the non-recursive term CTE_auxiliary_query_1 as well as the first application of CTE_auxiliary_query_2 to generate the next row; the final row indicates the 5th application of CTE_auxiliary_query_2 which has an empty query output, thus terminating the query).
A classic question from the school days: Find the sum of the first 100 consecutive positive integers. But this time use SQL implementing a WITH RECURSIVE solution! We can use the work we did in the first example to generate numbers 1 through 100, inclusive, and then our CTE_primary_query can involve an aggregate function instead of just selecting everything from the CTE we just generated:
WITH RECURSIVE nums_consecutive AS (
SELECT 1 AS num
UNION ALL
SELECT num + 1 FROM nums_consecutive WHERE num < 100
) SELECT SUM(num) AS total_sum FROM nums_consecutive;Result set:
+-----------+
| total_sum |
+-----------+
| 5050 |
+-----------+The following is a somewhat more complicated usage of WITH RECURSIVE:
WITH RECURSIVE fib_table (FibBuilder, FibNum, FibIndex) AS (
SELECT 1, 0, 0
UNION ALL
SELECT FibNum, FibNum + FibBuilder, FibIndex + 1 FROM fib_table WHERE FibIndex < 15
)
SELECT FibBuilder, FibNum, FibIndex FROM fib_table;The result set:
+------------+--------+----------+
| FibBuilder | FibNum | FibIndex |
+------------+--------+----------+
| 1 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 1 | 2 |
| 1 | 2 | 3 |
| 2 | 3 | 4 |
| 3 | 5 | 5 |
| 5 | 8 | 6 |
| 8 | 13 | 7 |
| 13 | 21 | 8 |
| 21 | 34 | 9 |
| 34 | 55 | 10 |
| 55 | 89 | 11 |
| 89 | 144 | 12 |
| 144 | 233 | 13 |
| 233 | 377 | 14 |
| 377 | 610 | 15 |
+------------+--------+----------+3. Leetcode problems where WITH RECURSIVE can be utilized (not exhaustive): Click the problem title to directly visit problem or beside problem title to reveal at least one possible solution to the problem at hand using WITH RECURSIVE.
WITH RECURSIVE nums_full_list AS (
SELECT
N.Number, 1 AS num_count, N.Frequency
FROM
Numbers N
UNION ALL
SELECT
NFL.Number, NFL.num_count + 1, NFL.Frequency
FROM
nums_full_list NFL
WHERE NFL.num_count < NFL.Frequency
), ordered_nums AS (
SELECT
ROW_NUMBER() OVER (ORDER BY NFL.Number) AS num_order,
NFL.Number
FROM
nums_full_list NFL
), eligible_medians AS (
SELECT
O.Number
FROM
ordered_nums O
WHERE
O.num_order BETWEEN
(SELECT FLOOR(AVG(num_order)) FROM ordered_nums)
AND (SELECT CEIL(AVG(num_order)) FROM ordered_nums)
)
SELECT
AVG(E.Number) AS median
FROM
eligible_medians E;WITH RECURSIVE months_range AS (
SELECT 1 AS month_num
UNION ALL
SELECT month_num + 1 FROM months_range WHERE month_num < 12
), emps_and_months AS (
SELECT
*
FROM
(SELECT DISTINCT Id FROM Employee) X, months_range
), cum_sal_info AS (
SELECT
EM.Id,
EM.month_num AS Month,
E.Salary,
SUM(E.Salary) OVER (
PARTITION BY EM.Id
ORDER BY
EM.month_num ROWS BETWEEN 2 PRECEDING
AND 0 FOLLOWING
) AS cum_sal
FROM
emps_and_months EM
LEFT JOIN Employee E
ON EM.Id = E.Id AND EM.month_num = E.Month
)
SELECT
C.Id,
C.Month,
C.cum_sal AS Salary
FROM
cum_sal_info C
INNER JOIN Employee E1
ON E1.Id = C.Id
AND E1.Month = C.Month
AND C.Month < (
SELECT
MAX(E2.Month)
FROM
Employee E2
WHERE
E2.Id = E1.Id
)
ORDER BY
C.Id,
C.Month DESC;WITH RECURSIVE boss_chain AS (
SELECT
E1.employee_id
FROM
Employees E1
WHERE
E1.employee_id != 1 AND E1.manager_id = 1
UNION ALL
SELECT
E.employee_id
FROM
boss_chain B
INNER JOIN Employees E
ON B.employee_id = E.manager_id
)
SELECT DISTINCT
B.employee_id
FROM
boss_chain B;WITH RECURSIVE trans_counts AS (
SELECT
V.user_id,
V.visit_date,
COUNT(T.transaction_date) AS trans_count
FROM
Visits V
LEFT JOIN Transactions T
ON V.visit_date = T.transaction_date
AND V.user_id = T.user_id
GROUP BY
V.user_id, V.visit_date
), trans_count_vals AS (
SELECT MAX(TC.trans_count) as trans_id FROM trans_counts TC
UNION ALL
SELECT trans_id - 1 FROM trans_count_vals WHERE trans_id > 0
)
SELECT
TCV.trans_id AS transactions_count,
COUNT(TC.trans_count) AS visits_count
FROM
trans_count_vals TCV
LEFT JOIN trans_counts TC ON TC.trans_count = TCV.trans_id
GROUP BY
TC.trans_count, TCV.trans_id
ORDER BY
transactions_count;WITH RECURSIVE min_max_dates AS (
SELECT
MIN(S.period_start) min_period_date,
MAX(S.period_end) max_period_date
FROM Sales S
), period_dates AS (
SELECT
min_period_date AS period_date
FROM
min_max_dates
UNION ALL
SELECT
DATE_ADD(period_date, INTERVAL 1 day)
FROM
period_dates
WHERE
period_date <= (SELECT max_period_date FROM min_max_dates)
)
SELECT
S.product_id,
P.product_name,
LEFT(PD.period_date, 4) AS report_year,
COUNT(PD.period_date) * S.average_daily_sales AS total_amount
FROM
Sales S
INNER JOIN Product P
ON P.product_id = S.product_id
INNER JOIN period_dates PD
ON PD.period_date BETWEEN S.period_start AND S.period_end
GROUP BY
S.product_id, P.product_name, report_year, S.average_daily_sales
ORDER BY
S.product_id, report_year;WITH RECURSIVE possible_id_values AS (
SELECT MAX(C.customer_id) AS id_val FROM Customers C
UNION ALL
SELECT id_val - 1 FROM possible_id_values WHERE id_val > 1
)
SELECT
P.id_val AS ids
FROM
possible_id_values P
WHERE
P.id_val NOT IN (SELECT C2.customer_id FROM Customers C2)
ORDER BY
P.id_val;WITH RECURSIVE months AS (
SELECT 1 AS month, '2020-01-31' AS month_end
UNION ALL
SELECT month + 1, LAST_DAY(DATE_ADD(month_end, INTERVAL 1 MONTH)) FROM months WHERE month < 12
), active_drivers_by_month AS (
SELECT
M.month,
COUNT(D.driver_id) AS num_drivers
FROM
months M
LEFT JOIN Drivers D ON D.join_date <= M.month_end
GROUP BY
M.month
)
SELECT
M.month,
AD.num_drivers AS active_drivers,
COUNT(R.requested_at) AS accepted_rides
FROM
(SELECT ride_id, requested_at FROM Rides WHERE YEAR(requested_at) = 2020) R
INNER JOIN AcceptedRides AR ON AR.ride_id = R.ride_id
RIGHT JOIN months M ON M.month = MONTH(R.requested_at)
INNER JOIN active_drivers_by_month AD ON M.month = AD.month
GROUP BY
M.month
ORDER BY
M.month;WITH RECURSIVE months AS (
SELECT 1 AS month, '2020-01-31' AS month_end
UNION ALL
SELECT month + 1, LAST_DAY(DATE_ADD(month_end, INTERVAL 1 MONTH)) FROM months WHERE month < 12
), active_drivers_by_month AS (
SELECT
M.month,
COUNT(D.driver_id) AS num_drivers
FROM
months M
LEFT JOIN Drivers D ON D.join_date <= M.month_end
GROUP BY
M.month
), month_stats AS (
SELECT
M.month,
COUNT(DISTINCT AR.driver_id) AS accepting_drivers,
AD.num_drivers AS active_drivers,
COUNT(R.requested_at) AS accepted_rides
FROM
(SELECT ride_id, requested_at FROM Rides WHERE YEAR(requested_at) = 2020) R
INNER JOIN AcceptedRides AR ON AR.ride_id = R.ride_id
RIGHT JOIN months M ON M.month = MONTH(R.requested_at)
INNER JOIN active_drivers_by_month AD ON M.month = AD.month
GROUP BY
M.month
ORDER BY
M.month
)
SELECT
MS.month,
(CASE
WHEN MS.accepted_rides = 0 THEN 0
ELSE ROUND(100 * MS.accepting_drivers / MS.active_drivers, 2)
END) AS working_percentage
FROM
month_stats MS
ORDER BY
MS.month;WITH RECURSIVE months AS (
SELECT 1 AS month, '2020-01-31' AS month_end
UNION ALL
SELECT month + 1, LAST_DAY(DATE_ADD(month_end, INTERVAL 1 MONTH)) FROM months WHERE month < 12
), active_drivers_by_month AS (
SELECT
M.month,
COUNT(D.driver_id) AS num_drivers
FROM
months M
LEFT JOIN Drivers D ON D.join_date <= M.month_end
GROUP BY
M.month
), month_stats AS (
SELECT
M.month,
SUM(IFNULL(AR.ride_distance,0)) AS ride_distance,
SUM(IFNULL(AR.ride_duration,0)) AS ride_duration
FROM
(SELECT ride_id, requested_at FROM Rides WHERE YEAR(requested_at) = 2020) R
INNER JOIN AcceptedRides AR ON AR.ride_id = R.ride_id
RIGHT JOIN months M ON M.month = MONTH(R.requested_at)
INNER JOIN active_drivers_by_month AD ON M.month = AD.month
GROUP BY
M.month
)
SELECT
MS.month,
ROUND(AVG(MS.ride_distance) OVER(ORDER BY MS.month ROWS BETWEEN 0 PRECEDING AND 2 FOLLOWING),2)
AS average_ride_distance,
ROUND(AVG(MS.ride_duration) OVER(ORDER BY MS.month ROWS BETWEEN 0 PRECEDING AND 2 FOLLOWING),2)
AS average_ride_duration
FROM
month_stats MS
ORDER BY
MS.month
LIMIT 10;WITH RECURSIVE subtask_listing AS (
SELECT T.task_id, T.subtasks_count AS subtask_id FROM Tasks T
UNION ALL
SELECT SL.task_id, SL.subtask_id - 1
FROM subtask_listing SL
WHERE SL.subtask_id > 1
)
SELECT
*
FROM
subtask_listing SL
WHERE
NOT EXISTS (
SELECT
1
FROM
Executed E
WHERE
SL.task_id = E.task_id
AND SL.subtask_id = E.subtask_id
);WITH RECURSIVE activity_months AS (
SELECT
T.account_id,
DATE_FORMAT(MAX(T.day), '%Y-%m-01') AS activity_month,
DATE_FORMAT(MIN(T.day), '%Y-%m-01') AS min_date
FROM
Transactions T GROUP BY T.account_id
UNION ALL
SELECT
AD.account_id,
DATE_SUB(AD.activity_month, INTERVAL 1 MONTH), min_date
FROM activity_months AD WHERE activity_month > min_date
) SELECT * FROM activity_months ORDER BY account_id, activity_month;