Database SQL Primer (Part 3) [ Common Table Expressions (CTEs) ]
2907
Nov 28, 2021

Common Table Expressions (CTEs)

First the "what": What is a common table expression (CTE)? A matter-of-fact answer is given in [5]:

A CTE is a named subquery that appears at the top of a query in a WITH clause, which can contain multiple CTEs separated by commas. Along with making queries more understandable, this feature also allows each CTE to refer to any other CTE defined above it in the same WITH clause.

Then the "why": Most Leetcode queries require you to go beyond what is possible using the tables as they are made available in the provided SQL Schema, especially in relation to aggregate and window functions. Some queries require the use of a derived table, namely a subquery or a CTE.

As noted in [6],

Arguably the simplest way to create a virtual table that allows you to run queries on window functions or aggregate functions is a subquery. All that's required here is to write the query that you need within parentheses and then to write a second query that uses it.

Why, then, would we want to use something other than a subquery for such purposes?

Now the "how": Consider the following Leetcode problem: 1098. Unpopular Books. Here is a solution that uses a subquery:

-- answer to LC 1098 that uses a subquery
SELECT
  B.book_id,
  B.name
FROM
  Books B
WHERE
  B.available_from < DATE_SUB('2019-06-23', INTERVAL 1 MONTH)
  AND NOT EXISTS (
    SELECT
      1
    FROM
      Orders O
    WHERE
      O.book_id = B.book_id
      AND O.dispatch_date
        BETWEEN DATE_SUB('2019-06-23', INTERVAL 1 YEAR) AND '2019-06-23'
    GROUP BY
      O.book_id
    HAVING
      SUM(O.quantity) >= 10
  );

And now a solution that uses CTEs:

-- answer to LC 1098 that uses CTEs
WITH eligible_books AS (
  SELECT
    *
  FROM
    Books B
  WHERE B.available_from < DATE_SUB('2019-06-23', INTERVAL 1 MONTH)
), eligible_orders AS (
  SELECT
    *
  FROM
    Orders O
  WHERE O.dispatch_date > DATE_SUB('2019-06-23', INTERVAL 1 YEAR)
)
SELECT
  EB.book_id, EB.name
FROM
  eligible_books EB
  LEFT JOIN eligible_orders EO ON EB.book_id = EO.book_id
GROUP BY
  EB.book_id, EB.name
HAVING
  SUM(EO.quantity) IS NULL OR SUM(EO.quantity) < 10;

Which answer is clearer? Probably the answer using CTEs. Why? Probably because CTEs give you the chance to use named subqueries in a way that makes your code not only easier to read but also easier to structure.

Indeed, as noted in [6], CTEs were intended to overcome some of the limits of subqueries, especially as it relates to enabling the use of recursion within SQL (something that will be addressed momentarily). Consider the following made up query:

WITH head_count_tab (job, HeadCount) AS (
  SELECT
    job,
    COUNT(empno)
  FROM
    emp
  GROUP BY
    job
)
SELECT
  MAX(HeadCount) AS HighestJobHeadCount
FROM
  head_count_tab;

Although this query solves a simple problem, it illustrates the essential features of a CTE. We introduce the derived table using the WITH clause, specifying the column headings in the parentheses (optional in general), and use parentheses around the derived table's query itself. If we want to add more derived tables, we can add more as long as we separate each one with a comma and provide its name before its query (the reverse of how aliasing usually works in SQL).

Because the inner queries are presented before the outer query, in many circumstances they may also be considered more readable; they make it easier to study each logical element of the query separately in order to understand the logical flow. Of course, as with all things in coding, this will vary according to circumstances, and sometimes the subquery will be more readable.

In some cases, it may even make sense to mix subqueries and CTEs; for example, consider the following answer to 2084. Drop Type 1 Orders for Customers With Type 0 Orders

WITH Type0Orders AS (
  SELECT * FROM Orders O1 WHERE O1.order_type = 0
), ValidType1Orders AS (
  SELECT * FROM Orders O2 WHERE O2.customer_id NOT IN (SELECT customer_id FROM Type0Orders) AND O2.order_type = 1
)
SELECT * FROM Type0Orders
UNION ALL
SELECT * FROM ValidType1Orders;

This should give some sense of the versatility that CTEs have to offer. Ultimately, as noted in [6]:

There's not a lot of difference between a subquery and CTE in terms of usability. Both allow for nesting or writing more complicated queries that refer to other derived tables. However, once you start nesting many subqueries, readability is lessened because the meaning of different variables is hidden in successive query layers. In contrast, because a CTE arranges each element vertically, it is easier to understand the meaning of each element.

To bring closure to the point made in the last sentence above, consider the following two problems and their answers that utilize CTEs and what those answers might look like had they simply used subqueries:

1127. User Purchase Platform
WITH spending_dates_and_platforms AS (
  SELECT DISTINCT S.spend_date, 'mobile' AS platform FROM Spending S
  UNION ALL
  SELECT DISTINCT S.spend_date, 'desktop' AS platform FROM Spending S
  UNION ALL
  SELECT DISTINCT S.spend_date, 'both' AS platform FROM Spending S
), user_spending_by_day AS (
  SELECT
    S.spend_date,
    S.user_id,
    SUM((CASE WHEN S.platform = 'mobile' THEN S.amount ELSE 0 END)) mobile_amount,
    SUM((CASE WHEN S.platform = 'desktop' THEN S.amount ELSE 0 END)) desktop_amount
  FROM
    Spending S
  GROUP BY
    S.spend_date, S.user_id
), amounts_by_date_and_platform AS (
  SELECT
    U.spend_date,
    U.user_id,
    (CASE
      WHEN U.mobile_amount > 0 AND U.desktop_amount = 0 THEN 'mobile'
      WHEN U.mobile_amount = 0 AND U.desktop_amount > 0 THEN 'desktop'
      WHEN U.mobile_amount > 0 AND U.desktop_amount > 0 THEN 'both'
    END) AS platform,
    U.mobile_amount + U.desktop_amount AS amount
  FROM
    user_spending_by_day U
)
SELECT
  SDP.spend_date,
  SDP.platform,
  IFNULL(SUM(ADP.amount),0) AS total_amount,
  IFNULL(COUNT(ADP.user_id),0) AS total_users
FROM
  spending_dates_and_platforms SDP
  LEFT JOIN amounts_by_date_and_platform ADP
    ON SDP.spend_date = ADP.spend_date AND SDP.platform = ADP.platform
GROUP BY
  SDP.spend_date, SDP.platform
ORDER BY
  SDP.spend_date, FIELD(SDP.platform, 'desktop', 'mobile', 'both');

1205. Monthly Transactions II
WITH chargeback_details AS (
  SELECT
    DATE_FORMAT(C.trans_date, '%Y-%m') AS month,
    T.country,
    'charged back' AS state,
    T.amount
  FROM
    Chargebacks C
    INNER JOIN Transactions T ON T.id = C.trans_id
), appr_trans_details AS (
  SELECT
    DATE_FORMAT(T.trans_date, '%Y-%m') AS month,
    T.country,
    T.state,
    T.amount
  FROM
    Transactions T
  WHERE
    T.state = 'approved'
), eligible_transactions AS (
  SELECT * FROM chargeback_details
  UNION ALL
  SELECT * FROM appr_trans_details
)
SELECT
  ET.month,
  ET.country,
  SUM((CASE WHEN ET.state = 'approved' THEN 1 ELSE 0 END)) AS approved_count,
  SUM((CASE WHEN ET.state = 'approved' THEN ET.amount ELSE 0 END)) AS approved_amount,
  SUM((CASE WHEN ET.state = 'charged back' THEN 1 ELSE 0 END)) AS chargeback_count,
  SUM((CASE WHEN ET.state = 'charged back' THEN ET.amount ELSE 0 END)) AS chargeback_amount
FROM
  eligible_transactions ET
GROUP BY
  ET.month, ET.country;

Comments (2)