Database SQL Primer (Part 2) [ Window Functions ]
14162
Nov 28, 2021

Window Functions

Description: A very lucid description of window functions may be found in [2]:

After the database server has completed all of the steps necessary to evaluate a query, including joining, filtering, grouping, and sorting, the result set is complete and ready to be returned to the caller. Imagine if you could pause the query execution at this point and take a walk through the result set while it is still held in memory; what types of analysis might you want to do? If your result set contains sales data, perhaps you might want to generate rankings for salespeople or regions, or calculate percentage differences between one time period and another. If you are generating results for a financial report, perhaps you would like to calculate subtotals for each report section, and a grand total for the final section. Using analytic functions, you can do all of these things and more.

We may find a somewhat more dry description in [3]:

Once you understand the concept of grouping and using aggregates in SQL, understanding window functions is easy. Window functions, like aggregate functions, perform an aggregation on a defined set (a group) of rows, but rather than returning one value per group, window functions can return multiple values for each group. The group of rows to perform the aggregation on is the window.

Window functions certainly sound cool and promising. But given the fuss at the beginning of this study guide about query execution order, it makes sense we might ask the same question of window functions: When are window functions executed or their actions performed? We may find our answer in [4]:

It is important to note that window functions are performed as the last step in SQL processing prior to the ORDER BY clause.

This will now make sense if we look back at the mental model previously discussed for how to think about SELECT queries (relevant portion now in bold):

  1. FROM/JOIN (and all associated ON conditions)
  2. WHERE
  3. GROUP BY
  4. HAVING
  5. SELECT (including window functions)
  6. DISTINCT
  7. ORDER BY
  8. LIMIT/OFFSET

Example: Consider what is currently the top voted answer to problem 178. Rank Scores by StefanPochmann (he actually provides several different answers):

-- Answer 1 (uses correlated subquery)
SELECT
  Score,
  (SELECT COUNT(DISTINCT Score) FROM Scores WHERE Score >= S.Score) 'Rank'
FROM Scores S
ORDER BY Score DESC

-- Answer 2 (nested subquery)
SELECT
  Score,
  (SELECT COUNT(*) FROM (SELECT DISTINCT Score S FROM Scores) tmp WHERE S >= Score) 'Rank'
FROM Scores
ORDER BY Score DESC

-- Answer 3 (non-equi join)
SELECT S.Score, COUNT(DISTINCT t.score) 'Rank'
FROM Scores S JOIN Scores T ON S.Score <= T.score
GROUP BY S.Id
ORDER BY S.Score DESC

These are all great, perfectly legitimate answers. Each one has its own degree of sophistication. But I think nearly everyone would agree on thing: each answer is rather complex. Is there an easier way to craft an answer for this problem? Window functions to the rescue!

The DENSE_RANK() window function makes quick work of this problem:

SELECT
  S.Score,
  DENSE_RANK() OVER (ORDER BY S.Score DESC) AS 'Rank'
FROM
  Scores S;

Of course, not every problem will be so neatly solved by simply using a window function, but this example should give you an idea of just how powerful window functions can be.

Window Function Table Reference: The table below describes nonaggregate window functions that, for each row from a query, perform a calculation using rows related to that row. Most aggregate functions also can be used as window functions, namely AVG(), COUNT(), MAX(), MIN(), and SUM(). The "Docs" link for each function points to that function's reference in the official MySQL docs while the "Tutorial" link for each function points to that function's reference on the MySQL tutorial site. See the official MySQL docs for more detailed information.

FunctionDocsTutorialDescription
CUME_DIST()DocsTutorialCumulative distribution value
DENSE_RANK()DocsTutorialRank of current row within its partition, without gaps
FIRST_VALUE()DocsTutorialValue of argument from first row of window frame
LAG()DocsTutorialValue of argument from row lagging current row within partition
LAST_VALUE()DocsTutorialValue of argument from last row of window frame
LEAD()DocsTutorialValue of argument from row leading current row within partition
NTH_VALUE()DocsTutorialValue of argument from N-th row of window frame
NTILE()DocsTutorialBucket number of current row within its partition.
PERCENT_RANK()DocsTutorialPercentage rank value
RANK()DocsTutorialRank of current row within its partition, with gaps
ROW_NUMBER()DocsTutorialNumber of current row within its partition

Window Function Problem Reference: A variety of problems appear below for which a window function was used as part of the answer. The window function being used/emphasized is highlighted as a bullet point, the problem may be accessed by clicking on the problem title, and a solution to the problem (to highlight how the window function is being used) may be seen by clicking to expand the widget.

The listing below is not comprehensive. Many of the solutions revealed on click also do not comply with the style guidelines highlighted earlier in this study guide (remember, these were all written over the span of 2 weeks). When time permits, I may revisit these solutions and try to clean them up.

  • AVG()

    615. Average Salary: Departments VS Company
    SELECT
      DISTINCT DATE_FORMAT(X.pay_date, "%Y-%m") AS pay_month,
      X.department_id,
      (
        CASE
          WHEN X.sal_comp = 0 THEN "same"
          WHEN X.sal_comp < 0 THEN "lower"
          WHEN X.sal_comp > 0 THEN "higher"
        END
      ) AS comparison
    FROM
      (
        SELECT
          S.pay_date,
          E.department_id,
          AVG(S.amount) OVER (PARTITION BY MONTH(S.pay_date), E.department_id)
            - AVG(S.amount) OVER (PARTITION BY MONTH(S.pay_date)) AS sal_comp
        FROM
          salary S
          INNER JOIN employee E ON S.employee_id = E.employee_id
      ) X;

  • COUNT()

    1303. Find the Team Size
    SELECT
      E.employee_id,
      COUNT(E.team_id) OVER(PARTITION BY E.team_id) AS team_size
    FROM
      Employee E
    ORDER BY
      E.employee_id;

  • DENSE_RANK()

    178. Rank Scores
    SELECT
      S.Score,
      DENSE_RANK() OVER (ORDER BY S.Score DESC) AS 'Rank'
    FROM
      Scores S;

    184. Department Highest Salary
    SELECT
      X.Department,
      X.Employee,
      X.Salary
    FROM
      (
        SELECT
          D.Name AS Department,
          E.Name AS Employee,
          E.Salary,
          DENSE_RANK() OVER (
            PARTITION BY D.Name
            ORDER BY
              E.Salary DESC
          ) AS dep_sal_rank
        FROM
          Employee E
          INNER JOIN Department D ON E.DepartmentId = D.Id
      ) X
    WHERE
      X.dep_sal_rank = 1;

    185. Department Top Three Salaries
    SELECT
      X.Department,
      X.Employee,
      X.Salary
    FROM
      (
        SELECT
          D.Name AS Department,
          E.Name AS Employee,
          E.Salary,
          DENSE_RANK() OVER (
            PARTITION BY D.Name
            ORDER BY
              E.Salary DESC
          ) AS dep_sal_rank
        FROM
          Employee E
          INNER JOIN Department D ON E.DepartmentId = D.Id
      ) X
    WHERE
      X.dep_sal_rank <= 3;

    1875. Group Employees of the Same Salary
    WITH non_distinct_salaries AS (
      SELECT salary FROM Employees GROUP BY salary HAVING COUNT(*) > 1
    )
    SELECT
      E.employee_id,
      E.name,
      E.salary,
      DENSE_RANK() OVER(ORDER BY E.salary) AS team_id
    FROM
      Employees E
    WHERE
      E.salary IN (SELECT * FROM non_distinct_salaries)
    ORDER BY
      team_id, E.employee_id;

  • LAG()

    1709. Biggest Window Between Visits
    WITH visit_gaps AS (
      SELECT
        UV.user_id,
        DATEDIFF(
          LAG(UV.visit_date, 1, '2021-01-01') OVER(
            PARTITION BY UV.user_id ORDER BY UV.visit_date DESC
          ),
          UV.visit_date
        )
        AS visit_gap
      FROM
        UserVisits UV
    )
    SELECT
      V.user_id,
      MAX(V.visit_gap) AS biggest_window
    FROM
      visit_gaps V
    GROUP BY
      V.user_id
    ORDER BY
      V.user_id;

  • LEAD()

    1811. Find Interview Candidates
    WITH medal_winners AS (
      SELECT
        U.name,
        U.mail,
        (CASE WHEN
          LEAD(C.contest_id, 2) OVER(PARTITION BY U.user_id ORDER BY C.contest_id)
            - C.contest_id = 2 THEN 1 ELSE 0
        END) AS three_consec,
        (CASE WHEN U.user_id = C.gold_medal THEN 1 ELSE 0 END) AS gold_winner
      FROM
        Users U
        INNER JOIN Contests C ON U.user_id IN (C.gold_medal, C.silver_medal, C.bronze_medal)
    )
    SELECT
      MW.name,
      MW.mail
    FROM
      medal_winners MW
    GROUP BY
      MW.name, MW.mail
    HAVING
      SUM(MW.gold_winner) >= 3
      OR SUM(MW.three_consec) >= 1;

  • MAX()

    1084. Sales Analysis III
    SELECT DISTINCT
      X.product_id,
      X.product_name
    FROM
      (
        SELECT
          P.product_id,
          P.product_name,
          MIN(S.sale_date) OVER(W) AS min_sale_date,
          MAX(S.sale_date) OVER(W) AS max_sale_date
        FROM
          Sales S
          INNER JOIN Product P on S.product_id = P.product_id
        WINDOW W AS (PARTITION BY P.product_id)
      ) X
    WHERE
      X.min_sale_date >= '2019-01-01'
        AND X.max_sale_date <= '2019-03-31';

    1867. Orders With Maximum Quantity Above Average
    WITH order_aggregates AS (
      SELECT
        O.order_id,
        AVG(O.quantity) AS avg_quant,
        MAX(O.quantity) AS max_quant,
        MAX(AVG(O.quantity)) OVER() AS max_avg_quant
      FROM
        OrdersDetails O
      GROUP BY
        O.order_id
    )
    SELECT
      OA.order_id
    FROM
      order_aggregates OA
    WHERE
      OA.max_quant > OA.max_avg_quant;

  • MIN()

    1084. Sales Analysis III
    SELECT DISTINCT
      X.product_id,
      X.product_name
    FROM
      (
        SELECT
          P.product_id,
          P.product_name,
          MIN(S.sale_date) OVER(W) AS min_sale_date,
          MAX(S.sale_date) OVER(W) AS max_sale_date
        FROM
          Sales S
          INNER JOIN Product P on S.product_id = P.product_id
        WINDOW W AS (PARTITION BY P.product_id)
      ) X
    WHERE
      X.min_sale_date >= '2019-01-01'
        AND X.max_sale_date <= '2019-03-31';

  • RANK()

    512. Game Play Analysis II
    SELECT
      X.player_id,
      X.device_id
    FROM
      (
        SELECT
          A.*,
          RANK() OVER (
            PARTITION BY A.player_id
            ORDER BY
              A.event_date
          ) device_login_order
        FROM
          Activity A
      ) X
    WHERE
      X.device_login_order = 1;

    586. Customer Placing the Largest Number of Orders
    WITH customer_order_totals AS (
      SELECT
        O.customer_number,
        COUNT(*) AS total_orders
      FROM
        orders O
      GROUP BY
        O.customer_number
    ), ranked_customer_orders AS (
      SELECT
        C.customer_number,
        RANK() OVER(ORDER BY C.total_orders DESC) AS ranking
      FROM
        customer_order_totals C
    )
    SELECT
      R.customer_number
    FROM
      ranked_customer_orders R
    WHERE
      R.ranking = 1;

    1070. Product Sales Analysis III
    WITH ranked_sales_by_year AS (
      SELECT
        S.*,
        RANK() OVER(
          PARTITION BY S.product_id ORDER BY S.year
        ) year_ranking
      FROM
        Sales S
    )
    SELECT
      R.product_id,
      R.year AS first_year,
      R.quantity,
      R.price
    FROM
      ranked_sales_by_year R
    WHERE
      R.year_ranking = 1;

    1076. Project Employees II
    SELECT
      X.project_id
    FROM
      (
        SELECT
          P.project_id,
          RANK() OVER(ORDER BY COUNT(*) DESC) AS project_ranking
        FROM
          Project P
          INNER JOIN Employee E ON P.employee_id = E.employee_id
        GROUP BY
          P.project_id
      ) X
    WHERE
      X.project_ranking = 1;

    1077. Project Employees III
    SELECT
      X.project_id,
      X.employee_id
    FROM
      (
        SELECT
          P.project_id,
          E.employee_id,
          RANK() OVER(
            PARTITION BY P.project_id
            ORDER BY
              E.experience_years DESC
          ) AS years_rank
        FROM
          Project P
          INNER JOIN Employee E on P.employee_id = E.employee_id
      ) X
    WHERE
      X.years_rank = 1;

    1082. Sales Analysis I
    SELECT
      X.seller_id
    FROM
      (
        SELECT
          S.seller_id,
          RANK() OVER(ORDER BY SUM(S.price) DESC) seller_ranking
        FROM
          Sales S
        GROUP BY
          S.seller_id
      ) X
    WHERE
      X.seller_ranking = 1;

    1112. Highest Grade For Each Student
    WITH student_rankings AS (
      SELECT
        E.student_id,
        E.course_id,
        E.grade,
        RANK() OVER(
          PARTITION BY E.student_id 
          ORDER BY E.grade DESC, E.course_id
        ) AS ranking
      FROM
        Enrollments E
    )
    SELECT
      S.student_id,
      S.course_id,
      S.grade
    FROM
      student_rankings S
    WHERE
      S.ranking = 1
    ORDER BY
      S.student_id;

  • ROW_NUMBER()

    180. Consecutive Numbers
    WITH grps AS (
      SELECT
        Num,
        ROW_NUMBER() OVER (
          ORDER BY
            Id
        ) - ROW_NUMBER() OVER (
          PARTITION BY Num
          ORDER BY
            Id
        ) AS grp
      FROM
        Logs
    )
    SELECT
      DISTINCT Num AS ConsecutiveNums
    FROM
      grps
    GROUP BY
      grp,
      Num
    HAVING
      COUNT(grp) >= 3
    ORDER BY
      Num;

    601. Human Traffic of Stadium
    WITH grps AS (
      SELECT
        S.*,
        S.id - ROW_NUMBER() OVER (ORDER BY S.id) AS grp
      FROM
        stadium S
      WHERE
        S.people >= 100
    ), grp_freqs AS (
      SELECT
        G.*,
        COUNT(*) OVER(PARTITION BY G.grp) AS grp_freq
      FROM
        grps G
    )
    SELECT
      GF.id,
      GF.visit_date,
      GF.people
    FROM
      grp_freqs GF
    WHERE
      GF.grp_freq >= 3
    ORDER BY
      GF.visit_date;

    618. Students Report By Geography
    SELECT
      MAX((CASE WHEN X.continent = 'America' THEN X.name ELSE NULL END)) AS "America",
      MAX((CASE WHEN X.continent = 'Asia' THEN X.name ELSE NULL END)) AS "Asia",
      MAX((CASE WHEN X.continent = 'Europe' THEN X.name ELSE NULL END)) AS "Europe"
    FROM
      (
        SELECT
          S.continent,
          S.name,
          ROW_NUMBER() OVER(PARTITION BY S.continent ORDER BY S.name) AS rn
        FROM
          student S
      ) X
    GROUP BY
      X.rn;

  • SUM()

    534. Game Play Analysis III
    SELECT
      A.player_id,
      A.event_date,
      SUM(A.games_played) OVER (
        PARTITION BY 
          A.player_id
        ORDER BY
          A.event_date
      ) AS games_played_so_far
    FROM
      Activity A;

    579. Find Cumulative Salary of an Employee
    WITH RECURSIVE months_range AS (
      SELECT 1 AS month_num
      UNION ALL
      SELECT month_num + 1 FROM months_range WHERE month_num < 12
    ), emps_and_months AS (
      SELECT
        *
      FROM
        (SELECT DISTINCT Id FROM Employee) X, months_range
    ), cum_sal_info AS (
      SELECT
        EM.Id,
        EM.month_num AS Month,
        E.Salary,
        SUM(E.Salary) OVER (
          PARTITION BY EM.Id
          ORDER BY
            EM.month_num ROWS BETWEEN 2 PRECEDING
            AND 0 FOLLOWING
        ) AS cum_sal
      FROM
        emps_and_months EM
        LEFT JOIN Employee E
          ON EM.Id = E.Id AND EM.month_num = E.Month
    )
    SELECT
      C.Id,
      C.Month,
      C.cum_sal AS Salary
    FROM
      cum_sal_info C
      INNER JOIN Employee E1
      ON E1.Id = C.Id
        AND E1.Month = C.Month
        AND C.Month < (
          SELECT
            MAX(E2.Month)
          FROM
            Employee E2
          WHERE
            E2.Id = E1.Id
        )
    ORDER BY
      C.Id,
      C.Month DESC;

    1308. Running Total for Different Genders
    SELECT
      S.gender,
      S.day,
      SUM(S.score_points) OVER(
        PARTITION BY S.gender ORDER BY S.day
      ) AS total
    FROM
      Scores S
    ORDER BY
      S.gender, S.day;

Comments (10)