Database SQL Primer (Part 1) [ Overview | Query Execution Order

Database SQL Primer (Part 1) [ Overview | Query Execution Order | References ]

11835

Overview

Note: The four-part series that follows details my journey and helpful things I discovered along the way in solving all of the Leetcode database problems over roughly 2 weeks in July 2021. This was originally only 1 post, but you will quickly see why I ran into a content length issue (didn't know Leetcode had that for its discussion posts!). The references included at the end apply to all four parts.

Personal background: I originally had somewhat of a disdain for SQL. As with all developers, you will have expertise in some realms of development and need for improvement in others. It didn't take long in my development career (I'm still quite new) for me to realize that SQL was somewhat of a handicap I needed to overcome in order to be a more complete developer. Most people on Leetcode, including myself, are not here for the database problems but for the data structure and algorithm problems. Nonetheless, I wanted to improve my SQL capabilities, and I was happy to see Leetcode had a generous selection of database problems.

Preparation for Leetcode database problems: It didn't take long for me to realize I wasn't all that prepared for many of the SQL problems Leetcode had to offer. As with the data structure and algorithm problems, Leetcode assumes a certain level of familiarity on the part of the user. The problems given are supposed to be similar to interview problems after all!

All this to say: I realized I needed to do some "preparation" before I began my Leetcode work in earnest. After a few different false starts, I'm convinced the following resources will be helpful to anyone looking to adequately prepare for all of Leetcode's database problems:

SQL Tutorial (sqltutorial . org): General SQL reference for true SQL beginners. Downloading the sample database is highly encouraged for learning in an interactive fashion.
MySQL Tutorial (mysqltutorial . org): This is an excellent MySQL reference. It's similar to the reference immediately above, but this resource is restricted to MySQL as opposed to being completely database agnostic. Again, downloading the sample database is highly encouraged so as to learn in an interactive fashion.
SQLBolt (sqlbolt . com): Work through a tutorial that covers many SQL basics via interactive lessons.
PostgreSQL Exercises (pgexercises . com): Even if you are not using Postgres as your database, this is an excellent resource to work through in its entirety before embarking on all of the Leetcode problems. Its contents are largely applicable (except for issues of syntax and maybe a few other minor details) to all major databases (i.e., MySQL, SQL Server, Oracle, etc.). Its exercises cover SQL basics, joins and subqueries, data modification, aggregates, dates, strings, and recursive queries.

Once I had gone through the resources above, specifically the Postgres exercises, I felt like I was probably ready to tackle the Leetcode database problems in their entirety. I worked through every single Leetcode database problem over the span of about two weeks in July 2021. This is almost certainly not a normal pace--I devoted a great deal of time each day to knocking out the problems. And I learned a great deal in the process.

Study guide motivation and description: The two weeks I devoted to completing all of the Leetcode database problems were rather eye-opening. I quickly learned 5 things, all of which I will expound on more later in this study guide:

Understanding query execution order is critical. How can you accurately determine your result set if you do not know how it is being formed?
Consistent formatting matters. This is not just a cosmetic issue--your style and formatting can impact how you think.

The next three points were probably the biggest revelation to me throughout the course of my two week journey:

Window functions make difficult problems easy. There's a reason why window functions show up in most Medium/Hard problems. They're immensely powerful when used with care. I provide a listing of different window functions and some of the problems in which I have found them to be helpful.
Common table expressions (CTEs) are agents of clarity and utility. Have you ever gotten tired of making a mess by using sizable subqueries? CTEs provide the ability to use "named subqueries" that are especially useful when dealing with hierarchical data. You can also use recursive CTEs (more on this later) which add a whole new dimension to crafting powerful queries to solve complex problems. I provide problem solutions to illustrate how CTEs can be agents of clarity and utility.
Recursive CTEs. Tricky at first but opens up all sorts of possibilities, especially when it comes to dealing with hierarchical data. I provide some references, examples, and, just as with window functions, I provide a listing of different Leetcode problems where using WITH RECURSIVE is a viable option.

Everything below delves into each of these points in greater detail.

Note: I did all of the database problems with a Leetcode premium subscription. Most of the database problems require a premium subscription; hence, when Leetcode problems are explicitly referenced in this guide, it is fairly safe to assume they are problems that require a subscription to view. Most of the material in this guide, however, will be of use with or without a subscription.

Order of Execution of a Query

Short Version: The logical processing order of the SELECT statement is generally as follows:

FROM/JOIN (and all associated ON conditions)
WHERE
GROUP BY
HAVING
SELECT (including window functions)
DISTINCT
ORDER BY
LIMIT/OFFSET

Long Version:

Motivation: Without hesitation, compute the following as quickly as you can in your head:

1 + 2 / 3 * 4 ^ 5 - 6

This immediately comes off as a "trick PEMDAS problem," where the goal is to "test" whether or not someone "really" understands the order of operations for evaluating mathematical expressions. But you would never encounter an expression like the one above in the wild. Why? Because it's been intentionally made unclear as to how to go about evaluating the given expression. Specifically, the complete absence of parentheses makes it unnecessarily difficult to understand what the result of the mathematical expression should be.

Importance: Trying to write a SQL SELECT statement without understanding its logical processing order is like trying to evaluate a complicated mathematical expression with no parentheses. It's unnecessarily difficult, and it can lead to uncertainty in what the result set for a query should be.

Documentation: If you search for order of execution of a SQL query, then you are likely to stumble across threads on Stack Overflow, and who knows what else. But there's little in the way of "official documentation" except the following helpful excerpt from the SQL Server docs

The following steps show the logical processing order, or binding order, for a SELECT statement. This order determines when the objects defined in one step are made available to the clauses in subsequent steps. For example, if the query processor can bind to (access) the tables or views defined in the FROM clause, these objects and their columns are made available to all subsequent steps. Conversely, because the SELECT clause is step 8, any column aliases or derived columns defined in that clause cannot be referenced by preceding clauses. However, they can be referenced by subsequent clauses such as the ORDER BY clause. The actual physical execution of the statement is determined by the query processor and the order may vary from this list.

FROM

ON

JOIN

WHERE

GROUP BY

WITH CUBE or WITH ROLLUP

HAVING

SELECT

DISTINCT

ORDER BY

TOP

This excerpt ends with a warning:

The preceding sequence is usually true. However, there are uncommon cases where the sequence may differ.

Mental model (how to think about SELECT statements): For the sake of simplicity, the list above may be expressed in the following manner to capture the (usually true) logical processing order of the SELECT statement:

FROM/JOIN (and all associated ON conditions)
WHERE
GROUP BY
HAVING
SELECT (including window functions)
DISTINCT
ORDER BY
LIMIT/OFFSET

The listing above is how you should think about a SELECT query, but this is different from how you would write a SELECT query. How do we typically write a SELECT query?

Writing model (how to write SELECT statements): If we were to hop over to the Postgres docs, then we might be a little overwhelmed by what's provided for the "synopsis" of the SELECT statement (reproduced below).

SELECT statement synopsis (Postgres)

[ WITH [ RECURSIVE ] with_query [, ...] ]
SELECT [ ALL | DISTINCT [ ON ( expression [, ...] ) ] ]
    [ * | expression [ [ AS ] output_name ] [, ...] ]
    [ FROM from_item [, ...] ]
    [ WHERE condition ]
    [ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
    [ HAVING condition ]
    [ WINDOW window_name AS ( window_definition ) [, ...] ]
    [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
    [ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]
    [ LIMIT { count | ALL } ]
    [ OFFSET start [ ROW | ROWS ] ]
    [ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES } ]
    [ FOR { UPDATE | NO KEY UPDATE | SHARE | KEY SHARE } [ OF table_name [, ...] ] [ NOWAIT | SKIP LOCKED ] [...] ]

where from_item can be one of:

    [ ONLY ] table_name [ * ] [ [ AS ] alias [ ( column_alias [, ...] ) ] ]
                [ TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ] ]
    [ LATERAL ] ( select ) [ AS ] alias [ ( column_alias [, ...] ) ]
    with_query_name [ [ AS ] alias [ ( column_alias [, ...] ) ] ]
    [ LATERAL ] function_name ( [ argument [, ...] ] )
                [ WITH ORDINALITY ] [ [ AS ] alias [ ( column_alias [, ...] ) ] ]
    [ LATERAL ] function_name ( [ argument [, ...] ] ) [ AS ] alias ( column_definition [, ...] )
    [ LATERAL ] function_name ( [ argument [, ...] ] ) AS ( column_definition [, ...] )
    [ LATERAL ] ROWS FROM( function_name ( [ argument [, ...] ] ) [ AS ( column_definition [, ...] ) ] [, ...] )
                [ WITH ORDINALITY ] [ [ AS ] alias [ ( column_alias [, ...] ) ] ]
    from_item [ NATURAL ] join_type from_item [ ON join_condition | USING ( join_column [, ...] ) [ AS join_using_alias ] ]

and grouping_element can be one of:

    ( )
    expression
    ( expression [, ...] )
    ROLLUP ( { expression | ( expression [, ...] ) } [, ...] )
    CUBE ( { expression | ( expression [, ...] ) } [, ...] )
    GROUPING SETS ( grouping_element [, ...] )

and with_query is:

    with_query_name [ ( column_name [, ...] ) ] AS [ [ NOT ] MATERIALIZED ] ( select | values | insert | update | delete )
        [ SEARCH { BREADTH | DEPTH } FIRST BY column_name [, ...] SET search_seq_col_name ]
        [ CYCLE column_name [, ...] SET cycle_mark_col_name [ TO cycle_mark_value DEFAULT cycle_mark_default ] USING cycle_path_col_name ]

TABLE [ ONLY ] table_name [ * ]

We can produce a more useful writing model by stripping the general SELECT statement down to its essentials:

-- writing model for a general SELECT statement
SELECT DISTINCT 
  column, AGG_FUNC(column_or_expression), ...
FROM 
  mytable M JOIN another_table A ON M.column = A.column
WHERE 
  constraint_expression
GROUP BY 
  column
HAVING 
  constraint_expression
ORDER BY 
  column ASC/DESC
LIMIT 
  returnCount OFFSET skipCount;

Tabular summary of writing model:

Order	Operation	Can use column alias?	Note
`1`	`FROM`/`JOIN`	❌	Declare table aliases here and `FROM` what table (or combined tables when using one or more `JOIN`s) you will be selecting data; the total working set of data is determined here--the resultant working set of data may be thought of as a "virtual table" to which all subsequent operations will apply (this is the idea behind a view which is essentially a commonly used working set of data or "virtual table" created by a query, often a somewhat complicated one, whose result set is given a name (i.e., the name of the view); subsequent queries can then be easily written to target this predetermined working set of data, virtual table, named query, view or whatever you want to call it--such a construct is most often called a view)
`2`	`WHERE`	❌	Apply first-pass constraints to individual rows
`3`	`GROUP BY`	❌	Group remaining rows (i.e., the rows remaining after the first-pass `WHERE` constraints have filtered out all individual rows not meeting the specified criteria) based on common values in specified column(s); be mindful when using more than one field in this clause, namely `GROUP BY ColOrExpr` means put all those rows with the same value for `ColOrExpr` in one group while something like `GROUP BY ColOrExpr1, ..., ColOrExprN` means put all those rows with the same values for all `N` columns or expressions in one group
`4`	`HAVING`	❌	Apply constraints to an aggregate or a group of rows that have been `GROUPed BY` some criteria; essentially, `WHERE` applies to individual rows while `HAVING` applies to groups of rows
`5`	`SELECT`	―	Declare column aliases here
`6`	`DISTINCT`	―	Use a more restrictive version of `SELECT` by specifying which rows should be considered duplicates, and thus removed from the result set, by using the syntax `SELECT DISTINCT` which generally takes one of two forms: `SELECT DISTINCT ColOrExpr`: Here, only one column or expression is specified so the values from this column or expression are used to evaluate the duplicate. `SELECT DISTINCT ColOrExpr1, ..., ColOrExprN`: Here, `N` columns and/or expressions have been specified so the combination of values of these columns and/or expressions will be used to evaluate the duplicates
`7`	`ORDER BY`	✅	Sort a query result set by using the syntax `ORDER BY SortExpression1 [ASC ｜ DESC], ..., SortExpressionN [ASC ｜ DESC]` where `SortExpression` can be either a column or an expression and `ASC ｜ DESC` specifies whether to sort the column or expression in `ASC`ending order (default) or in `DESC`ending order; note that you can sort a result set by multiple columns and/or expressions where each listed `SortExpression` takes precedence over the next (e.g., something like `ORDER BY LastName DESC, Age ASC` would order rows first by `LastName` in descending order and then, within that ordered set, sort the remaining results by `AGE` in ascending order)
`8`	`LIMIT`/`OFFSET`	✅	Return a (possibly offset) subset of rows generated by a query; the syntax `LIMIT RowsToReturn OFFSET RowsToSkip` indicates the maximum number of rows that should be returned by a query (i.e, `RowsToReturn`) as well as how many rows to skip from the initial query results (i.e., `RowsToSkip`); almost always use `ORDER BY` when using `LIMIT/OFFSET` to ensure a predictable result set--this is due to the fact that the order in which records are returned from a database is often random unless order is imposed by means of `ORDER BY`

List summary of writing model:

FROM/JOIN: The FROM clause and subsequent JOINs are first executed to determine the total working set of data that is being queried. This includes subqueries in this clause, and can cause temporary tables to be created under the hood containing all the columns and rows of the tables being joined.
WHERE: Once we have the total working set of data, the first-pass WHERE constraints are applied to the individual rows. The rows that do not satisfy the constraint are discarded. Each of the constraints can only access columns directly from the tables requested in the FROM clause. Aliases in the SELECT part of the query are not accessible in most databases since they may include expressions dependent on parts of the query that have not yet executed.
GROUP BY: The remaining rows after the WHERE constraints are applied are then grouped based on common values in the column(s) specified in the GROUP BY clause. As a result of the grouping, there will only be as many rows as there are unique values in that column. Implicitly, this means that you should only need to use this when you have aggregate functions in your query.
HAVING: If the query has a GROUP BY clause, then the constraints in the HAVING clause are then applied to the grouped rows, discarding the grouped rows that do not satisfy the constraint(s). Like the WHERE clause, aliases are also not accessible from this step in most databases. Finally, just remember that the WHERE clause is to individual rows what HAVING is to grouped rows by means of GROUP BY (i.e., HAVING is the multi-row version of WHERE).
SELECT: Any expressions in the SELECT part of the query are finally computed.
DISTINCT: Of the remaining rows, rows with duplicate values in the column(s) marked as DISTINCT will be discarded.
ORDER BY: If an order is specified by the ORDER BY clause, the rows are then sorted by the specified data in either ascending or descending order. Since all the expressions in the SELECT part of the query have been computed, you can reference aliases in this clause.
LIMIT/OFFSET: Finally, the rows that fall outside the range specified by the LIMIT and OFFSET are discarded, leaving the final set of rows to be returned from the query.

Style Guide

Rationale behind formatting guidelines below

As noted in [1], there is no agreed-upon standard for formatting SQL queries, and any guide for writing code is necessarily subjective. What is true objectively, however, is that your goal should be to communicate what your query is doing.

Formatting is important in order to accomplish this goal. It is not simply a cosmetic concern. Think about how difficult it would be to read text without punctuation, capitalization, paragraphs, indentation, etc. Less dramatically, a SQL query should be crafted so that its intention is clear; clarity should come at the cost of concision (except, of course, if this entails performance issues and the like).

For example, the following is a one-line solution to the first Leetcode database problem, 175. Combine Two Tables:

SELECT FirstName, LastName, City, State FROM Person LEFT JOIN Address USING(PersonId);

This is a perfectly legitimate solution, but could it be written more clearly? Of course:

SELECT
  P.FirstName,
  P.LastName,
  A.City,
  A.State
FROM
  Person P
  LEFT JOIN Address A ON P.PersonId = A.PersonId;

Adopting your own formatting style will inevitably take some time, but try to be judicious and consistent in the habits you form and the practices you choose. I share below different rules I have adopted that have been heavily influenced by [1], my everyday development work, and, of course, my experience in solving all of the Leetcode database problems. I have found following these rules to be helpful in clarifying my own mental model when writing queries, and I hope you find them to be similarly helpful. Take what serves you well and leave the rest.

Formatting Guidelines: Click the widget above to see the rationale behind everything that appears immediately below.

CATEGORY	DESCRIPTION
Table aliases	Use table aliases that are abbreviations for the table name. It helps to define table aliases without the `AS` keyword so as to immediately distinguish table aliases from column aliases (see next line).
Column aliases	Use `AS` to define column aliases. Columns are generally qualified, meaning that they use table aliases.
Dot notation	Use dot notation (i.e., the membership operator `.`) in conjunction with table aliases to clearly indicate from which table a column is being accessed. For example, if you are trying to access the column `UserId` from the `Users` table, then the `Users` table can be aliased with `U` and the `UserId` column can be accessed via dot notation as follows: `U.UserId`.
Consistency	Be consistent in capitalization, in usage of underscores, indentation, etc.
Readability	Write the code to be understandable, so you and someone else can read it.
Table and column names	Always use only alphanumeric characters and underscores for table and column names. Other characters, such as spaces, require that the name be escaped when referenced. The escape characters, typically double quotes or square braces, make it hard to write and read queries.
Plurality of table names	Table names are usually in plural (this helps avoid the problem with reserved words) and reinforces the idea that tables contain multiple instances of the entity.
Singularity of primary key column names	The primary key is the singular of the table name followed by "Id." For example, `OrderId` and `SubscriberId` could be expected primary key column names for an `Orders` and `Subscribers` table, respectively.
Foreign key and primary key naming consistency	When a foreign key column references another table's primary key column, use the exact same name for both columns, ensuring consistency and making it easy to see relationships between tables.
CamelCase	"CamelBack" case is used (upper case for each new word, lowercase for the rest). Hence, `OrderId` instead of `Order_Id`. In general, table names and column names are not case sensitive. The CamelBack method is used to make it easier to read the name, while at the same time keeping the name shorter (than if using underscores). Most table and column names use CamelBack casing.
Underscore usage	The underscore is used for grouping common columns together. For instance, in a `Calendar` table, the indicators for holidays for specific religions might start with `hol_`.
Reserved words	Refrain from using SQL reserved words. Databases have their own special words, but words like `ORDER`, `GROUP`, and `VALUES` are keywords in the language and should be avoided. Most keywords are capitalized.
Left alignment of high-level clauses	The high-level clauses defined by the SQL language are all aligned on the left. These are `WITH`, `SELECT`, `FROM`, `WHERE`, `GROUP BY`, `HAVING`, and `ORDER BY`.
Alignment within clauses	Within a clause, subsequent lines are aligned after and (usually) underneath the keyword, so the scope of each clause is visually obvious.
Alignment within subqueries	Subqueries follow similar rules, so all the main clauses of a subquery are indented, but still aligned on the left.
Alignment within `FROM` clause	Within the `FROM` clause, table names and subqueries start on a new line (the tables are then aligned and easier to see). The `JOIN` keywords (i.e., `LEFT JOIN`, `INNER JOIN`, etc.) start on their own line and the `ON` predicate immediately follows (i.e., appears on the same line as its associated `JOIN` keyword).
Operator spacing	Operators generally have spaces around them.
Comma placement	Commas are at the end of a line, just as a human would place them.
Parentheses across multiple lines	A closing parenthesis, when on a subsequent line, is aligned under the opening parenthesis.
`CASE` statements and parentheses	`CASE` statements are always surrounded by parentheses.

Some of the guidelines above are less relevant than others in the context of solving Leetcode problems (e.g., you have no choice concerning the naming of tables in whatever schema you are provided for a particular problem), but I have found most of them to be immensely useful in the process of cobbling together my own queries.

As always, rules are made to be broken. The rules above are meant to provide you freedom and clarity. They are not meant to be a straightjacket. I try to follow my own rules most of the time, but I occasionally make exceptions (especially when using UNION or UNION ALL for some reason) or slip up (more often than I care to admit). Adopt what you find to be helpful. Leave the rest.

References

[1].
Gordon Linoff.
Data Analysis Using SQL and Excel.
Wiley. 2016. 39.

[2].
Alan Beaulieu.
Learning SQL.
O'Reilly. 2020. 267.

[3].
Anthony Molinaro.
SQL Cookbook.
O'Reilly. 2021. 519.

[4].
Anthony Molinaro.
SQL Cookbook.
O'Reilly. 2021. 520.

[5].
Alan Beaulieu.
Learning SQL.
O'Reilly. 2020. 181.

[6].
Anthony Molinaro.
SQL Cookbook.
O'Reilly. 2021. 535-537.

Comments (13)