Note: The four-part series that follows details my journey and helpful things I discovered along the way in solving all of the Leetcode database problems over roughly 2 weeks in July 2021. This was originally only 1 post, but you will quickly see why I ran into a content length issue (didn't know Leetcode had that for its discussion posts!). The references included at the end apply to all four parts.
Personal background: I originally had somewhat of a disdain for SQL. As with all developers, you will have expertise in some realms of development and need for improvement in others. It didn't take long in my development career (I'm still quite new) for me to realize that SQL was somewhat of a handicap I needed to overcome in order to be a more complete developer. Most people on Leetcode, including myself, are not here for the database problems but for the data structure and algorithm problems. Nonetheless, I wanted to improve my SQL capabilities, and I was happy to see Leetcode had a generous selection of database problems.
Preparation for Leetcode database problems: It didn't take long for me to realize I wasn't all that prepared for many of the SQL problems Leetcode had to offer. As with the data structure and algorithm problems, Leetcode assumes a certain level of familiarity on the part of the user. The problems given are supposed to be similar to interview problems after all!
All this to say: I realized I needed to do some "preparation" before I began my Leetcode work in earnest. After a few different false starts, I'm convinced the following resources will be helpful to anyone looking to adequately prepare for all of Leetcode's database problems:
sqltutorial . org): General SQL reference for true SQL beginners. Downloading the sample database is highly encouraged for learning in an interactive fashion.mysqltutorial . org): This is an excellent MySQL reference. It's similar to the reference immediately above, but this resource is restricted to MySQL as opposed to being completely database agnostic. Again, downloading the sample database is highly encouraged so as to learn in an interactive fashion.sqlbolt . com): Work through a tutorial that covers many SQL basics via interactive lessons.pgexercises . com): Even if you are not using Postgres as your database, this is an excellent resource to work through in its entirety before embarking on all of the Leetcode problems. Its contents are largely applicable (except for issues of syntax and maybe a few other minor details) to all major databases (i.e., MySQL, SQL Server, Oracle, etc.). Its exercises cover SQL basics, joins and subqueries, data modification, aggregates, dates, strings, and recursive queries.Once I had gone through the resources above, specifically the Postgres exercises, I felt like I was probably ready to tackle the Leetcode database problems in their entirety. I worked through every single Leetcode database problem over the span of about two weeks in July 2021. This is almost certainly not a normal pace--I devoted a great deal of time each day to knocking out the problems. And I learned a great deal in the process.
Study guide motivation and description: The two weeks I devoted to completing all of the Leetcode database problems were rather eye-opening. I quickly learned 5 things, all of which I will expound on more later in this study guide:
The next three points were probably the biggest revelation to me throughout the course of my two week journey:
WITH RECURSIVE is a viable option.Everything below delves into each of these points in greater detail.
Note: I did all of the database problems with a Leetcode premium subscription. Most of the database problems require a premium subscription; hence, when Leetcode problems are explicitly referenced in this guide, it is fairly safe to assume they are problems that require a subscription to view. Most of the material in this guide, however, will be of use with or without a subscription.
Short Version: The logical processing order of the SELECT statement is generally as follows:
FROM/JOIN (and all associated ON conditions)WHEREGROUP BYHAVINGSELECT (including window functions)DISTINCTORDER BYLIMIT/OFFSETLong Version:
Motivation: Without hesitation, compute the following as quickly as you can in your head:
1 + 2 / 3 * 4 ^ 5 - 6This immediately comes off as a "trick PEMDAS problem," where the goal is to "test" whether or not someone "really" understands the order of operations for evaluating mathematical expressions. But you would never encounter an expression like the one above in the wild. Why? Because it's been intentionally made unclear as to how to go about evaluating the given expression. Specifically, the complete absence of parentheses makes it unnecessarily difficult to understand what the result of the mathematical expression should be.
Importance: Trying to write a SQL SELECT statement without understanding its logical processing order is like trying to evaluate a complicated mathematical expression with no parentheses. It's unnecessarily difficult, and it can lead to uncertainty in what the result set for a query should be.
Documentation: If you search for order of execution of a SQL query, then you are likely to stumble across threads on Stack Overflow, and who knows what else. But there's little in the way of "official documentation" except the following helpful excerpt from the SQL Server docs
The following steps show the logical processing order, or binding order, for a
SELECTstatement. This order determines when the objects defined in one step are made available to the clauses in subsequent steps. For example, if the query processor can bind to (access) the tables or views defined in theFROMclause, these objects and their columns are made available to all subsequent steps. Conversely, because theSELECTclause is step 8, any column aliases or derived columns defined in that clause cannot be referenced by preceding clauses. However, they can be referenced by subsequent clauses such as theORDER BYclause. The actual physical execution of the statement is determined by the query processor and the order may vary from this list.
FROMONJOINWHEREGROUP BYWITH CUBEorWITH ROLLUPHAVINGSELECTDISTINCTORDER BYTOP
This excerpt ends with a warning:
The preceding sequence is usually true. However, there are uncommon cases where the sequence may differ.
Mental model (how to think about SELECT statements): For the sake of simplicity, the list above may be expressed in the following manner to capture the (usually true) logical processing order of the SELECT statement:
FROM/JOIN (and all associated ON conditions)WHEREGROUP BYHAVINGSELECT (including window functions)DISTINCTORDER BYLIMIT/OFFSETThe listing above is how you should think about a SELECT query, but this is different from how you would write a SELECT query. How do we typically write a SELECT query?
Writing model (how to write SELECT statements): If we were to hop over to the Postgres docs, then we might be a little overwhelmed by what's provided for the "synopsis" of the SELECT statement (reproduced below).
SELECT statement synopsis (Postgres)[ WITH [ RECURSIVE ] with_query [, ...] ]
SELECT [ ALL | DISTINCT [ ON ( expression [, ...] ) ] ]
[ * | expression [ [ AS ] output_name ] [, ...] ]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition ]
[ WINDOW window_name AS ( window_definition ) [, ...] ]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]
[ LIMIT { count | ALL } ]
[ OFFSET start [ ROW | ROWS ] ]
[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES } ]
[ FOR { UPDATE | NO KEY UPDATE | SHARE | KEY SHARE } [ OF table_name [, ...] ] [ NOWAIT | SKIP LOCKED ] [...] ]
where from_item can be one of:
[ ONLY ] table_name [ * ] [ [ AS ] alias [ ( column_alias [, ...] ) ] ]
[ TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ] ]
[ LATERAL ] ( select ) [ AS ] alias [ ( column_alias [, ...] ) ]
with_query_name [ [ AS ] alias [ ( column_alias [, ...] ) ] ]
[ LATERAL ] function_name ( [ argument [, ...] ] )
[ WITH ORDINALITY ] [ [ AS ] alias [ ( column_alias [, ...] ) ] ]
[ LATERAL ] function_name ( [ argument [, ...] ] ) [ AS ] alias ( column_definition [, ...] )
[ LATERAL ] function_name ( [ argument [, ...] ] ) AS ( column_definition [, ...] )
[ LATERAL ] ROWS FROM( function_name ( [ argument [, ...] ] ) [ AS ( column_definition [, ...] ) ] [, ...] )
[ WITH ORDINALITY ] [ [ AS ] alias [ ( column_alias [, ...] ) ] ]
from_item [ NATURAL ] join_type from_item [ ON join_condition | USING ( join_column [, ...] ) [ AS join_using_alias ] ]
and grouping_element can be one of:
( )
expression
( expression [, ...] )
ROLLUP ( { expression | ( expression [, ...] ) } [, ...] )
CUBE ( { expression | ( expression [, ...] ) } [, ...] )
GROUPING SETS ( grouping_element [, ...] )
and with_query is:
with_query_name [ ( column_name [, ...] ) ] AS [ [ NOT ] MATERIALIZED ] ( select | values | insert | update | delete )
[ SEARCH { BREADTH | DEPTH } FIRST BY column_name [, ...] SET search_seq_col_name ]
[ CYCLE column_name [, ...] SET cycle_mark_col_name [ TO cycle_mark_value DEFAULT cycle_mark_default ] USING cycle_path_col_name ]
TABLE [ ONLY ] table_name [ * ]We can produce a more useful writing model by stripping the general SELECT statement down to its essentials:
-- writing model for a general SELECT statement
SELECT DISTINCT
column, AGG_FUNC(column_or_expression), ...
FROM
mytable M JOIN another_table A ON M.column = A.column
WHERE
constraint_expression
GROUP BY
column
HAVING
constraint_expression
ORDER BY
column ASC/DESC
LIMIT
returnCount OFFSET skipCount;Tabular summary of writing model:
| Order | Operation | Can use column alias? | Note |
|---|---|---|---|
1 | FROM/JOIN | ❌ | Declare table aliases here and FROM what table (or combined tables when using one or more JOINs) you will be selecting data; the total working set of data is determined here--the resultant working set of data may be thought of as a "virtual table" to which all subsequent operations will apply (this is the idea behind a view which is essentially a commonly used working set of data or "virtual table" created by a query, often a somewhat complicated one, whose result set is given a name (i.e., the name of the view); subsequent queries can then be easily written to target this predetermined working set of data, virtual table, named query, view or whatever you want to call it--such a construct is most often called a view) |
2 | WHERE | ❌ | Apply first-pass constraints to individual rows |
3 | GROUP BY | ❌ | Group remaining rows (i.e., the rows remaining after the first-pass WHERE constraints have filtered out all individual rows not meeting the specified criteria) based on common values in specified column(s); be mindful when using more than one field in this clause, namely GROUP BY ColOrExpr means put all those rows with the same value for ColOrExpr in one group while something like GROUP BY ColOrExpr1, ..., ColOrExprN means put all those rows with the same values for all N columns or expressions in one group |
4 | HAVING | ❌ | Apply constraints to an aggregate or a group of rows that have been GROUPed BY some criteria; essentially, WHERE applies to individual rows while HAVING applies to groups of rows |
5 | SELECT | ― | Declare column aliases here |
6 | DISTINCT | ― | Use a more restrictive version of SELECT by specifying which rows should be considered duplicates, and thus removed from the result set, by using the syntax SELECT DISTINCT which generally takes one of two forms:
|
7 | ORDER BY | ✅ | Sort a query result set by using the syntax ORDER BY SortExpression1 [ASC | DESC], ..., SortExpressionN [ASC | DESC] where SortExpression can be either a column or an expression and ASC | DESC specifies whether to sort the column or expression in ASCending order (default) or in DESCending order; note that you can sort a result set by multiple columns and/or expressions where each listed SortExpression takes precedence over the next (e.g., something like ORDER BY LastName DESC, Age ASC would order rows first by LastName in descending order and then, within that ordered set, sort the remaining results by AGE in ascending order) |
8 | LIMIT/OFFSET | ✅ | Return a (possibly offset) subset of rows generated by a query; the syntax LIMIT RowsToReturn OFFSET RowsToSkip indicates the maximum number of rows that should be returned by a query (i.e, RowsToReturn) as well as how many rows to skip from the initial query results (i.e., RowsToSkip); almost always use ORDER BY when using LIMIT/OFFSET to ensure a predictable result set--this is due to the fact that the order in which records are returned from a database is often random unless order is imposed by means of ORDER BY |
List summary of writing model:
FROM/JOIN: The FROM clause and subsequent JOINs are first executed to determine the total working set of data that is being queried. This includes subqueries in this clause, and can cause temporary tables to be created under the hood containing all the columns and rows of the tables being joined.WHERE: Once we have the total working set of data, the first-pass WHERE constraints are applied to the individual rows. The rows that do not satisfy the constraint are discarded. Each of the constraints can only access columns directly from the tables requested in the FROM clause. Aliases in the SELECT part of the query are not accessible in most databases since they may include expressions dependent on parts of the query that have not yet executed.GROUP BY: The remaining rows after the WHERE constraints are applied are then grouped based on common values in the column(s) specified in the GROUP BY clause. As a result of the grouping, there will only be as many rows as there are unique values in that column. Implicitly, this means that you should only need to use this when you have aggregate functions in your query.HAVING: If the query has a GROUP BY clause, then the constraints in the HAVING clause are then applied to the grouped rows, discarding the grouped rows that do not satisfy the constraint(s). Like the WHERE clause, aliases are also not accessible from this step in most databases. Finally, just remember that the WHERE clause is to individual rows what HAVING is to grouped rows by means of GROUP BY (i.e., HAVING is the multi-row version of WHERE).SELECT: Any expressions in the SELECT part of the query are finally computed.DISTINCT: Of the remaining rows, rows with duplicate values in the column(s) marked as DISTINCT will be discarded.ORDER BY: If an order is specified by the ORDER BY clause, the rows are then sorted by the specified data in either ascending or descending order. Since all the expressions in the SELECT part of the query have been computed, you can reference aliases in this clause.LIMIT/OFFSET: Finally, the rows that fall outside the range specified by the LIMIT and OFFSET are discarded, leaving the final set of rows to be returned from the query.As noted in [1], there is no agreed-upon standard for formatting SQL queries, and any guide for writing code is necessarily subjective. What is true objectively, however, is that your goal should be to communicate what your query is doing.
Formatting is important in order to accomplish this goal. It is not simply a cosmetic concern. Think about how difficult it would be to read text without punctuation, capitalization, paragraphs, indentation, etc. Less dramatically, a SQL query should be crafted so that its intention is clear; clarity should come at the cost of concision (except, of course, if this entails performance issues and the like).
For example, the following is a one-line solution to the first Leetcode database problem, 175. Combine Two Tables:
SELECT FirstName, LastName, City, State FROM Person LEFT JOIN Address USING(PersonId);This is a perfectly legitimate solution, but could it be written more clearly? Of course:
SELECT
P.FirstName,
P.LastName,
A.City,
A.State
FROM
Person P
LEFT JOIN Address A ON P.PersonId = A.PersonId;Adopting your own formatting style will inevitably take some time, but try to be judicious and consistent in the habits you form and the practices you choose. I share below different rules I have adopted that have been heavily influenced by [1], my everyday development work, and, of course, my experience in solving all of the Leetcode database problems. I have found following these rules to be helpful in clarifying my own mental model when writing queries, and I hope you find them to be similarly helpful. Take what serves you well and leave the rest.
Formatting Guidelines: Click the widget above to see the rationale behind everything that appears immediately below.
| CATEGORY | DESCRIPTION |
|---|---|
| Table aliases | Use table aliases that are abbreviations for the table name. It helps to define table aliases without the AS keyword so as to immediately distinguish table aliases from column aliases (see next line). |
| Column aliases | Use AS to define column aliases. Columns are generally qualified, meaning that they use table aliases. |
| Dot notation | Use dot notation (i.e., the membership operator .) in conjunction with table aliases to clearly indicate from which table a column is being accessed. For example, if you are trying to access the column UserId from the Users table, then the Users table can be aliased with U and the UserId column can be accessed via dot notation as follows: U.UserId. |
| Consistency | Be consistent in capitalization, in usage of underscores, indentation, etc. |
| Readability | Write the code to be understandable, so you and someone else can read it. |
| Table and column names | Always use only alphanumeric characters and underscores for table and column names. Other characters, such as spaces, require that the name be escaped when referenced. The escape characters, typically double quotes or square braces, make it hard to write and read queries. |
| Plurality of table names | Table names are usually in plural (this helps avoid the problem with reserved words) and reinforces the idea that tables contain multiple instances of the entity. |
| Singularity of primary key column names | The primary key is the singular of the table name followed by "Id." For example, OrderId and SubscriberId could be expected primary key column names for an Orders and Subscribers table, respectively. |
| Foreign key and primary key naming consistency | When a foreign key column references another table's primary key column, use the exact same name for both columns, ensuring consistency and making it easy to see relationships between tables. |
| CamelCase | "CamelBack" case is used (upper case for each new word, lowercase for the rest). Hence, OrderId instead of Order_Id. In general, table names and column names are not case sensitive. The CamelBack method is used to make it easier to read the name, while at the same time keeping the name shorter (than if using underscores). Most table and column names use CamelBack casing. |
| Underscore usage | The underscore is used for grouping common columns together. For instance, in a Calendar table, the indicators for holidays for specific religions might start with hol_. |
| Reserved words | Refrain from using SQL reserved words. Databases have their own special words, but words like ORDER, GROUP, and VALUES are keywords in the language and should be avoided. Most keywords are capitalized. |
| Left alignment of high-level clauses | The high-level clauses defined by the SQL language are all aligned on the left. These are WITH, SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY. |
| Alignment within clauses | Within a clause, subsequent lines are aligned after and (usually) underneath the keyword, so the scope of each clause is visually obvious. |
| Alignment within subqueries | Subqueries follow similar rules, so all the main clauses of a subquery are indented, but still aligned on the left. |
Alignment within FROM clause | Within the FROM clause, table names and subqueries start on a new line (the tables are then aligned and easier to see). The JOIN keywords (i.e., LEFT JOIN, INNER JOIN, etc.) start on their own line and the ON predicate immediately follows (i.e., appears on the same line as its associated JOIN keyword). |
| Operator spacing | Operators generally have spaces around them. |
| Comma placement | Commas are at the end of a line, just as a human would place them. |
| Parentheses across multiple lines | A closing parenthesis, when on a subsequent line, is aligned under the opening parenthesis. |
CASE statements and parentheses | CASE statements are always surrounded by parentheses. |
Some of the guidelines above are less relevant than others in the context of solving Leetcode problems (e.g., you have no choice concerning the naming of tables in whatever schema you are provided for a particular problem), but I have found most of them to be immensely useful in the process of cobbling together my own queries.
As always, rules are made to be broken. The rules above are meant to provide you freedom and clarity. They are not meant to be a straightjacket. I try to follow my own rules most of the time, but I occasionally make exceptions (especially when using UNION or UNION ALL for some reason) or slip up (more often than I care to admit). Adopt what you find to be helpful. Leave the rest.
[1].
Gordon Linoff.
Data Analysis Using SQL and Excel.
Wiley. 2016. 39.
[2].
Alan Beaulieu.
Learning SQL.
O'Reilly. 2020. 267.
[3].
Anthony Molinaro.
SQL Cookbook.
O'Reilly. 2021. 519.
[4].
Anthony Molinaro.
SQL Cookbook.
O'Reilly. 2021. 520.
[5].
Alan Beaulieu.
Learning SQL.
O'Reilly. 2020. 181.
[6].
Anthony Molinaro.
SQL Cookbook.
O'Reilly. 2021. 535-537.