D4: Performance & Querying Week 6

DAY 41

Snowflake SQL – Aggregate, Window Functions & QUALIFY

Day 40 skipped work with three caches. Today the work is the query itself. Aggregate functions like COUNT and SUM collapse many rows into one summary. Window functions run the same kind of math but keep every row. The OVER clause is the switch between them. You meet ROW_NUMBER, RANK, DENSE_RANK, LEAD, and LAG, then learn LISTAGG for joining values into one string. The closing piece is QUALIFY. It filters the output of a window function in a single statement, the way HAVING filters groups. Carry in the two rules from yesterday’s teaser. QUALIFY does for window functions what HAVING does for GROUP BY. RANK skips numbers after a tie, while DENSE_RANK does not.

🗣️ Plain-English First

This sub-objective brings new SQL words. Each one sounds like plain English but has a precise meaning the exam holds you to.

Term	What it sounds like	What it means in Snowflake
Window function	A function with a view	A function that calculates over a set of related rows and returns one value per input row. The rows stay. Nothing collapses.
OVER clause	Something on top	The clause that defines the window. It turns an ordinary aggregate into a window function and holds PARTITION BY and ORDER BY.
PARTITION BY	Splitting into parts	Inside OVER, it divides rows into groups. The window calculation restarts for each group. It is the window twin of GROUP BY.
QUALIFY	To meet a standard	A clause that filters rows based on a window-function result. It runs after the window functions are computed, so WHERE cannot do its job.
RANK vs DENSE_RANK	Two ranking words	Both give tied rows the same number. RANK then leaves a gap before the next value. DENSE_RANK leaves no gap.
LISTAGG	A list of things	An aggregate that joins a column’s values into one delimited string. The default delimiter is an empty string, not a comma.

📘

Aggregate Functions and the OVER Clause

Two ways to summarise rows

An aggregate function takes many rows and returns one value. COUNT, SUM, AVG, MIN, MAX, and LISTAGG all work this way. Pair them with GROUP BY and you get one summary row per group. A query grouped by customer returns one total per customer. The detail rows are gone from the result.

That collapse is the limit. You often want the summary placed next to every detail row, not in place of it. A running total per customer needs both the order line and the total beside it. GROUP BY cannot do that. The window function can.

The OVER clause keeps every row

A window function runs a calculation across a set of related rows and returns one value for each input row. The OVER clause defines that set. Add OVER to an aggregate and it stops collapsing. SUM(O_TOTALPRICE) OVER (PARTITION BY O_CUSTKEY) returns each customer’s total beside every one of that customer’s orders. The row count of the query does not change.

Behaviour	Aggregate with GROUP BY	Window function with OVER
Rows returned	One per group	One per input row
How it groups	GROUP BY	PARTITION BY inside OVER
Detail rows	Collapsed away	Preserved
Typical use	A summary report	A running total, a rank, a row offset

The exam phrases this as a choice. A stem wants a total shown on every detail row. GROUP BY is the trap answer because it drops the detail. The window form keeps it.

LISTAGG joins values into one string

LISTAGG concatenates the values of a column into a single delimited string. It is an aggregate, so it collapses a group unless you add OVER. The syntax names the column first, then an optional delimiter: LISTAGG(O_ORDERSTATUS, ','). An optional WITHIN GROUP (ORDER BY ...) sets the order of items inside the string.

Two details get tested. The default delimiter is an empty string. Leave the delimiter out and the values run together with nothing between them. A comma appears only when you pass one. The other detail is DISTINCT. It drops duplicate values before joining. The output has a 16 MB ceiling. Past that, the query errors unless you add ON OVERFLOW TRUNCATE.

🔢

Ranking and Row-Navigation Functions

Three ways to number rows

Three functions number rows within an ordered window. They differ only in how they treat ties. ROW_NUMBER gives every row a unique number and ignores ties. RANK gives tied rows the same number, then skips ahead. DENSE_RANK gives tied rows the same number and skips nothing.

Take four rows scored 100, 90, 90, 80, ordered high to low. The two 90s tie. Here is what each function returns.

Score	ROW_NUMBER	RANK	DENSE_RANK
100	1	1	1
90	2	2	2
90	3	2	2
80	4	4	3

Why the tie rule is tested

Read the last row. ROW_NUMBER reaches 4 because it never ties. RANK reaches 4 because it gave both 90s a 2, then jumped over 3. DENSE_RANK reaches 3 because it gave both 90s a 2, then continued with no gap. This is one of the cleaner distinctions in the SQL part of the exam. A stem often hands you a tie and asks for the exact sequence. RANK produces 1, 2, 2, 4. DENSE_RANK produces 1, 2, 2, 3.

PARTITION BY restarts the count

PARTITION BY inside OVER splits the rows into groups. The numbering restarts at 1 for each group. ROW_NUMBER() OVER (PARTITION BY O_CUSTKEY ORDER BY O_TOTALPRICE DESC) ranks each customer’s orders from the top, starting fresh per customer. Drop the PARTITION BY and you rank the whole table as one set.

The ORDER BY inside OVER is not optional for these functions. It sets the order that the ranking reads. Without it there is no defined first row, so the rank is meaningless. The ORDER BY inside OVER is separate from any ORDER BY that sorts the final result.

LEAD and LAG look across rows

Two functions read a value from a different row in the window. LAG reads backward to a preceding row. LEAD reads forward to a following row. Both need an ORDER BY to know what preceding and following mean. The default offset is one row. LAG(O_TOTALPRICE) returns the prior row’s price. LEAD(O_TOTALPRICE) returns the next row’s price. At the edge of the window, where no such row exists, the result is NULL unless you supply a default.

🎯

QUALIFY: Filtering on a Window Function

Why WHERE cannot do this

You rank orders per customer, then want only the top one from each. The natural move is WHERE rank = 1. It fails. A window function cannot appear in WHERE. The clause order is the reason. WHERE runs before window functions are computed, so the rank does not exist yet when WHERE is evaluated.

The old fix was a subquery or a CTE. Compute the rank in an inner query, then filter the rank in an outer query. It works but adds a layer of nesting to read.

QUALIFY is to windows what HAVING is to groups

QUALIFY filters rows based on the result of a window function. It runs after the window functions are computed, so it can see the rank that WHERE could not. The parallel is exact. HAVING filters groups after GROUP BY. QUALIFY filters rows after the window functions. One handles aggregates, the other handles windows.

QUALIFY needs at least one window function present. That function can sit in the SELECT list or directly in the QUALIFY predicate. A QUALIFY clause with no window function anywhere is an error.

Top-N per group in one statement

The pattern the exam rewards is top-N per group with no subquery. Rank inside the window, filter the rank in QUALIFY.

SQL

SELECT O_CUSTKEY, O_ORDERKEY, O_TOTALPRICE
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
QUALIFY ROW_NUMBER() OVER (
          PARTITION BY O_CUSTKEY
          ORDER BY O_TOTALPRICE DESC) = 1;

That returns the single highest-priced order per customer, whole row intact. Change the predicate to <= 3 for the top three. Swap ROW_NUMBER for RANK if you want ties included.

What the exam expects you to know

QUALIFY is not part of the ANSI SQL standard. The exam treats it as a Snowflake feature. A question asking how to get top-N per group without a subquery is pointing at QUALIFY. The other choices in such a stem usually misuse WHERE or HAVING. Both fail to reach a window-function result. In training sessions I have run, candidates reach for a nested subquery on this question because that is the habit from other databases. The single-statement QUALIFY answer is the one the exam wants.

⚡

Cheat Sheet

Concept	What to remember	Exam keyword
Aggregate vs window	Aggregate collapses rows. A window function keeps every row and adds a value	“one value per row”
OVER clause	Turns an aggregate into a window function. Holds PARTITION BY and ORDER BY	“OVER ( )”
PARTITION BY	Splits rows into groups inside the window. Numbering restarts per group	“restarts per partition”
ROW_NUMBER	Unique number per row. Ignores ties entirely	“always unique”
RANK on ties	Ties share a number, then a gap follows: 1, 2, 2, 4	“skips after a tie”
DENSE_RANK on ties	Ties share a number, no gap: 1, 2, 2, 3	“no gap”
LEAD vs LAG	LEAD reads the next row, LAG the previous. Default offset 1, NULL at edges	“next vs previous”
LISTAGG delimiter	Default is an empty string, not a comma. WITHIN GROUP sets the order	“empty-string default”
QUALIFY purpose	Filters window-function output after it is computed. HAVING for windows	“filter the rank”
Window in WHERE	Not allowed. WHERE runs before window functions. Use QUALIFY	“WHERE runs first”

🎯

Exam Tip

🎯 Exam Tip

The tie sequence is the most common ranking question. Given a tie, RANK returns 1, 2, 2, 4 and DENSE_RANK returns 1, 2, 2, 3. ROW_NUMBER ignores the tie and returns 1, 2, 3, 4. Read which sequence the stem shows and match the function. A stem that wants no gaps is DENSE_RANK.

For top-N per group, the answer is QUALIFY with a ranking function. A choice using WHERE rank <= N is the trap, because a window function cannot live in WHERE. The clause order puts WHERE before the window calculation. HAVING is also wrong here, since it filters groups, not window output.

Watch the LISTAGG delimiter. A question may show LISTAGG(col) with no delimiter and ask for the output. The values join with no separator, because the default delimiter is an empty string. Answers that insert commas are wrong unless a comma was passed.

LEAD and LAG point in opposite directions. LEAD reads forward to the next row. LAG reads back to the previous one. A stem describing “the following period” is LEAD. “The prior period” is LAG.

🛠️

Hands-On Lab

Type: LAB (read-only against the shared sample database, plus one inline VALUES query) | Time: ~10 minutes | Credits: <0.2 | Prerequisite: any edition trial, the shared lab_xs warehouse, and read access to SNOWFLAKE_SAMPLE_DATA. This lab reads sample tables only. It creates and drops nothing.

See the three ranking functions split on a tie. This inline query uses fixed values so the tie is guaranteed.

SQL

USE WAREHOUSE lab_xs;

SELECT score,
       ROW_NUMBER() OVER (ORDER BY score DESC) AS rn,
       RANK()       OVER (ORDER BY score DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY score DESC) AS drnk
FROM VALUES (100), (90), (90), (80) AS t(score)
ORDER BY score DESC;

👀 Observe: The two 90 rows tie. ROW_NUMBER gives them 2 and 3. RANK gives both 2, then jumps to 4. DENSE_RANK gives both 2, then continues at 3. That is the exact sequence the exam tests.

Get the top order per customer with QUALIFY. One statement, no subquery.

SQL

SELECT O_CUSTKEY, O_ORDERKEY, O_TOTALPRICE
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
QUALIFY ROW_NUMBER() OVER (
          PARTITION BY O_CUSTKEY
          ORDER BY O_TOTALPRICE DESC) = 1
LIMIT 50;

👀 Observe: One row per customer, each the highest-priced order for that customer. The whole row survives, including O_ORDERKEY. A GROUP BY with MAX would have dropped that column.

Compare each order to the previous one with LAG. Order by date, then read backward one row.

SQL

SELECT O_ORDERDATE,
       O_TOTALPRICE,
       LAG(O_TOTALPRICE) OVER (ORDER BY O_ORDERDATE) AS prev_price
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
ORDER BY O_ORDERDATE
LIMIT 20;

👀 Observe: Each row shows the prior order’s price beside its own. The very first row has a NULL in prev_price, because there is no earlier row. Swap LAG for LEAD and the NULL moves to the last row instead.

Join status values per customer with LISTAGG. Note the explicit comma delimiter.

SQL

SELECT O_CUSTKEY,
       LISTAGG(O_ORDERSTATUS, ',')
         WITHIN GROUP (ORDER BY O_ORDERDATE) AS statuses
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
GROUP BY O_CUSTKEY
LIMIT 10;

👀 Observe: Each customer’s order statuses join into one string, comma-separated, in date order. Remove the ',' argument and rerun. The values now run together with no separator, because the default delimiter is an empty string.

✓

Cleanup. Suspend the warehouse. Nothing was created, so there is no object to drop.

SQL

ALTER WAREHOUSE lab_xs SUSPEND;
-- Keep lab_xs itself. It is the shared warehouse reused across days.
-- SNOWFLAKE_SAMPLE_DATA is read-only and was never modified.

❓

Practice Questions

Q1: You need the single highest-priced order per customer, returned as a full row, in one query with no subquery or CTE. Which clause filters the ranked rows? ▸

Options:

A. WHERE ROW_NUMBER() OVER (…) = 1
B. HAVING ROW_NUMBER() OVER (…) = 1
C. QUALIFY ROW_NUMBER() OVER (…) = 1
D. GROUP BY O_CUSTKEY with MAX(O_TOTALPRICE)

✅ Answer: C

Why C: QUALIFY filters window-function output and runs after the window functions compute. It does this in one statement with no nesting.

Why not A: WHERE runs before window functions, so a window function placed there raises an error.

Why not B: HAVING filters groups after GROUP BY. It does not see window-function results.

Why not D: GROUP BY with MAX returns the top price but drops other columns of that order, such as O_ORDERKEY. The stem wants the whole row.

Q2: Four rows are ordered by a score with values 100, 90, 90, 80. What sequence does RANK() assign? ▸

Options:

A. 1, 2, 2, 3
B. 1, 2, 3, 4
C. 1, 2, 2, 4
D. 1, 1, 2, 3

✅ Answer: C

Why C: RANK gives tied rows the same number, then skips the values it passed over. The two 90s share rank 2, so the 80 row lands on 4.

Why not A: That is the DENSE_RANK output. DENSE_RANK leaves no gap after a tie.

Why not B: ROW_NUMBER produces this. It assigns a unique number and ignores the tie.

Why not D: No ranking function starts a fresh count partway down. The first row is always 1.

Q3: A query lists orders by date and needs each row to show the price of the chronologically next order. Which function returns it? ▸

Options:

A. LAG(O_TOTALPRICE) OVER (ORDER BY O_ORDERDATE)
B. LEAD(O_TOTALPRICE) OVER (ORDER BY O_ORDERDATE)
C. FIRST_VALUE(O_TOTALPRICE) OVER (ORDER BY O_ORDERDATE)
D. NTH_VALUE(O_TOTALPRICE, 1) OVER (ORDER BY O_ORDERDATE)

✅ Answer: B

Why B: LEAD reads forward to a following row. With its default offset of one, it returns the next order’s price.

Why not A: LAG reads backward. It returns the previous order’s price, not the next.

Why not C: FIRST_VALUE returns the first row of the window. That same value repeats for every row.

Why not D: NTH_VALUE with 1 returns the first row’s value, like FIRST_VALUE, not the following row.

Q4: Which TWO statements about LISTAGG in Snowflake are correct? (Select TWO.) ▸

Options:

A. With no delimiter argument, LISTAGG joins values with a comma by default
B. With no delimiter argument, LISTAGG joins values with an empty string by default
C. WITHIN GROUP (ORDER BY …) sets the order of the concatenated values
D. LISTAGG returns one row per input row, like a ranking window function
E. DISTINCT inside LISTAGG is ignored and changes nothing

✅ Answer: B, C

Why B: The documented default delimiter is an empty string. A comma appears only when you pass one.

Why C: The WITHIN GROUP order clause fixes the sequence of items inside the joined string.

Why not A: This is the common assumption. The default produces no separator at all, not a comma.

Why not D: LISTAGG is an aggregate. It collapses a group into one string unless paired with OVER as a window function.

Why not E: DISTINCT removes duplicate values before joining, so it changes the output.

Q5: A query ranks employees by salary within each department, then tries WHERE rank <= 2 to keep the top two per department. The query errors. Why? ▸

Options:

A. PARTITION BY is not allowed together with RANK
B. WHERE runs before window functions are computed, so it cannot reference the rank
C. RANK can only be used inside a GROUP BY query
D. The query needs DENSE_RANK, not RANK, before WHERE will work

✅ Answer: B

Why B: Window functions are evaluated after WHERE. The rank does not exist yet when WHERE runs. QUALIFY filters it instead.

Why not A: PARTITION BY is exactly how you rank within each department. It is required here, not forbidden.

Why not C: RANK needs OVER, not GROUP BY. It runs per row over a window, not per group.

Why not D: The function choice does not change clause order. DENSE_RANK in WHERE fails for the same reason.

❄️

Snowflake Documentation

Official docs for today’s topics. The exam pulls directly from these.

❄️

QUALIFY

docs.snowflake.com/en/sql-reference/constructs/qualify

→

❄️

Window Functions

docs.snowflake.com/en/sql-reference/functions-analytic

→

❄️

RANK and DENSE_RANK

docs.snowflake.com/en/sql-reference/functions/rank

→

❄️

LISTAGG

docs.snowflake.com/en/sql-reference/functions/listagg

→

📝 Recap

Today you learned: how aggregate and window functions differ, then how QUALIFY filters window output. An aggregate with GROUP BY collapses rows to one per group. A window function adds the OVER clause and keeps every row. ROW_NUMBER numbers rows uniquely and ignores ties. RANK gives ties the same number, then skips, for 1, 2, 2, 4. DENSE_RANK gives ties the same number with no skip, for 1, 2, 2, 3. PARTITION BY restarts the count per group. LEAD reads the next row, LAG the previous, with a default offset of one. LISTAGG joins a column into one string, with an empty-string default delimiter.

Key takeaway: a window function cannot sit in WHERE, because WHERE runs first. QUALIFY filters the window result the way HAVING filters groups. The top-N per group answer is QUALIFY with a ranking function, in one statement, no subquery. For a tie, match the sequence to the function: a gap means RANK, no gap means DENSE_RANK.

Tomorrow (Day 42): Semi-Structured Data with VARIANT, FLATTEN, and JSON paths. You load JSON into a VARIANT column, read nested keys with dot and bracket notation, then explode arrays into rows with LATERAL FLATTEN. Carry in one rule. A JSON column name is case-insensitive, but a JSON key is case-sensitive.

← Day 40 All 50 Days Day 42 →

About the Author

Abhay Krishnan

Senior Data & AI Consultant

Connect on LinkedIn

With over five years of data engineering experience at EY and Infosys, Abhay Krishnan specializes in building scalable data pipelines and cloud warehousing solutions. He is a certified SnowPro Core professional, alongside credentials in AWS and Azure. Abhay created this 50-day track to solve a problem he faced firsthand: the lack of a structured, free resource for Snowflake certification prep. Follow him on LinkedIn for more data engineering insights.

Snowflake SQL – Aggregate, Window Functions & QUALIFY

Aggregate Functions and the OVER Clause

Two ways to summarise rows

The OVER clause keeps every row

LISTAGG joins values into one string

Ranking and Row-Navigation Functions

Three ways to number rows

Why the tie rule is tested

PARTITION BY restarts the count

LEAD and LAG look across rows

QUALIFY: Filtering on a Window Function

Why WHERE cannot do this

QUALIFY is to windows what HAVING is to groups

Top-N per group in one statement

What the exam expects you to know

Cheat Sheet

Exam Tip

Hands-On Lab

Practice Questions

Snowflake Documentation

Topics