D4: Performance & Querying Week 6

DAY 39

Clustering Keys & Materialized Views for Snowflake Performance

Day 38 added two serverless levers that left the warehouse alone. One offloaded a heavy scan. One built a search access path for point lookups. Today’s two tools change the data itself. A clustering key reorders rows so a query scans fewer micro-partitions. A materialized view stores a precomputed result so a repeated query skips the work entirely. Both run their maintenance on serverless credits. The exam tests when each pays off and which one each edition unlocks.

🗣️ Plain-English First

Five terms carry this sub-objective. The plain reading of two of them points away from the exam answer.

Term	What it sounds like	What it means in Snowflake
Clustering key	An index	A column or expression that tells Snowflake how to co-locate rows in micro-partitions. It is not an index. A table has one clustering key, which can list several columns.
Clustering depth	How deep the data goes	The average number of micro-partitions that overlap on the key values. A smaller average depth means better clustering. Read it as a health score where low is good.
Reclustering	Doing it all again	Snowflake rewriting micro-partitions to restore order after DML degrades it. It deletes and re-inserts the affected rows behind the scenes.
Automatic Clustering	An Enterprise add-on	The background service that reclusters a clustered table for you. It uses serverless compute, needs no warehouse, and runs on Standard Edition.
Materialized view	A saved query	A precomputed result stored like a table and refreshed for you when the base table changes. It needs Enterprise Edition. A plain view stores nothing.

📘

Clustering Keys

What a clustering key does

A clustering key co-locates rows with similar key values in the same micro-partitions. A query that filters on the key then reads fewer micro-partitions. Snowflake skips, or prunes, the ones whose value range cannot match. Less data scanned means a faster, cheaper query.

This matters only for large tables. Snowflake guidance points clustering at tables in the multi-terabyte range. A small table already prunes well from its natural load order. Clustering it adds maintenance cost for no real gain.

Which column to cluster on

The exam tests a fixed priority order. Snowflake documentation ranks the choices like this:

Columns used in selective WHERE filters come first. A date column on a fact table is the classic pick.
Columns used in JOIN predicates come next, if there is room for another key column.
Columns used only in GROUP BY or ORDER BY are usually less helpful. Favor the filter and join columns over these.

Candidates I have trained read GROUP BY in a stem and cluster on that column. The documentation places GROUP BY below WHERE and JOIN. When a question lists a WHERE column and a GROUP BY column, the WHERE column wins.

Cardinality matters too. The key needs enough distinct values to prune well. It also needs few enough that rows still group. A status flag with three values prunes almost nothing. A raw unique id scatters too far, so Snowflake suggests an expression to lower its cardinality.

Reading clustering health

Check how well a table is clustered with SYSTEM$CLUSTERING_INFORMATION. Pass it a table name and a column. It returns a JSON report.

SQL

SELECT SYSTEM$CLUSTERING_INFORMATION('day10_orders', '(O_ORDERDATE)');

The headline field is average_depth. A smaller value means the table is better clustered on that column. The report also lists the micro-partition count, the average overlaps, and notes on whether clustering would help. Read the depth as a score where low is the goal.

Defining a key uses CREATE TABLE ... CLUSTER BY or ALTER TABLE ... CLUSTER BY. Drop it with ALTER TABLE ... DROP CLUSTERING KEY. One note from the docs: a CREATE TABLE AS SELECT does not carry a clustering key over. You add the key after the copy exists.

⚙️

Automatic Clustering

What runs the maintenance

Once a clustering key exists, DML slowly degrades the order. Inserts, updates, and deletes scatter rows across micro-partitions. Automatic Clustering is the service that reclusters the table to restore order. It runs in the background with no action from you.

The service does not use your virtual warehouse. It runs on serverless compute that Snowflake manages, billed by the second. Snowflake reclusters only when the table will benefit, not on a fixed schedule. Heavy DML means more reclustering and a higher bill.

The edition trap

Here is the distinction the exam leans on. Clustering and Automatic Clustering work on Standard Edition. Materialized views and the Search Optimization Service from Day 38 require Enterprise Edition or higher.

In production projects I have worked on, this trap catches people. They assume Automatic Clustering needs Enterprise, because search optimization does. It does not. A stem that puts an account on Standard Edition still allows clustering. It rules out a materialized view.

Performance feature	Edition required	Maintenance cost
Clustering key and Automatic Clustering	Standard Edition or higher	Serverless credits, no warehouse
Materialized views	Enterprise Edition or higher	Serverless credits, no warehouse
Search Optimization Service (Day 38)	Enterprise Edition or higher	Serverless credits plus storage

All three bill as serverless add-ons. The edition gate is the part that separates them on the exam.

🧮

Materialized Views for Performance

What it stores

A materialized view precomputes the result of its query and stores it like a table. A plain view stores nothing and runs its query on every read. The materialized form trades storage and maintenance for faster reads. It behaves more like a table than like a view.

It suits a heavy aggregation that runs often on data that changes slowly. A daily count by status over a large table is a fit. The result is small. The query repeats. The base data is steady.

It stays current on its own

Snowflake maintains the materialized view for you. A background service updates it after the base table changes. You never refresh it by hand. Data you read through it is always current, whatever DML hit the base table.

That refresh runs on serverless credits, like Automatic Clustering. The trade is real. A base table with constant DML drives constant refresh. The maintenance cost can then outweigh the read savings.

The limits that get tested

A materialized view has restrictions a plain view does not. The exam draws distractors straight from this list:

It can query only a single table. Joins, including self-joins, are not supported.
It cannot reference another view, a materialized view, or a user-defined table function.
The SELECT cannot use a HAVING clause, an ORDER BY, or a LIMIT.
Only a limited set of aggregate functions is allowed. Some of those cannot be used as window functions.

The single-table rule is the one a question is most likely to target. If a stem needs a join across two tables, a materialized view is wrong before you read the other options. A plain view or a separate table handles the join.

Need in the stem	Right tool	Why
Repeated heavy aggregation on slow-changing data	Materialized view	Stores the precomputed result, refreshed for you
Large filtered range scan or a join on a key	Clustering key	Prunes micro-partitions on the filtered column (this day)
Selective point lookup on a high-cardinality column	Search optimization	Builds a search access path (Day 38)
Simplify a query, including a join, with no stored result	Plain view	Stores nothing, supports joins, runs on every read

⚡

Cheat Sheet

Concept	What to remember	Exam keyword
Cluster column priority	WHERE filter columns first, then JOIN columns. GROUP BY and ORDER BY are less helpful	“WHERE before JOIN”
When to cluster	Large multi-terabyte tables with selective filters. Not small tables	“very large table”
Clustering depth	Average overlap on the key. Smaller is better clustered	“lower depth, better”
SYSTEM$CLUSTERING_INFORMATION	Reports average depth and clustering health as JSON	“clustering information”
Automatic Clustering edition	Standard Edition. Serverless, no warehouse	“Standard”
Materialized view edition	Enterprise Edition or higher	“Enterprise”
Materialized view refresh	Maintained automatically. Reads are always current	“always current”
Materialized view single-table rule	One base table only. No joins, no self-joins	“single table”
Materialized view vs plain view	Materialized stores the result. Plain view stores nothing	“precomputed”
Cost model	Clustering and materialized view maintenance bill as serverless credits	“serverless”

🎯

Exam Tip

🎯 Exam Tip

The cluster-column order is a known item. A question hands you a Query Profile and asks which column to cluster on. The answer ranks WHERE filter columns first and JOIN columns second. A GROUP BY or ORDER BY column is a distractor placed to catch the candidate who clusters on whatever appears in the query. Pick the filter column.

The edition split is the second trap. Clustering and Automatic Clustering run on Standard Edition. Materialized views need Enterprise. If a stem sets the scene on Standard Edition, a materialized view option is wrong. A clustering option is still valid.

The single-table rule decides most materialized view questions. A materialized view reads one base table and supports no joins. A stem that needs two tables joined points to a plain view or a precomputed table, never a materialized view.

🛠️

Hands-On Lab

Type: LAB (reads a shared table, builds one throwaway copy and one materialized view) | Time: ~15 minutes | Credits: <0.3 | Prerequisite: an Enterprise-edition trial (materialized views need Enterprise), the shared lab_xs warehouse, and the day10_orders table from Day 10. This lab reads day10_orders for one report only. It never alters it. All clustering and view work happens on a private copy.

Read clustering health on the shared table. This function only reports. It changes nothing, so it is safe to run against day10_orders.

SQL

USE WAREHOUSE lab_xs;

SELECT SYSTEM$CLUSTERING_INFORMATION('day10_orders', '(O_ORDERDATE)');

👀 Observe: Read average_depth in the JSON. On a table loaded in roughly date order, the depth on O_ORDERDATE is already low. That is the point: natural load order can cluster a column well before you define any key.

Make a private copy to experiment on. A CREATE TABLE AS SELECT does not carry a clustering key, so the copy starts with none.

SQL

CREATE OR REPLACE TABLE clust_demo AS
SELECT * FROM day10_orders;

👀 Observe: The table clust_demo is a throwaway you own. Every change from here lands on this copy. The shared day10_orders table is untouched.

Define a clustering key on the copy. Cluster on the column you would filter in a WHERE clause.

SQL

ALTER TABLE clust_demo CLUSTER BY (O_ORDERDATE);

SELECT SYSTEM$CLUSTERING_INFORMATION('clust_demo', '(O_ORDERDATE)');

👀 Observe: Reclustering runs in the background and is not instant. On a copy this small the depth barely moves, because the data was already ordered. A dramatic drop needs a large, disordered table. The workflow is the lesson here, not the number.

Build a materialized view on the copy. Precompute a count by status, a small result over a large table.

SQL

CREATE MATERIALIZED VIEW mv_demo AS
SELECT O_ORDERSTATUS, COUNT(*) AS order_count
FROM clust_demo
GROUP BY O_ORDERSTATUS;

👀 Observe: If this fails with an edition error, your trial is not Enterprise. That failure is the edition rule in action. Materialized views need Enterprise, clustering does not.

Query the view and confirm it stays current. Read it, change the base, then read it again.

SQL

SELECT * FROM mv_demo ORDER BY order_count DESC;

DELETE FROM clust_demo WHERE O_ORDERSTATUS = 'P';

SELECT * FROM mv_demo ORDER BY order_count DESC;

👀 Observe: The second read drops the ‘P’ rows. You never refreshed the view. Snowflake updated it for you after the DELETE. That is why the read is always current.

List the materialized view and check its source. SHOW reports the base table behind each view.

SQL

SHOW MATERIALIZED VIEWS LIKE 'mv_demo';

👀 Observe: The output names clust_demo as the single base table. The single-table link is visible right there in the catalog.

✓

Cleanup. Drop the view and the copy, then suspend the warehouse. The shared table needs no cleanup because nothing touched it.

SQL

DROP MATERIALIZED VIEW IF EXISTS mv_demo;
DROP TABLE IF EXISTS clust_demo;
ALTER WAREHOUSE lab_xs SUSPEND;
-- Keep lab_xs itself. It is the shared warehouse reused across days.
-- day10_orders is read-only in this lab and is left as is.

❓

Practice Questions

Q1: Based on a review of a Query Profile, which scenarios will benefit the MOST from the use of a data clustering key? (Select TWO.) ▸

Options:

A. A column that appears most frequently in ORDER BY operations
B. A column that appears most frequently in WHERE operations
C. A column that appears most frequently in GROUP BY operations
D. A column that appears most frequently in AGGREGATE operations
E. A column that appears most frequently in JOIN operations

✅ Answer: B, E

Why B: Snowflake ranks columns used in selective WHERE filters as the first choice for a clustering key. Filtering on the key is what drives micro-partition pruning.

Why E: JOIN predicate columns are the next priority when there is room for another key column.

Why not A: ORDER BY columns are usually less helpful than filter or join columns, per the docs.

Why not C: GROUP BY sits below WHERE and JOIN in the priority order. It is the classic distractor here.

Why not D: A column used only in an aggregate is not a primary clustering driver.

Q2: An account runs on Standard Edition. Which performance feature can it use as described? ▸

Options:

A. A materialized view that precomputes a daily aggregation
B. The Search Optimization Service for point lookups
C. A clustering key with Automatic Clustering maintenance
D. None of these, because all three need Enterprise Edition

✅ Answer: C

Why C: Clustering keys and the Automatic Clustering service work on Standard Edition. They run on serverless compute and need no edition upgrade.

Why not A: Materialized views require Enterprise Edition or higher. Standard cannot create one.

Why not B: Search optimization also requires Enterprise Edition. Re-read Day 38 if missed.

Why not D: This answer assumes all three share the Enterprise gate. Clustering is the exception that breaks the pattern.

Q3: A 200 MB lookup table is queried often and rewritten in full every night. A developer wants to add a clustering key to speed up reads. What is the BEST assessment? ▸

Options:

A. Clustering is a good fit because the table is queried often
B. Clustering is a poor fit because the table is small and heavily rewritten
C. Clustering will reduce the table’s storage cost
D. Clustering is required before a materialized view can be built on the table

✅ Answer: B

Why B: Clustering targets very large, multi-terabyte tables with selective filters. A small table prunes well on its own. Nightly full rewrites would also force constant reclustering cost.

Why not A: Query frequency alone does not justify clustering. Table size and filter selectivity decide it.

Why not C: Clustering aims at query pruning, not storage reduction. Reclustering can even add a little storage.

Why not D: A materialized view has no clustering prerequisite. The two features are independent.

Q4: A report joins an orders table to a customers table and runs many times a day. A team proposes a materialized view to precompute the joined result. Why will this not work? ▸

Options:

A. Materialized views cannot be queried more than once per day
B. A materialized view can reference only a single base table and does not support joins
C. Materialized views do not support COUNT or SUM
D. Materialized views require the base tables to be clustered first

✅ Answer: B

Why B: A materialized view reads exactly one base table. Joins, including self-joins, are not supported. A two-table join rules it out at once.

Why not A: There is no per-day query limit. Frequent reads are the use case materialized views are built for.

Why not C: A limited set of aggregates is supported. Common ones like COUNT work. The join is the blocker, not the aggregate.

Why not D: Materialized views have no clustering prerequisite.

Q5: A dashboard runs the same heavy aggregation over a single large table whose data changes only once a day. The team wants the fastest reads without refreshing anything by hand. Which approach fits BEST? ▸

Options:

A. A plain (non-materialized) view over the aggregation
B. A materialized view over the aggregation
C. The Search Optimization Service on the table
D. A larger warehouse for the dashboard

✅ Answer: B

Why B: A repeated aggregation on a single, slow-changing table is the materialized view use case. The result is precomputed and stays current through automatic serverless refresh.

Why not A: A plain view stores nothing and reruns the full aggregation on every read. It saves no compute.

Why not C: Search optimization targets selective point lookups, not a full aggregation. Re-read Day 38 if missed.

Why not D: A bigger warehouse runs the same work faster but still reruns it every time. It precomputes nothing.

❄️

Snowflake Documentation

Official docs for today’s topics. The exam pulls directly from these.

❄️

Clustering Keys and Clustered Tables

docs.snowflake.com/en/user-guide/tables-clustering-keys

→

❄️

Automatic Clustering

docs.snowflake.com/en/user-guide/tables-auto-reclustering

→

❄️

Working with Materialized Views

docs.snowflake.com/en/user-guide/views-materialized

→

❄️

Optimizing Storage for Performance

docs.snowflake.com/en/user-guide/performance-query-storage

→

📝 Recap

Today you learned: two ways to change the data so a query does less work. A clustering key co-locates rows so a filtered query prunes micro-partitions. The priority order is WHERE filter columns first, then JOIN columns. GROUP BY and ORDER BY columns are less helpful. Clustering depth scores the result, where a smaller average depth is better. Automatic Clustering maintains the order on serverless compute and runs on Standard Edition. A materialized view precomputes a result over a single table, stays current through automatic refresh, and needs Enterprise Edition.

Key takeaway: match the tool to the stem. A large filtered scan or a join points to a clustering key. A repeated aggregation on slow-changing single-table data points to a materialized view. Read the edition in the stem: clustering works on Standard, materialized views need Enterprise. A join in the stem rules a materialized view out, because it reads only one base table.

Tomorrow (Day 40): Caching: Result, Metadata, and Warehouse. You meet the three cache layers and when each one answers a query for free. Carry in two rules. The result cache lasts 24 hours from last use. The warehouse cache is cleared when the warehouse suspends.

← Day 38 All 50 Days Day 40 →

About the Author

Abhay Krishnan

Senior Data & AI Consultant

Connect on LinkedIn

With over five years of data engineering experience at EY and Infosys, Abhay Krishnan specializes in building scalable data pipelines and cloud warehousing solutions. He is a certified SnowPro Core professional, alongside credentials in AWS and Azure. Abhay created this 50-day track to solve a problem he faced firsthand: the lack of a structured, free resource for Snowflake certification prep. Follow him on LinkedIn for more data engineering insights.

Clustering Keys & Materialized Views for Snowflake Performance

Clustering Keys

What a clustering key does

Which column to cluster on

Reading clustering health

Automatic Clustering

What runs the maintenance

The edition trap

Materialized Views for Performance

What it stores

It stays current on its own

The limits that get tested

Cheat Sheet

Exam Tip

Hands-On Lab

Practice Questions

Snowflake Documentation

Topics