Clustering Keys & Materialized Views for Snowflake Performance
Day 38 added two serverless levers that left the warehouse alone. One offloaded a heavy scan. One built a search access path for point lookups. Today’s two tools change the data itself. A clustering key reorders rows so a query scans fewer micro-partitions. A materialized view stores a precomputed result so a repeated query skips the work entirely. Both run their maintenance on serverless credits. The exam tests when each pays off and which one each edition unlocks.
Five terms carry this sub-objective. The plain reading of two of them points away from the exam answer.
| Term | What it sounds like | What it means in Snowflake |
|---|---|---|
| Clustering key | An index | A column or expression that tells Snowflake how to co-locate rows in micro-partitions. It is not an index. A table has one clustering key, which can list several columns. |
| Clustering depth | How deep the data goes | The average number of micro-partitions that overlap on the key values. A smaller average depth means better clustering. Read it as a health score where low is good. |
| Reclustering | Doing it all again | Snowflake rewriting micro-partitions to restore order after DML degrades it. It deletes and re-inserts the affected rows behind the scenes. |
| Automatic Clustering | An Enterprise add-on | The background service that reclusters a clustered table for you. It uses serverless compute, needs no warehouse, and runs on Standard Edition. |
| Materialized view | A saved query | A precomputed result stored like a table and refreshed for you when the base table changes. It needs Enterprise Edition. A plain view stores nothing. |
Clustering Keys
What a clustering key does
A clustering key co-locates rows with similar key values in the same micro-partitions. A query that filters on the key then reads fewer micro-partitions. Snowflake skips, or prunes, the ones whose value range cannot match. Less data scanned means a faster, cheaper query.
This matters only for large tables. Snowflake guidance points clustering at tables in the multi-terabyte range. A small table already prunes well from its natural load order. Clustering it adds maintenance cost for no real gain.
Which column to cluster on
The exam tests a fixed priority order. Snowflake documentation ranks the choices like this:
- Columns used in selective WHERE filters come first. A date column on a fact table is the classic pick.
- Columns used in JOIN predicates come next, if there is room for another key column.
- Columns used only in GROUP BY or ORDER BY are usually less helpful. Favor the filter and join columns over these.
Candidates I have trained read GROUP BY in a stem and cluster on that column. The documentation places GROUP BY below WHERE and JOIN. When a question lists a WHERE column and a GROUP BY column, the WHERE column wins.
Cardinality matters too. The key needs enough distinct values to prune well. It also needs few enough that rows still group. A status flag with three values prunes almost nothing. A raw unique id scatters too far, so Snowflake suggests an expression to lower its cardinality.
Reading clustering health
Check how well a table is clustered with SYSTEM$CLUSTERING_INFORMATION. Pass it a table name and a column. It returns a JSON report.
SELECT SYSTEM$CLUSTERING_INFORMATION('day10_orders', '(O_ORDERDATE)');The headline field is average_depth. A smaller value means the table is better clustered on that column. The report also lists the micro-partition count, the average overlaps, and notes on whether clustering would help. Read the depth as a score where low is the goal.
Defining a key uses CREATE TABLE ... CLUSTER BY or ALTER TABLE ... CLUSTER BY. Drop it with ALTER TABLE ... DROP CLUSTERING KEY. One note from the docs: a CREATE TABLE AS SELECT does not carry a clustering key over. You add the key after the copy exists.
Automatic Clustering
What runs the maintenance
Once a clustering key exists, DML slowly degrades the order. Inserts, updates, and deletes scatter rows across micro-partitions. Automatic Clustering is the service that reclusters the table to restore order. It runs in the background with no action from you.
The service does not use your virtual warehouse. It runs on serverless compute that Snowflake manages, billed by the second. Snowflake reclusters only when the table will benefit, not on a fixed schedule. Heavy DML means more reclustering and a higher bill.
The edition trap
Here is the distinction the exam leans on. Clustering and Automatic Clustering work on Standard Edition. Materialized views and the Search Optimization Service from Day 38 require Enterprise Edition or higher.
In production projects I have worked on, this trap catches people. They assume Automatic Clustering needs Enterprise, because search optimization does. It does not. A stem that puts an account on Standard Edition still allows clustering. It rules out a materialized view.
| Performance feature | Edition required | Maintenance cost |
|---|---|---|
| Clustering key and Automatic Clustering | Standard Edition or higher | Serverless credits, no warehouse |
| Materialized views | Enterprise Edition or higher | Serverless credits, no warehouse |
| Search Optimization Service (Day 38) | Enterprise Edition or higher | Serverless credits plus storage |
All three bill as serverless add-ons. The edition gate is the part that separates them on the exam.
Materialized Views for Performance
What it stores
A materialized view precomputes the result of its query and stores it like a table. A plain view stores nothing and runs its query on every read. The materialized form trades storage and maintenance for faster reads. It behaves more like a table than like a view.
It suits a heavy aggregation that runs often on data that changes slowly. A daily count by status over a large table is a fit. The result is small. The query repeats. The base data is steady.
It stays current on its own
Snowflake maintains the materialized view for you. A background service updates it after the base table changes. You never refresh it by hand. Data you read through it is always current, whatever DML hit the base table.
That refresh runs on serverless credits, like Automatic Clustering. The trade is real. A base table with constant DML drives constant refresh. The maintenance cost can then outweigh the read savings.
The limits that get tested
A materialized view has restrictions a plain view does not. The exam draws distractors straight from this list:
- It can query only a single table. Joins, including self-joins, are not supported.
- It cannot reference another view, a materialized view, or a user-defined table function.
- The SELECT cannot use a HAVING clause, an ORDER BY, or a LIMIT.
- Only a limited set of aggregate functions is allowed. Some of those cannot be used as window functions.
The single-table rule is the one a question is most likely to target. If a stem needs a join across two tables, a materialized view is wrong before you read the other options. A plain view or a separate table handles the join.
| Need in the stem | Right tool | Why |
|---|---|---|
| Repeated heavy aggregation on slow-changing data | Materialized view | Stores the precomputed result, refreshed for you |
| Large filtered range scan or a join on a key | Clustering key | Prunes micro-partitions on the filtered column (this day) |
| Selective point lookup on a high-cardinality column | Search optimization | Builds a search access path (Day 38) |
| Simplify a query, including a join, with no stored result | Plain view | Stores nothing, supports joins, runs on every read |
Cheat Sheet
| Concept | What to remember | Exam keyword |
|---|---|---|
| Cluster column priority | WHERE filter columns first, then JOIN columns. GROUP BY and ORDER BY are less helpful | “WHERE before JOIN” |
| When to cluster | Large multi-terabyte tables with selective filters. Not small tables | “very large table” |
| Clustering depth | Average overlap on the key. Smaller is better clustered | “lower depth, better” |
| SYSTEM$CLUSTERING_INFORMATION | Reports average depth and clustering health as JSON | “clustering information” |
| Automatic Clustering edition | Standard Edition. Serverless, no warehouse | “Standard” |
| Materialized view edition | Enterprise Edition or higher | “Enterprise” |
| Materialized view refresh | Maintained automatically. Reads are always current | “always current” |
| Materialized view single-table rule | One base table only. No joins, no self-joins | “single table” |
| Materialized view vs plain view | Materialized stores the result. Plain view stores nothing | “precomputed” |
| Cost model | Clustering and materialized view maintenance bill as serverless credits | “serverless” |
Exam Tip
The cluster-column order is a known item. A question hands you a Query Profile and asks which column to cluster on. The answer ranks WHERE filter columns first and JOIN columns second. A GROUP BY or ORDER BY column is a distractor placed to catch the candidate who clusters on whatever appears in the query. Pick the filter column.
The edition split is the second trap. Clustering and Automatic Clustering run on Standard Edition. Materialized views need Enterprise. If a stem sets the scene on Standard Edition, a materialized view option is wrong. A clustering option is still valid.
The single-table rule decides most materialized view questions. A materialized view reads one base table and supports no joins. A stem that needs two tables joined points to a plain view or a precomputed table, never a materialized view.
Hands-On Lab
lab_xs warehouse, and the day10_orders table from Day 10. This lab reads day10_orders for one report only. It never alters it. All clustering and view work happens on a private copy.Read clustering health on the shared table. This function only reports. It changes nothing, so it is safe to run against day10_orders.
USE WAREHOUSE lab_xs;
SELECT SYSTEM$CLUSTERING_INFORMATION('day10_orders', '(O_ORDERDATE)');average_depth in the JSON. On a table loaded in roughly date order, the depth on O_ORDERDATE is already low. That is the point: natural load order can cluster a column well before you define any key.Make a private copy to experiment on. A CREATE TABLE AS SELECT does not carry a clustering key, so the copy starts with none.
CREATE OR REPLACE TABLE clust_demo AS
SELECT * FROM day10_orders;clust_demo is a throwaway you own. Every change from here lands on this copy. The shared day10_orders table is untouched.Define a clustering key on the copy. Cluster on the column you would filter in a WHERE clause.
ALTER TABLE clust_demo CLUSTER BY (O_ORDERDATE);
SELECT SYSTEM$CLUSTERING_INFORMATION('clust_demo', '(O_ORDERDATE)');Build a materialized view on the copy. Precompute a count by status, a small result over a large table.
CREATE MATERIALIZED VIEW mv_demo AS
SELECT O_ORDERSTATUS, COUNT(*) AS order_count
FROM clust_demo
GROUP BY O_ORDERSTATUS;Query the view and confirm it stays current. Read it, change the base, then read it again.
SELECT * FROM mv_demo ORDER BY order_count DESC;
DELETE FROM clust_demo WHERE O_ORDERSTATUS = 'P';
SELECT * FROM mv_demo ORDER BY order_count DESC;List the materialized view and check its source. SHOW reports the base table behind each view.
SHOW MATERIALIZED VIEWS LIKE 'mv_demo';clust_demo as the single base table. The single-table link is visible right there in the catalog.Cleanup. Drop the view and the copy, then suspend the warehouse. The shared table needs no cleanup because nothing touched it.
DROP MATERIALIZED VIEW IF EXISTS mv_demo;
DROP TABLE IF EXISTS clust_demo;
ALTER WAREHOUSE lab_xs SUSPEND;
-- Keep lab_xs itself. It is the shared warehouse reused across days.
-- day10_orders is read-only in this lab and is left as is.Practice Questions
Options:
A. A column that appears most frequently in ORDER BY operations
B. A column that appears most frequently in WHERE operations
C. A column that appears most frequently in GROUP BY operations
D. A column that appears most frequently in AGGREGATE operations
E. A column that appears most frequently in JOIN operations
Why B: Snowflake ranks columns used in selective WHERE filters as the first choice for a clustering key. Filtering on the key is what drives micro-partition pruning.
Why E: JOIN predicate columns are the next priority when there is room for another key column.
Why not A: ORDER BY columns are usually less helpful than filter or join columns, per the docs.
Why not C: GROUP BY sits below WHERE and JOIN in the priority order. It is the classic distractor here.
Why not D: A column used only in an aggregate is not a primary clustering driver.
Options:
A. A materialized view that precomputes a daily aggregation
B. The Search Optimization Service for point lookups
C. A clustering key with Automatic Clustering maintenance
D. None of these, because all three need Enterprise Edition
Why C: Clustering keys and the Automatic Clustering service work on Standard Edition. They run on serverless compute and need no edition upgrade.
Why not A: Materialized views require Enterprise Edition or higher. Standard cannot create one.
Why not B: Search optimization also requires Enterprise Edition. Re-read Day 38 if missed.
Why not D: This answer assumes all three share the Enterprise gate. Clustering is the exception that breaks the pattern.
Options:
A. Clustering is a good fit because the table is queried often
B. Clustering is a poor fit because the table is small and heavily rewritten
C. Clustering will reduce the table’s storage cost
D. Clustering is required before a materialized view can be built on the table
Why B: Clustering targets very large, multi-terabyte tables with selective filters. A small table prunes well on its own. Nightly full rewrites would also force constant reclustering cost.
Why not A: Query frequency alone does not justify clustering. Table size and filter selectivity decide it.
Why not C: Clustering aims at query pruning, not storage reduction. Reclustering can even add a little storage.
Why not D: A materialized view has no clustering prerequisite. The two features are independent.
Options:
A. Materialized views cannot be queried more than once per day
B. A materialized view can reference only a single base table and does not support joins
C. Materialized views do not support COUNT or SUM
D. Materialized views require the base tables to be clustered first
Why B: A materialized view reads exactly one base table. Joins, including self-joins, are not supported. A two-table join rules it out at once.
Why not A: There is no per-day query limit. Frequent reads are the use case materialized views are built for.
Why not C: A limited set of aggregates is supported. Common ones like COUNT work. The join is the blocker, not the aggregate.
Why not D: Materialized views have no clustering prerequisite.
Options:
A. A plain (non-materialized) view over the aggregation
B. A materialized view over the aggregation
C. The Search Optimization Service on the table
D. A larger warehouse for the dashboard
Why B: A repeated aggregation on a single, slow-changing table is the materialized view use case. The result is precomputed and stays current through automatic serverless refresh.
Why not A: A plain view stores nothing and reruns the full aggregation on every read. It saves no compute.
Why not C: Search optimization targets selective point lookups, not a full aggregation. Re-read Day 38 if missed.
Why not D: A bigger warehouse runs the same work faster but still reruns it every time. It precomputes nothing.
Snowflake Documentation
Official docs for today’s topics. The exam pulls directly from these.
Today you learned: two ways to change the data so a query does less work. A clustering key co-locates rows so a filtered query prunes micro-partitions. The priority order is WHERE filter columns first, then JOIN columns. GROUP BY and ORDER BY columns are less helpful. Clustering depth scores the result, where a smaller average depth is better. Automatic Clustering maintains the order on serverless compute and runs on Standard Edition. A materialized view precomputes a result over a single table, stays current through automatic refresh, and needs Enterprise Edition.
Key takeaway: match the tool to the stem. A large filtered scan or a join points to a clustering key. A repeated aggregation on slow-changing single-table data points to a materialized view. Read the edition in the stem: clustering works on Standard, materialized views need Enterprise. A join in the stem rules a materialized view out, because it reads only one base table.
Tomorrow (Day 40): Caching: Result, Metadata, and Warehouse. You meet the three cache layers and when each one answers a query for free. Carry in two rules. The result cache lasts 24 hours from last use. The warehouse cache is cleared when the warehouse suspends.
