D3: Data Loading Week 5

DAY 33

Snowflake Streams & Tasks – Change Data Capture & Scheduling

Day 32 loaded files as they arrived. Today tracks what changes after the data lands. A stream records the inserts, updates, and deletes made to a table, without copying the data. A task runs one SQL statement on a schedule, or only when a stream has new rows. Put together, they automate change data capture in pure SQL. Two facts carry most of today’s questions: a stream stores an offset rather than data. Only DML that consumes a stream moves that offset.

🗣️ Plain-English First

Five words anchor today’s topic. Each one hides a fact the stems assume you know. Read this table before the concepts. The offset row is the one the exam returns to most.

Word you know	What it usually means	What it means in Snowflake
Stream	Water flowing in a channel	An object that tracks the row changes made to a source table since you last read it. It stores no table data.
Offset	An amount that cancels another	A bookmark on the source table’s version timeline. The stream returns changes that happened after this point.
Consume	To use something up	To read a stream inside a DML statement. Consuming a stream moves its offset forward and clears what it returned.
Task	A job to be done	An object that runs one SQL statement on a schedule, after another task, or when a stream has change data.
Root task	(not common usage)	The single starting task in a task graph. It holds the schedule. Every other task in the graph runs after a predecessor.

Why the offset row matters today: a stream does not hold changed rows in storage. It holds a pointer. A stem that says “the stream stores the changed data” is false. Several questions rest on that one distinction.

📘

Today’s Concept

Micro-Concept 1: A Stream Tracks Change, It Does Not Store It

A stream is a Snowflake object built on a source table. It records the DML changes to that table: inserts, updates, and deletes. You query it like a table to see what changed.

The point candidates miss is what a stream holds in storage. The answer is almost nothing. A stream stores an offset, a pointer to a version of the source table. When you query the stream, Snowflake reads the source table’s own version history. It returns the rows that changed after the offset.

This design has a direct consequence. A stream consumes no meaningful storage of its own. It also means the source table must keep enough version history for the stream to read. That history comes from Time Travel retention, the same retention concept from Day 11.

A stream returns the source columns plus three metadata columns. They describe each change. The next section reads them in the lab.

Micro-Concept 2: The Three Stream Types and the Metadata Columns

Streams come in three types. The exam tests which one tracks what. It also tests which objects each supports.

A standard stream tracks all DML: inserts, updates, and deletes, including a table truncate. This is the default and the full change data capture option. It works on standard tables, directory tables, and views.

An append-only stream tracks inserts only. Updates and deletes are not recorded. You set APPEND_ONLY = TRUE at creation. It runs faster for insert-only pipelines, because deletes add no overhead. It also supports standard tables, directory tables, and views.

An insert-only stream is the narrow one. It is supported for streams on external tables and externally managed Iceberg tables only. It records new rows from new files. It does not record a file removal as a delete. A stem that pairs insert-only with a standard table is wrong on the object type.

Stream type	What it records	Supported on
Standard (default)	Inserts, updates, deletes, truncates	Tables, directory tables, views
Append-only	Inserts only	Tables, directory tables, views
Insert-only	Inserts only, from new files	External and externally managed Iceberg tables

Every stream adds three metadata columns to the source columns. METADATA$ACTION holds INSERT or DELETE. METADATA$ISUPDATE is TRUE when the row is part of an update. METADATA$ROW_ID is a stable id for the row across changes.

One detail trips people in the lab and on the exam. Snowflake records an update as a pair of rows: a DELETE of the old values and an INSERT of the new ones. Both carry METADATA$ISUPDATE = TRUE. A plain insert shows METADATA$ACTION = INSERT with METADATA$ISUPDATE = FALSE.

SQL

-- Standard stream: full CDC on a table
CREATE STREAM cdc_stream ON TABLE cdc_source;

-- Append-only stream: inserts only
CREATE STREAM ins_stream ON TABLE cdc_source APPEND_ONLY = TRUE;

Micro-Concept 3: The Offset Only Moves When DML Consumes the Stream

This is the single most tested stream fact. Reading a stream with a plain SELECT does not move its offset. You can query it as many times as you like and see the same changes each time.

The offset advances only when a DML statement consumes the stream. That means an INSERT, MERGE, UPDATE, DELETE, or CREATE TABLE AS that reads from the stream. After that statement commits, the offset jumps to the current version. The stream then looks empty until the next change arrives.

A stem that says “querying a stream advances its offset” is false. Hold the rule this way: SELECT looks, DML consumes. Only consuming moves the bookmark.

A stream can also go stale. This happens if its offset falls outside the source table’s data retention window. To prevent that, Snowflake extends the retention to cover an unconsumed stream’s offset, up to 14 days by default. A stale stream returns no data and must be recreated. The fix is to consume a stream regularly, well inside its retention window.

Micro-Concept 4: Tasks Run SQL on a Schedule or on a Stream Gate

A task runs one SQL statement or one procedure call. It runs on a schedule you define, or after another task, or only when a stream holds change data. A task is the engine that drains a stream on a cadence.

Two facts about task compute appear often. If you set the WAREHOUSE parameter, the task is user-managed and runs on that virtual warehouse. If you omit WAREHOUSE, the task is serverless and runs on Snowflake-managed compute. That serverless option is the same idea as the serverless loading from Day 32.

A new task starts suspended. You enable it with ALTER TASK ... RESUME. To run it once by hand, use EXECUTE TASK. The minimum schedule interval is one minute.

The stream gate is the WHEN clause with SYSTEM$STREAM_HAS_DATA. Snowflake validates this condition in the Cloud Services layer, before any warehouse starts. If the stream has no change data, the run is skipped. A skipped run avoids the warehouse cost entirely. This pairing is the standard change data capture loop. A task wakes on schedule and checks the gate. It does work only when the stream has changes.

SQL

-- Run only when the stream has change data
CREATE TASK cdc_task
  WAREHOUSE = lab_xs
  SCHEDULE = '1 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('cdc_stream')
  AS
  INSERT INTO cdc_target
    SELECT id, name, status
    FROM cdc_stream
    WHERE METADATA$ACTION = 'INSERT';

-- Serverless variant: omit WAREHOUSE
-- CREATE TASK cdc_task
--   USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
--   SCHEDULE = '1 MINUTE' ...

Micro-Concept 5: Task Graphs Chain Many Tasks

One task runs one statement. Real pipelines need several steps in order. A task graph, also called a DAG, chains tasks into a dependency tree.

A graph has exactly one root task. The root task holds the schedule. Every child task uses the AFTER parameter to name one or more predecessors. A child has no schedule of its own. It runs after its predecessors finish successfully. The graph flows one direction only, with no loops.

Two limits and one rule show up in questions. A graph allows up to 1,000 tasks. Each task allows up to 100 predecessors and 100 child tasks. All tasks in a graph must share one owner role. To change any task in a running graph, suspend the root task first, then edit, then resume.

Resuming the whole graph by hand is tedious. SYSTEM$TASK_DEPENDENTS_ENABLE resumes the root and all of its children in one call. In training sessions I have run, the most common task-graph mistake is editing a child while the root is still resumed. Snowflake blocks the edit until the root is suspended.

⚡

Cheat Sheet

Concept	What to remember	Exam keyword
Stream storage	Stores an offset, not the changed data. Reads the source version history	“stream = offset, not data”
Offset advance	`SELECT` does not move it. Only DML consuming the stream does	“DML consumes, SELECT looks”
Standard stream	Inserts, updates, deletes, truncates. The default, full CDC	“standard = all DML”
Append-only	Inserts only. `APPEND_ONLY = TRUE`. Tables, directory tables, views	“append-only = inserts”
Insert-only	Inserts only, for external and externally managed Iceberg tables	“insert-only = external tables”
Update in a stream	A DELETE plus an INSERT, both with `METADATA$ISUPDATE = TRUE`	“update = delete + insert”
Stale stream	Offset outside retention. Snowflake extends up to 14 days by default	“stale = past retention, 14-day extend”
Task compute	Set `WAREHOUSE` = user-managed. Omit it = serverless	“omit warehouse = serverless”
Task state	Created suspended. `ALTER TASK ... RESUME` to enable. `EXECUTE TASK` runs once	“starts suspended”
Stream gate	`WHEN SYSTEM$STREAM_HAS_DATA(...)`. FALSE skips the run, no warehouse cost	“WHEN false = skip”
Task graph (DAG)	One scheduled root, children via `AFTER`. Max 1,000 tasks, one owner role	“one root, AFTER for children”

🎯

Exam Tip

🎯 Exam Tip

Domain 3 stream and task questions cluster around four claims. Each is true or false on one fact.

First: “The stream stores the changed data.” False. A stream stores an offset and reads the source table’s version history. No row data lives in the stream.

Second: “Querying a stream advances its offset.” False. A plain SELECT only looks. The offset moves only when DML consumes the stream, such as an INSERT or MERGE that reads from it.

Third: “A serverless task needs a warehouse you size.” False. Omit the WAREHOUSE parameter and the task is serverless. Set the parameter and it is user-managed. Same fact as Snowpipe from Day 32: serverless means no warehouse to size.

Fourth: “A child task carries its own schedule.” False. In a task graph, only the root task has a schedule. Children run after a predecessor through the AFTER parameter. If a stem gives a child its own schedule, it is wrong.

🛠️

Hands-On Lab

Type: LAB (guided) | Time: ~20 minutes | Credits: <0.5 | Prerequisite: lab_xs warehouse (Day 1). This lab builds its own small tables. It does not touch day10_orders. We use tiny data so the stream contents are readable row by row. You will watch an offset move only when DML consumes the stream.

Create the source table and seed two rows. This is the table the stream will watch.

SQL

USE WAREHOUSE lab_xs;

CREATE OR REPLACE TABLE cdc_source (
  id     INT,
  name   STRING,
  status STRING
);

INSERT INTO cdc_source VALUES
  (1, 'Alice', 'active'),
  (2, 'Bob',   'active');

SELECT * FROM cdc_source ORDER BY id;

👀 Observe: two rows, Alice and Bob, both active. The stream does not exist yet, so these two rows are the starting state.

Create a standard stream and read it once. The offset is set at creation, so the stream starts empty.

SQL

CREATE OR REPLACE STREAM cdc_stream ON TABLE cdc_source;

SELECT * FROM cdc_stream;

👀 Observe: the stream returns zero rows. It captured the offset at creation. The two existing rows are part of the baseline, not a change after the offset.

Make a change: one insert and one update. Now the source has DML after the offset.

SQL

INSERT INTO cdc_source VALUES (3, 'Charlie', 'active');

UPDATE cdc_source SET status = 'inactive' WHERE id = 1;

SELECT *, METADATA$ACTION, METADATA$ISUPDATE
  FROM cdc_stream
  ORDER BY id;

👀 Observe: Charlie shows one row with METADATA$ACTION = INSERT and METADATA$ISUPDATE = FALSE. The update on Alice shows as two rows: a DELETE of the old value and an INSERT of the new value, both with METADATA$ISUPDATE = TRUE. This SELECT read the stream but did not move its offset.

Prove that a query does not advance the offset. Run the same read again.

SQL

SELECT *, METADATA$ACTION, METADATA$ISUPDATE
  FROM cdc_stream
  ORDER BY id;

SELECT SYSTEM$STREAM_HAS_DATA('cdc_stream') AS has_changes;

👀 Observe: the same change rows appear again. The offset has not moved. SYSTEM$STREAM_HAS_DATA returns TRUE. That is the gate a task will check.

Create the target table and the task. The task copies new inserts into the target when the stream has data.

SQL

CREATE OR REPLACE TABLE cdc_target AS
  SELECT * FROM cdc_source WHERE 1 = 0;

CREATE OR REPLACE TASK cdc_task
  WAREHOUSE = lab_xs
  SCHEDULE = '1 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('cdc_stream')
  AS
  INSERT INTO cdc_target
    SELECT id, name, status
    FROM cdc_stream
    WHERE METADATA$ACTION = 'INSERT';

ALTER TASK cdc_task RESUME;

👀 Observe: cdc_target is created empty, with the same columns as the source. The task is created suspended, then resumed. It will fire within a minute because the gate is TRUE.

Run the task now and check the result. EXECUTE TASK triggers one run instead of waiting for the schedule. The task consumes the stream. That advances the offset.

SQL

EXECUTE TASK cdc_task;

-- Give the run a few seconds, then check
SELECT * FROM cdc_target ORDER BY id;

SELECT SYSTEM$STREAM_HAS_DATA('cdc_stream') AS has_changes;

👀 Observe: cdc_target now holds the rows with METADATA$ACTION = INSERT: Charlie and Alice’s updated row, since an update writes a new INSERT row. After the task consumes the stream, SYSTEM$STREAM_HAS_DATA returns FALSE. The offset has moved, so the next scheduled run skips. If the target is still empty, wait a few seconds and re-run the SELECT; the task is asynchronous.

Clean up. Suspend the task before dropping anything.

SQL

ALTER TASK cdc_task SUSPEND;

DROP TASK   IF EXISTS cdc_task;
DROP STREAM IF EXISTS cdc_stream;
DROP TABLE  IF EXISTS cdc_source;
DROP TABLE  IF EXISTS cdc_target;

👀 Observe: every object built in this lab is gone. Nothing here read or changed day10_orders, so the persistent Day 10 table is untouched for later days.

📚

Snowflake Docs

Authoritative references for every fact in today’s post. The streams introduction and the tasks introduction are the two worth re-reading the day before the exam.

❄️

Introduction to Streams: Offsets, Types, and Metadata Columns

docs.snowflake.com/en/user-guide/streams-intro

→

❄️

Introduction to Tasks: Schedules, Serverless, and the WHEN Gate

docs.snowflake.com/en/user-guide/tasks-intro

→

❄️

SYSTEM$STREAM_HAS_DATA: The Task WHEN Condition

docs.snowflake.com/en/sql-reference/functions/system_stream_has_data

→

🔗

External References

The CREATE STREAM reference lists the three stream types and the metadata columns. The task graph guide covers the root task, the AFTER parameter, and the graph limits.

Reference Snowflake Documentation

CREATE STREAM: Standard, Append-only, and Insert-only Streams

Guide Snowflake Documentation

Task Graphs: Root Tasks, AFTER Dependencies, and Limits

❓

Practice Questions

Q1: A stream on a table has change data. Which action advances the stream’s offset? ▸

Options:

A. Running SELECT * FROM the_stream
B. An INSERT INTO target SELECT ... FROM the_stream that commits
C. Calling SYSTEM$STREAM_HAS_DATA on the stream
D. Running DESCRIBE STREAM the_stream

✅ Answer: B

Why B: The offset advances only when DML consumes the stream. An INSERT that reads from the stream and commits moves the offset to the current version. The stream then looks empty.

Why not A: A plain SELECT only reads the change data. It never moves the offset, no matter how many times you run it.

Why not C: SYSTEM$STREAM_HAS_DATA reports whether changes exist. It does not consume them, so the offset stays put.

Why not D: DESCRIBE STREAM returns metadata about the stream object. It reads nothing from the change set.

Q2: Which statements correctly describe Snowflake stream types? (Select TWO) ▸

Options:

A. A standard stream tracks inserts, updates, and deletes.
B. An append-only stream records inserts, updates, and deletes.
C. An append-only stream records inserts only.
D. An insert-only stream is supported on any standard table.
E. A standard stream cannot track a table truncate.

✅ Answer: A and C

Why A and C: A standard stream is full change data capture: inserts, updates, deletes, including truncates. An append-only stream records inserts only, which makes it lighter for insert-driven pipelines.

Why not B: That describes a standard stream. Append-only ignores updates and deletes by design.

Why not D: Insert-only streams are supported on external and externally managed Iceberg tables, not standard tables. The object type is the trap.

Why not E: A standard stream does track truncates. It records them as deletes for the affected rows.

Q3: A task uses WHEN SYSTEM$STREAM_HAS_DATA(‘s’) and is scheduled every minute. The stream ‘s’ has no change data at the scheduled time. What happens? ▸

Options:

A. The task runs its SQL and inserts nothing.
B. The task run is skipped, and no warehouse is started.
C. The task fails with an error.
D. The task becomes suspended automatically.

✅ Answer: B

Why B: Snowflake validates the WHEN condition in the Cloud Services layer first. If SYSTEM$STREAM_HAS_DATA is FALSE, the run is skipped before any warehouse starts. This avoids paying for an empty run.

Why not A: The SQL body never runs when the gate is false. Snowflake does not start the statement at all.

Why not C: A skipped run is normal behavior, not an error. The task stays scheduled for its next check.

Why not D: A skipped run does not suspend the task. Auto-suspend applies to consecutive failures, which a skip is not.

Q4: Which statements about a task graph (DAG) are correct? (Select TWO) ▸

Options:

A. A task graph can have several root tasks running on different schedules.
B. Only the root task has a schedule; child tasks run after a predecessor via AFTER.
C. To modify a task in a running graph, suspend the root task first.
D. Each task in a graph may have a different owner role.
E. A child task must define its own SCHEDULE.

✅ Answer: B and C

Why B and C: A graph has one scheduled root task. Children link with AFTER and have no schedule of their own. To change any task in the graph, you suspend the root first, then edit, then resume.

Why not A: A task graph has exactly one root task, not several. One schedule drives the whole graph.

Why not D: All tasks in a graph must share a single owner role with the OWNERSHIP privilege on each.

Why not E: A child task cannot have its own schedule. The AFTER dependency replaces the schedule for children.

Q5: A candidate states: “A stream stores the changed rows in its own storage.” Is this correct, and why? ▸

Options:

A. Correct. A stream copies each changed row into a hidden change table.
B. Incorrect. A stream stores an offset and reads the source table’s version history.
C. Correct. A stream stores changed rows for 14 days, then deletes them.
D. Incorrect. A stream stores nothing and cannot return change data.

✅ Answer: B

Why B: A stream stores only an offset, a pointer to a source version. It returns change data by reading the source table’s own version history. It holds no copy of the rows.

Why not A: There is no hidden copy of changed rows inside the stream. The change data comes from the source versioning.

Why not C: The 14-day figure is the default retention extension that prevents a stream going stale. It is not a store of copied rows.

Why not D: The stream does hold an offset, and it does return change data on query. Saying it stores nothing and returns nothing is wrong on both counts.

📝 Recap

Today you learned: A stream tracks the DML changes on a source table. It stores an offset, not the data, and reads the source’s version history to return changes. Standard streams track inserts, updates, and deletes. Append-only streams track inserts on tables, directory tables, and views. Insert-only streams serve external and externally managed Iceberg tables. A stream adds METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID. It records an update as a delete plus an insert. The offset advances only when DML consumes the stream, never on a plain query. A task runs one statement on a schedule, after another task, or when a stream gate passes. Omit the warehouse for a serverless task. A task graph has one scheduled root task, with children linked by AFTER.

Key takeaway: Two facts carry most of the Day 33 questions. A stream stores an offset, not the changed rows. The offset moves only when DML consumes the stream. Add the serverless rule (omit the warehouse) and the one-root-task rule for graphs. The rest of the syntax follows.

Tomorrow (Day 34): Drivers, connectors, and integrations. We move from automating change inside Snowflake to connecting it to the outside. The session covers client drivers and connectors, plus storage, API, and Git integrations that let Snowflake reach external systems safely.

← Day 32 All 50 Days Day 34 →

About the Author

Abhay Krishnan

Senior Data & AI Consultant

Connect on LinkedIn

With over five years of data engineering experience at EY and Infosys, Abhay Krishnan specializes in building scalable data pipelines and cloud warehousing solutions. He is a certified SnowPro Core professional, alongside credentials in AWS and Azure. Abhay created this 50-day track to solve a problem he faced firsthand: the lack of a structured, free resource for Snowflake certification prep. Follow him on LinkedIn for more data engineering insights.

Snowflake Streams & Tasks – Change Data Capture & Scheduling

Today’s Concept

Micro-Concept 1: A Stream Tracks Change, It Does Not Store It

Micro-Concept 2: The Three Stream Types and the Metadata Columns

Micro-Concept 3: The Offset Only Moves When DML Consumes the Stream

Micro-Concept 4: Tasks Run SQL on a Schedule or on a Stream Gate

Micro-Concept 5: Task Graphs Chain Many Tasks

Cheat Sheet

Exam Tip

Hands-On Lab

Snowflake Docs

External References

Practice Questions

Topics