D3: Data Loading Week 5

DAY 31

Snowflake Data Unloading & Parquet Type Mapping

Day 30 read staged files from a stage into tables. Today’s command sends those same rows the other way. The command keeps its name: COPY INTO <location>. The output defaults and the Parquet type rules are where the exam questions sit. One Parquet fact below appears as its own single-select question.

🗣️ Plain-English First

Unloading reuses one command and adds five new options. Each option has a default the exam assumes when the stem stays silent. Read this table before the concepts. The SINGLE and HEADER rows are tested most often.

Word you know	What it usually means	What it means in Snowflake
Unload	Take cargo off a truck	Write table rows out to files in a stage. The reverse of loading. The command is `COPY INTO <location>`, with a stage path on the left.
SINGLE	One of something	A copy option for file count. `FALSE` is the default and writes many files in parallel. `TRUE` writes exactly one file.
MAX_FILE_SIZE	The biggest a file can be	An upper target in bytes for each unloaded file. Default is 16 MB. Snowflake aims for it but does not promise it.
HEADER	A heading at the top of a page	A copy option that writes the column names as the first row of each output file. Off by default.
PARTITION BY	Sort items into bins	A copy option that splits rows into separate files using an expression. It builds a folder layout inside the stage.

Why the SINGLE row matters today: the default is multiple files, not one. Candidates who expect a single output file pick the wrong answer on the most basic unload question. Get the default right and the rest of the options follow from it.

📘

Today’s Concept

Micro-Concept 1: COPY INTO <location>, the Unload Direction

The same COPY INTO command runs in two directions. Day 30 used the load form, stage to table. Today uses the unload form, table to stage. The stage path moves to the left of the FROM clause. That single change flips the direction.

The source can be a table. It can also be a SELECT query. A query source accepts the full Snowflake SQL syntax, including JOIN. This is the reverse of the load side. A load transformation could not join, but an unload query can.

SQL

-- Unload a whole table to a stage path
COPY INTO @my_stage/unload/
  FROM sales
  FILE_FORMAT = (TYPE = CSV);

-- Unload the result of a query, joins allowed
COPY INTO @my_stage/unload/
  FROM (
    SELECT s.order_id, c.region, s.amount
    FROM sales s
    JOIN customers c ON s.cust_id = c.cust_id
  )
  FILE_FORMAT = (TYPE = CSV);

Three destinations are valid: an internal stage, an external stage, and a direct external location. With no options set, the output is CSV. The files are GZIP-compressed and UTF-8 encoded. Those three defaults sit behind many “what does the file look like” stems.

Micro-Concept 2: One File or Many, SINGLE and MAX_FILE_SIZE

The first decision an unload makes is file count. The SINGLE option controls it. The default is SINGLE = FALSE. That setting writes multiple files. Each parallel thread writes its own file. The names follow a pattern like data_0_0_0.csv.gz.

The file count scales with the warehouse. One server writes up to eight files at once. A larger warehouse has more servers, so it produces more, smaller files. This is the opposite of intuition: scaling up makes files smaller, not bigger.

Set SINGLE = TRUE to force one output file. It runs without the parallel split, so it is slower on large tables. With SINGLE = TRUE and no filename in the path, the file lands with no extension. Supply a filename and extension in the path to fix that.

Option	Default	What it controls
`SINGLE`	`FALSE`	One file (`TRUE`) versus many parallel files (`FALSE`).
`MAX_FILE_SIZE`	16 MB	Upper target in bytes per file. Raise it for fewer, larger files.

MAX_FILE_SIZE sets the upper target for each file in a multi-file unload. The default is 16 MB (16,777,216 bytes). The maximum supported value is 5 GB for an external stage. The target is not a promise. A file can finish smaller when memory runs short. A single set of rows can push a file slightly past the limit.

Micro-Concept 3: HEADER and OVERWRITE

Two options decide what the output looks like and what happens to old files. Both default to off. Both catch candidates who assume the friendlier behaviour.

HEADER writes the column names as the first row. The default is FALSE, so unloaded files carry no header by default. Set HEADER = TRUE when a downstream tool expects a name row. On a recent migration, an unload left HEADER at its default. The loader on the far side expected a header line and rejected every file.

OVERWRITE decides what happens when a file of the same name already sits in the path. The default is FALSE. A second unload to a path with matching filenames raises an error instead of replacing them. Set OVERWRITE = TRUE to replace the old files. This is the unload-side equivalent of a name collision, not the 64-day load metadata from Day 30.

SQL

-- One file, with a header row, replacing any old file
COPY INTO @my_stage/unload/orders.csv
  FROM sales
  FILE_FORMAT = (TYPE = CSV)
  SINGLE = TRUE
  HEADER = TRUE
  OVERWRITE = TRUE;

Micro-Concept 4: Parquet Type Mapping, the Fact With Its Own Question

Unloading to Parquet changes column types in a fixed, documented way. The exam tests one mapping directly. A fixed-point column unloads to a Parquet DECIMAL column. The precision and scale are preserved. A NUMBER(12,2) total stays exact in the file.

A floating-point column unloads to a Parquet DOUBLE column. That is the pair the exam plays on. Fixed-point goes to DECIMAL. Floating-point goes to DOUBLE. The trap swaps them and claims fixed-point becomes DOUBLE. That claim is false.

A few other mappings are worth a glance. VARIANT, GEOGRAPHY, and GEOMETRY unload as JSON-encoded strings. TIMESTAMP_NTZ(9) unloads as milliseconds, not nanoseconds. Some types must be cast before they unload. You can also cast a number explicitly to choose its Parquet type.

SQL

-- Fixed-point NUMBER(12,2) lands as Parquet DECIMAL, precision kept
COPY INTO @my_stage/unload/parquet/
  FROM (SELECT order_id, amount FROM sales)
  FILE_FORMAT = (TYPE = PARQUET);

-- Cast explicitly to pick the Parquet integer width
COPY INTO @my_stage/unload/parquet/
  FROM (SELECT CAST(order_id AS INT) FROM sales)
  FILE_FORMAT = (TYPE = PARQUET);

Micro-Concept 5: Partitioned Unload With PARTITION BY

The PARTITION BY option splits rows into separate files using an expression. Each distinct value of the expression gets its own path. The result is a folder layout in the stage. This is how Snowflake feeds a data lake that reads by directory.

The filenames carry the partition values and a data_ prefix. Snowflake also adds a UUID to each filename. The reason is INCLUDE_QUERY_ID. It defaults to TRUE the moment PARTITION BY is set. The UUID is the query ID of the unload statement.

Two cautions appear in the documentation. An ORDER BY paired with PARTITION BY does not guarantee row order in the files. Snowflake also advises partitioning on dates or timestamps rather than on sensitive string or numeric values. The partition values become visible in the file paths. In production projects I have worked on, PARTITION BY on a date column is the standard way to lay out lake data by day.

SQL

-- Split rows into one folder per order status
COPY INTO @my_stage/unload/parts/
  FROM (SELECT order_id, order_status, amount FROM sales)
  PARTITION BY ('status=' || order_status)
  FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
  HEADER = TRUE;
-- Produces paths like .../parts/status=O/data_..._.csv.gz

One account-level detail rounds this out. Two parameters can block unloading entirely. PREVENT_UNLOAD_TO_INLINE_URL stops ad hoc unloads to a cloud URL written into the statement. PREVENT_UNLOAD_TO_INTERNAL_STAGES stops unloads to any internal stage. Both are governance controls an admin sets, not copy options.

⚡

Cheat Sheet

Concept	What to remember	Exam keyword
Unload command	`COPY INTO <location>`. Table or query out to a stage	“unload = reverse COPY”
Query source	A `SELECT` source allows `JOIN`, unlike a load transform	“unload SELECT can JOIN”
Default output	CSV, GZIP-compressed, UTF-8, multiple files	“CSV GZIP by default”
SINGLE	Default `FALSE` = many parallel files. `TRUE` = one file	“SINGLE default FALSE”
File count vs warehouse	Bigger warehouse makes more, smaller files. 8 files per server	“scale up = smaller files”
MAX_FILE_SIZE	Default 16 MB. Max 5 GB external. Upper target, not a promise	“16 MB default”
HEADER	Default `FALSE`. Set `TRUE` for a column-name row	“HEADER default FALSE”
OVERWRITE	Default `FALSE`. Same-named files raise an error	“OVERWRITE = TRUE to replace”
Parquet fixed-point	NUMBER / DECIMAL unloads to Parquet `DECIMAL`, precision kept	“fixed-point = DECIMAL”
Parquet floating-point	FLOAT / DOUBLE unloads to Parquet `DOUBLE`	“float = DOUBLE”
PARTITION BY	Splits rows into files by an expression. Builds a folder layout	“PARTITION BY”
Partition filenames	Carry partition values plus a UUID (`INCLUDE_QUERY_ID` default TRUE)	“UUID in partition files”

🎯

Exam Tip

🎯 Exam Tip

Domain 3 unload questions cluster around three patterns. Each maps to one fact you can state in a sentence.

First: “What is the default?” The three that get tested are SINGLE = FALSE (multiple files), HEADER = FALSE (no header row), and MAX_FILE_SIZE = 16 MB. A stem that says “by default” wants one of these three. The common miss is assuming a single output file.

Second: “Parquet type mapping.” Fixed-point columns unload to DECIMAL. Floating-point columns unload to DOUBLE. The trap statement reverses the pair. Read the column type in the stem, then match it to the right Parquet type.

Third: “Which option builds a directory layout?” The stem describes a folder-per-value structure for a data lake. The answer is PARTITION BY. The distractors are SINGLE and MAX_FILE_SIZE. Both change file count and size but never build folders.

🛠️

Hands-On Lab

Type: LAB (guided) | Time: ~20 minutes | Credits: <0.5 | Prerequisite: lab_xs warehouse (Day 1) and day10_orders table (Day 10). Both should still be present. This lab only reads day10_orders; it is never modified.

Set up a fresh stage for the unloaded files. The persistent day10_orders table is the read-only source for every step below.

SQL

USE WAREHOUSE lab_xs;

CREATE OR REPLACE STAGE day31_stage
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

-- Confirm the source table is present and untouched
SELECT COUNT(*) AS source_rows FROM day10_orders;

👀 Observe: source_rows returns roughly 1.5 million. The SNOWFLAKE_SSE setting on the stage is the Day 29 habit carried forward. Nothing in this lab writes to day10_orders; every step unloads from it.

Multi-file unload with the default SINGLE = FALSE. A small MAX_FILE_SIZE forces several output files.

SQL

COPY INTO @day31_stage/unload/multi/
  FROM day10_orders
  FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
  HEADER = TRUE
  MAX_FILE_SIZE = 5000000
  OVERWRITE = TRUE;

LIST @day31_stage/unload/multi/;

👀 Observe: LIST returns several files named like data_0_0_0.csv.gz. The 5 MB cap split the table across files. HEADER = TRUE wrote a column-name row into each one. Parallel threads produced the multiple files.

Single-file unload with SINGLE = TRUE. A named path gives the file a clean extension.

SQL

COPY INTO @day31_stage/unload/single/orders_sample.csv
  FROM (SELECT o_orderkey, o_orderstatus, o_totalprice
        FROM day10_orders LIMIT 1000)
  FILE_FORMAT = (TYPE = CSV)
  SINGLE = TRUE
  HEADER = TRUE
  OVERWRITE = TRUE;

LIST @day31_stage/unload/single/;

👀 Observe: LIST shows exactly one file under single/. SINGLE = TRUE skipped the parallel split. The filename in the path gave it the .csv extension. Without that name, the file would land with no extension.

Parquet unload to see the type mapping. The o_totalprice column is a fixed-point NUMBER(12,2).

SQL

COPY INTO @day31_stage/unload/parquet/
  FROM (SELECT o_orderkey, o_orderstatus, o_totalprice
        FROM day10_orders LIMIT 10000)
  FILE_FORMAT = (TYPE = PARQUET);

LIST @day31_stage/unload/parquet/;

👀 Observe: LIST returns one or more .parquet files. Inside them, o_totalprice is a Parquet DECIMAL column. The precision and scale of NUMBER(12,2) are preserved. A floating-point column here would have become DOUBLE instead.

Partitioned unload into a folder layout. Each order status gets its own directory.

SQL

COPY INTO @day31_stage/unload/parts/
  FROM (SELECT o_orderkey, o_orderstatus, o_totalprice
        FROM day10_orders)
  PARTITION BY ('status=' || o_orderstatus)
  FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
  HEADER = TRUE
  MAX_FILE_SIZE = 5000000;

LIST @day31_stage/unload/parts/;

👀 Observe: LIST shows subfolders such as status=O/, status=F/, and status=P/. The three order-status values became three partition paths. Each filename carries a UUID, because INCLUDE_QUERY_ID defaulted to TRUE once PARTITION BY was set.

Clean up the unloaded files and keep the source table.

SQL

-- Remove every unloaded file under this stage
REMOVE @day31_stage/unload/;

-- Drop the stage created for this lab
DROP STAGE IF EXISTS day31_stage;

-- Confirm the persistent table is untouched (Days 32 and 35 read it)
SELECT COUNT(*) AS row_count FROM day10_orders;

👀 Observe: REMOVE clears all unloaded files in one statement. After the DROP STAGE, the lab leaves nothing behind. The final SELECT COUNT(*) on day10_orders returns roughly 1.5 million rows, confirming the persistent Day 10 table is intact for tomorrow.

📚

Snowflake Docs

Authoritative references for every fact in today’s post. The COPY INTO <location> page and the considerations page are the two worth re-reading the day before the exam.

❄️

COPY INTO <location> Reference

docs.snowflake.com/en/sql-reference/sql/copy-into-location

→

❄️

Overview of Data Unloading

docs.snowflake.com/en/user-guide/data-unload-overview

→

❄️

Data Unloading Considerations: Parquet Type Mapping and File Sizing

docs.snowflake.com/en/user-guide/data-unload-considerations

→

🔗

External References

Two Snowflake guides that walk through the unload steps and the download path.

Guide Snowflake Documentation

Unload Into a Snowflake Stage: COPY INTO, LIST, and GET

Guide Snowflake Documentation

Unload Data From Snowflake: Stages, Formats, and Cloud Targets

❓

Practice Questions

Q1: When table data is unloaded to Parquet files, fixed-point number columns are unloaded as columns with what data type? ▸

Options:

A. DECIMAL
B. DOUBLE
C. INTEGER
D. VARCHAR

✅ Answer: A

Why A: Fixed-point types (NUMBER and DECIMAL with defined precision and scale) unload to Parquet DECIMAL columns. The precision and scale carry over unchanged, so the values stay exact.

Why not B: DOUBLE is where floating-point columns land on a Parquet unload. Swapping it with fixed-point is the intended trap.

Why not C: INTEGER would drop the fractional part. It is never the mapping for a fixed-point column.

Why not D: VARCHAR would discard the numeric type entirely. Parquet keeps the number, not a string.

Q2: By default, how does COPY INTO <location> write its output when unloading a table? ▸

Options:

A. As a single file named after the source table.
B. As multiple files written by parallel threads.
C. As one file per micro-partition in the table.
D. As a single compressed archive of all rows.

✅ Answer: B

Why B: SINGLE defaults to FALSE. The unload splits across parallel threads and writes multiple files. The count scales with the warehouse size.

Why not A: SINGLE = TRUE produces one file, but it is not the default. The default is many files.

Why not C: Output file count tracks parallel threads and MAX_FILE_SIZE, not the source micro-partition count.

Why not D: Files are GZIP-compressed one by one, not bundled into a single archive.

Q3: A CSV unload finished, but the output files have no column-name row. Which option was left at its default? ▸

Options:

A. SINGLE
B. HEADER
C. OVERWRITE
D. COMPRESSION

✅ Answer: B

Why B: HEADER defaults to FALSE. Snowflake writes no column-name row unless you set HEADER = TRUE.

Why not A: SINGLE controls file count. It has no effect on whether a header row appears.

Why not C: OVERWRITE decides whether existing files are replaced. It does not add a header.

Why not D: COMPRESSION sets the codec. A GZIP default still leaves the header off on its own.

Q4: What is the default value of MAX_FILE_SIZE for COPY INTO <location>, and what does it control? ▸

Options:

A. 16 MB, an upper target for each generated file.
B. 5 GB, a fixed size that every file must reach.
C. 128 MB, the Parquet row-group size.
D. 256 MB, the limit for single-file mode.

✅ Answer: A

Why A: MAX_FILE_SIZE defaults to 16 MB. It is an upper target, not a guarantee. Files can finish smaller. A single set of rows can push a file slightly past the limit.

Why not B: 5 GB is the maximum supported value for an external stage, not the default. Nothing forces a file to reach it.

Why not C: 128 MB is the Parquet row-group size, a separate internal detail unrelated to the default.

Why not D: 256 MB is not a documented default. The intended target value is 16 MB.

Q5: Which statements about PARTITION BY in COPY INTO <location> are correct? (Select TWO) ▸

Options:

A. It writes partition values into the unloaded file names and paths.
B. It guarantees row order inside each file when combined with ORDER BY.
C. INCLUDE_QUERY_ID defaults to TRUE once PARTITION BY is set.
D. It only works when the unload target format is Parquet.
E. It requires SINGLE = TRUE to take effect.

✅ Answer: A and C

Why A and C: PARTITION BY splits rows into separate files by an expression. The partition values appear in the file paths and names. With partitioning on, INCLUDE_QUERY_ID defaults to TRUE, so a UUID is added to each filename.

Why not B: Pairing ORDER BY with PARTITION BY does not guarantee the order survives in the files.

Why not D: Partitioned unload works for CSV and JSON as well as Parquet. It is not Parquet-only.

Why not E: SINGLE = TRUE forces one file. That conflicts with splitting rows across partition files.

📝 Recap

Today you learned: COPY INTO <location> unloads a table or a query result to a stage. A query source can use JOIN, unlike a load transform. The default output is multiple GZIP-compressed CSV files in UTF-8. SINGLE = FALSE is the default, so a single file needs SINGLE = TRUE. MAX_FILE_SIZE defaults to 16 MB as an upper target, not a promise. HEADER and OVERWRITE both default to FALSE. On a Parquet unload, fixed-point columns become DECIMAL and floating-point columns become DOUBLE. PARTITION BY splits rows into a folder layout and adds a UUID to each filename.

Key takeaway: Three facts carry most of the Day 31 questions. The unload defaults are SINGLE = FALSE, HEADER = FALSE, and MAX_FILE_SIZE = 16 MB. The Parquet pair is fixed-point to DECIMAL, floating-point to DOUBLE. The directory-layout option is PARTITION BY. Get those three and the rest is option syntax.

Tomorrow (Day 32): Snowpipe and Snowpipe Streaming. We move from manual unloads to continuous ingest. Snowpipe is serverless and needs no warehouse. Its ON_ERROR default flips to SKIP_FILE, the number we flagged back on Day 30. Snowpipe Streaming drops latency to row-level inserts through a separate API.

← Day 30 All 50 Days Day 32 →

About the Author

Abhay Krishnan

Senior Data & AI Consultant

Connect on LinkedIn

With over five years of data engineering experience at EY and Infosys, Abhay Krishnan specializes in building scalable data pipelines and cloud warehousing solutions. He is a certified SnowPro Core professional, alongside credentials in AWS and Azure. Abhay created this 50-day track to solve a problem he faced firsthand: the lack of a structured, free resource for Snowflake certification prep. Follow him on LinkedIn for more data engineering insights.

Snowflake Data Unloading & Parquet Type Mapping

Today’s Concept

Micro-Concept 1: COPY INTO <location>, the Unload Direction

Micro-Concept 2: One File or Many, SINGLE and MAX_FILE_SIZE

Micro-Concept 3: HEADER and OVERWRITE

Micro-Concept 4: Parquet Type Mapping, the Fact With Its Own Question

Micro-Concept 5: Partitioned Unload With PARTITION BY

Cheat Sheet

Exam Tip

Hands-On Lab

Snowflake Docs

External References

Practice Questions

Topics