D3: Data Loading Week 5

DAY 30

Snowflake File Formats & COPY INTO for Bulk Loading

Day 29 left the stages full of files. Today reads those files into tables. The command is COPY INTO <table>. One default behind it catches more candidates than any other fact in Domain 3. Miss the ON_ERROR default and you miss the most-repeated loading question on the exam.

🗣️ Plain-English First

Loading introduces six new terms. Each one shows up in exam stems with a meaning narrower than the everyday word suggests. Read this table before the concepts. Two of the rows (load metadata, PURGE) are tested directly.

Word you know	What it usually means	What it means in Snowflake
File format	A file type like .docx or .pdf	A named object that stores parsing rules: delimiter, header rows to skip, compression, encoding. You attach it to a load so Snowflake knows how to read the file.
COPY INTO	Paste something into a place	The bulk load command. It reads files from a stage, parses them with a file format, then writes rows into a table. The same command unloads in reverse: table to stage.
ON_ERROR	A generic “what to do on failure” toggle	The copy option that decides what happens when a row fails to parse. Five values exist. The default is different for bulk loading than for Snowpipe.
Load metadata	Notes about a file	Per-table records of which files were already loaded. Snowflake reads it to skip duplicates. It expires after 64 days.
PURGE	Delete everything in a sweep	A copy option that deletes the staged files after a successful load. Off by default. One-way: once the file is gone, your replay options shrink.
FORCE	Make something happen anyway	A copy option that loads files even when the load metadata says they were already loaded. It can duplicate rows in the target table.

Why the load-metadata row matters today: the 64-day window is the engine behind three separate exam questions. It explains why a reload returns zero rows. It explains why FORCE exists. It explains why LOAD_UNCERTAIN_FILES exists. Get the window and those three facts fall out of it.

📘

Today’s Concept

Micro-Concept 1: File Formats and the FILE FORMAT Object

Snowflake reads six file types: CSV, JSON, Avro, ORC, Parquet, and XML. CSV is the default. Run COPY INTO with no format specified and Snowflake assumes comma-delimited CSV in UTF-8.

A file format is the set of parsing rules for one of those types. You can write the rules inline on the COPY command. You can also store them once as a named object with CREATE FILE FORMAT. A named format is reusable across many loads. That is the recommended pattern for any load you run more than once.

SQL

-- A reusable named format
CREATE FILE FORMAT csv_format
  TYPE = CSV
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1;

-- Same rules written inline, no stored object
COPY INTO my_table
  FROM @my_stage
  FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1);

The exam tests recognition of the six types and the two ways to supply a format. A common stem hands you a JSON file and a CSV format object. The correct answer flags the type mismatch. Type and options must match the actual file.

Micro-Concept 2: The File Format Options That Get Tested

A handful of options carry almost all the exam weight. Learn these defaults. The default is what the stem assumes when an option is absent.

Option	Default	What it controls
`FIELD_DELIMITER`	`,` (comma)	The character between fields. Pipe and tab files set this explicitly.
`SKIP_HEADER`	`0`	Number of leading lines to skip. Set to 1 for a single header row.
`COMPRESSION`	`AUTO`	Snowflake detects gzip, bzip2, and others by file extension. Set `NONE` for plain text.
`ERROR_ON_COLUMN_COUNT_MISMATCH`	`TRUE`	Errors when a row’s column count differs from the table. The CSV trap below.
`FIELD_OPTIONALLY_ENCLOSED_BY`	none	The quote character around values that contain the delimiter, often `'"'`.

The option worth memorising is ERROR_ON_COLUMN_COUNT_MISMATCH. It defaults to TRUE for CSV. A file with the wrong delimiter collapses every row into one column. The column count no longer matches the table. The load fails before a single row lands. Setting the option to FALSE tolerates the mismatch instead of failing.

Micro-Concept 3: COPY INTO, the Load Command

The load direction of COPY INTO takes three things: a target table, a source stage, and a file format. The stage was today’s Day 29 subject. It now becomes the source side of every example.

SQL

-- Load every file in the stage path into the table
COPY INTO sales
  FROM @my_stage/inbox/
  FILE_FORMAT = (FORMAT_NAME = csv_format);

-- Load a named subset of files
COPY INTO sales
  FROM @my_stage
  FILES = ('jan.csv', 'feb.csv')
  FILE_FORMAT = (FORMAT_NAME = csv_format);

Two filters narrow which files load. FILES takes an explicit list, up to 1000 names. PATTERN takes a regular expression. The documentation advises against using both together. When both appear, only the FILES list loads.

Micro-Concept 4: ON_ERROR and Its Two Defaults

ON_ERROR decides what happens when a row fails to parse. Five values are valid:

→ ABORT_STATEMENT stops the whole load on the first bad row.
→ CONTINUE loads the good rows and reports the bad ones.
→ SKIP_FILE skips the entire file containing any error.
→ SKIP_FILE_n skips the file once its error count reaches n.
→ SKIP_FILE_n% skips the file once its error percentage reaches n.

The tested fact is the default. For bulk COPY INTO <table> the default is ABORT_STATEMENT. For Snowpipe (Day 32) the default flips to SKIP_FILE. Same option, two defaults, depending on how the load runs. In training sessions I have run, this single distinction trips more candidates than any other copy option.

The reasoning behind the split is practical. A bulk load is a deliberate, one-shot operation where a silent partial load would corrupt downstream tables. A Snowpipe load runs continuously on files you do not inspect first. Skipping one bad file keeps the pipe flowing. The exam states the rule plainly: bulk defaults to abort, Snowpipe defaults to skip.

Micro-Concept 5: Load History and the 64-Day Window

Snowflake records which files it has already loaded into each table. This load metadata lives in the target table’s metadata. It is what stops a second COPY INTO from loading the same file twice. Re-run an identical load and Snowflake reports zero files processed. The dedup is automatic.

The metadata expires after 64 days. Past that window the load status of an old file becomes uncertain. Snowflake’s response is to skip the file rather than risk a duplicate. Two conditions both have to be true for a file to count as uncertain. The file’s last-modified date is older than 64 days. The table’s first load also happened more than 64 days ago.

Two copy options override the skip. LOAD_UNCERTAIN_FILES = TRUE loads files whose status can no longer be determined. FORCE = TRUE loads every file regardless of status. The narrower tool is LOAD_UNCERTAIN_FILES. It touches only the expired files. FORCE reloads everything, including files that are clearly already loaded. Practice Q3 below tests both as the answer to a single multi-select.

One number worth tucking away for the Snowpipe day: Snowpipe stores its load history in the pipe object for 14 days, not 64. The 64-day figure is bulk loading only. The exam writes both numbers as distractors in the same question.

Micro-Concept 6: PURGE and FORCE

Two copy options manage the files after a load. They are easy to confuse. They do opposite jobs.

Option	Default	Effect
`PURGE = TRUE`	`FALSE`	Deletes the staged files after a successful load. If the delete itself fails, no error is raised.
`FORCE = TRUE`	`FALSE`	Loads files even when load metadata marks them as already loaded. Can duplicate rows.

PURGE is the stage-cleanup option. It keeps an internal stage from filling up over months of loads. The catch is silence: a failed purge returns no error. The documented safety check is to LIST the stage afterward and confirm the files are gone. In production projects I have worked on, FORCE = TRUE is the fastest way to silently duplicate millions of rows. It bypasses the exact dedup that load metadata provides.

Micro-Concept 7: Transforming Data During the Load

A COPY INTO can run a limited transformation by reading from a SELECT instead of a plain stage. This avoids a staging table for simple reshaping. The supported set is narrow.

Supported during a load: reordering columns, omitting columns, casting data types, truncating long strings, and applying scalar functions or scalar SQL UDFs. The source file and the target table need not share column count or order.

Not supported, and a frequent trap: WHERE filtering, JOIN, GROUP BY or any aggregate, FLATTEN, and the ORDER BY, LIMIT, TOP keywords. The DISTINCT keyword is not fully supported either. VALIDATION_MODE also stops working once a transformation is present.

SQL

-- Allowed: reorder, cast, and omit columns from a SELECT
COPY INTO trimmed_orders (order_id, status, amount)
  FROM (
    SELECT $1::INT, $3::STRING, $4::FLOAT
    FROM @my_stage/orders/
  )
  FILE_FORMAT = (FORMAT_NAME = csv_format);

-- Rejected: a WHERE filter during load raises an error
-- COPY INTO trimmed_orders FROM (SELECT $1 FROM @my_stage WHERE $1 > 100) ...

The exam phrases this as “which transformation can be applied during a load.” The wrong answers list JOIN, GROUP BY, or a WHERE filter. The right answers are reorder, cast, and omit. If a stem needs a join or an aggregate, the answer is to load first, then transform in SQL.

⚡

Cheat Sheet

Concept	What to remember	Exam keyword
File types	CSV, JSON, Avro, ORC, Parquet, XML. CSV is the default	“six file formats”
Format object	`CREATE FILE FORMAT` for reuse, or inline `TYPE = ...` on COPY	“FORMAT_NAME vs inline”
ERROR_ON_COLUMN_COUNT_MISMATCH	Default `TRUE` for CSV. Wrong delimiter fails the load	“column count mismatch”
ON_ERROR values	CONTINUE, SKIP_FILE, SKIP_FILE_n, SKIP_FILE_n%, ABORT_STATEMENT	“five ON_ERROR options”
ON_ERROR default (bulk)	`ABORT_STATEMENT`. Aborts on the first error	“bulk = ABORT_STATEMENT”
ON_ERROR default (Snowpipe)	`SKIP_FILE`. Keeps the pipe flowing	“Snowpipe = SKIP_FILE”
Load metadata window	64 days in the target table. Reload of a known file = 0 rows	“64 days”
Snowpipe load history	14 days in the pipe object, not 64	“pipe = 14 days”
LOAD_UNCERTAIN_FILES	Loads only files whose status expired. Default `FALSE`	“uncertain files”
FORCE	Loads all files regardless of history. Can duplicate rows	“FORCE = TRUE”
PURGE	Deletes staged files after a successful load. Failed purge is silent	“PURGE = TRUE”
Transform allowed	Reorder, omit, cast, truncate, scalar functions	“reorder, cast, omit”
Transform blocked	WHERE, JOIN, GROUP BY, FLATTEN, ORDER BY, LIMIT, DISTINCT	“no JOIN or aggregate”

🎯

Exam Tip

🎯 Exam Tip

Domain 3 loading questions cluster around three patterns. Each maps to one fact you can state in a sentence.

First: “What is the default ON_ERROR?” The trap is that the answer depends on the load type. Bulk COPY INTO defaults to ABORT_STATEMENT. Snowpipe defaults to SKIP_FILE. A stem that says “bulk loading” wants ABORT_STATEMENT. A stem that says “continuous” or “pipe” wants SKIP_FILE. Read for that word.

Second: “How do I reload a file after 64 days?” Two options work: LOAD_UNCERTAIN_FILES = TRUE and FORCE = TRUE. This is usually a multi-select expecting both. The distractors are “re-stage the file” and “truncate the table,” neither of which resets the file-level load history.

Third: “Which transformation runs during a load?” The answer set is reorder, omit, cast, truncate. The wrong answers are JOIN, GROUP BY, and a WHERE filter. None of those run inside a COPY. If the question needs them, the load happens first and the transform happens after.

🛠️

Hands-On Lab

Type: LAB (guided) | Time: ~20 minutes | Credits: <0.5 | Prerequisite: lab_xs warehouse (Day 1) and day10_orders table (Day 10). Both should still be present. This lab only reads day10_orders; it is never modified.

Set up the format, target table, and a stage. This lab is self-contained. It generates its own source files, so no manual upload is needed.

SQL

USE WAREHOUSE lab_xs;

CREATE OR REPLACE FILE FORMAT csv_format
  TYPE = CSV
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1
  COMPRESSION = NONE;

CREATE OR REPLACE TABLE load_test (id INT, name STRING, amount FLOAT);

CREATE OR REPLACE STAGE day30_stage
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

👀 Observe: The file format object is created once and reused below by name. COMPRESSION = NONE keeps the generated files as plain text so you can inspect them. The SNOWFLAKE_SSE setting on the stage is the Day 29 habit carried forward.

Generate real source files by unloading from day10_orders. This produces an actual CSV in the stage. It reads the persistent table without changing it.

SQL

-- Unload 500 rows as a single CSV with a header line
COPY INTO @day30_stage/load/
  FROM (SELECT o_orderkey, o_orderstatus, o_totalprice
        FROM day10_orders LIMIT 500)
  FILE_FORMAT = (FORMAT_NAME = csv_format)
  HEADER = TRUE
  SINGLE = TRUE
  OVERWRITE = TRUE;

LIST @day30_stage/load/;

👀 Observe: LIST shows one file under the load/ path. HEADER = TRUE wrote a header row. The format’s SKIP_HEADER = 1 skips it on the way back in. The three unloaded columns map to load_test as id, name, and amount.

Run the first load and confirm the row count.

SQL

COPY INTO load_test
  FROM @day30_stage/load/
  FILE_FORMAT = (FORMAT_NAME = csv_format);

SELECT COUNT(*) AS rows_after_first_load FROM load_test;

👀 Observe: The COPY result shows the file loaded with 500 rows. rows_after_first_load returns 500. The status column reads LOADED.

Re-run the identical load and watch load metadata block it. This is the 64-day dedup in action.

SQL

COPY INTO load_test
  FROM @day30_stage/load/
  FILE_FORMAT = (FORMAT_NAME = csv_format);

SELECT COUNT(*) AS rows_after_reload FROM load_test;

👀 Observe: The COPY reports “Copy executed with 0 files processed.” rows_after_reload is still 500. The file was already loaded, so the load metadata skipped it. No duplicate landed.

Force a reload and watch the rows duplicate. This is exactly the production risk from Micro-Concept 6.

SQL

COPY INTO load_test
  FROM @day30_stage/load/
  FILE_FORMAT = (FORMAT_NAME = csv_format)
  FORCE = TRUE;

SELECT COUNT(*) AS rows_after_force FROM load_test;

👀 Observe: FORCE = TRUE bypasses the dedup. The same 500 rows load a second time. rows_after_force returns 1000. This is the silent-duplication trap made concrete.

Trigger the ON_ERROR default with a deliberately wrong delimiter. A pipe format on a comma file collapses each row into one column.

SQL

CREATE OR REPLACE FILE FORMAT pipe_format
  TYPE = CSV
  FIELD_DELIMITER = '|'
  SKIP_HEADER = 1;

-- Default ON_ERROR = ABORT_STATEMENT: the whole load aborts
COPY INTO load_test
  FROM @day30_stage/load/
  FILE_FORMAT = (FORMAT_NAME = pipe_format)
  FORCE = TRUE;
-- Error: column count mismatch, statement aborted

-- ON_ERROR = CONTINUE: bad rows reported, statement does not abort
COPY INTO load_test
  FROM @day30_stage/load/
  FILE_FORMAT = (FORMAT_NAME = pipe_format)
  ON_ERROR = CONTINUE
  FORCE = TRUE;

👀 Observe: The first statement aborts on the first bad row. That is ABORT_STATEMENT, the bulk default, with nothing set. The second statement runs to completion and reports the errors instead. The only change between them is the explicit ON_ERROR = CONTINUE.

Demonstrate PURGE, then clean up. A forced load with PURGE deletes the staged file after it succeeds.

SQL

-- FORCE makes it load again; PURGE deletes the file after success
COPY INTO load_test
  FROM @day30_stage/load/
  FILE_FORMAT = (FORMAT_NAME = csv_format)
  FORCE = TRUE
  PURGE = TRUE;

LIST @day30_stage/load/;   -- the file is gone

-- Cleanup
DROP TABLE IF EXISTS load_test;
DROP FILE FORMAT IF EXISTS csv_format;
DROP FILE FORMAT IF EXISTS pipe_format;
DROP STAGE IF EXISTS day30_stage;

-- Confirm the persistent table is untouched (Day 31, 32, 35 use it)
SELECT COUNT(*) AS row_count FROM day10_orders;

👀 Observe: After the PURGE load, LIST returns no files. The stage path is empty. The final SELECT COUNT(*) on day10_orders returns roughly 1.5 million rows, confirming the persistent Day 10 table is intact for tomorrow.

📚

Snowflake Docs

Authoritative references for every fact in today’s post. The COPY INTO <table> page and the load-metadata page are the two worth re-reading the day before the exam.

❄️

COPY INTO <table> Reference

docs.snowflake.com/en/sql-reference/sql/copy-into-table

→

❄️

CREATE FILE FORMAT Reference

docs.snowflake.com/en/sql-reference/sql/create-file-format

→

❄️

Loading Data: Load Metadata and the 64-Day Window

docs.snowflake.com/en/user-guide/data-load-considerations-load

→

🔗

External References

Two Snowflake guides that go deeper on load transformations and file preparation.

Guide Snowflake Documentation

Transform Data During a Load: Supported and Unsupported Operations

Guide Snowflake Documentation

Preparing Your Data Files: Delimiters, Compression, and File Sizing

❓

Practice Questions

Q1: What is the default behavior for the ON_ERROR parameter when bulk loading using the COPY INTO <table> command? ▸

Options:

A. ABORT_STATEMENT
B. SKIP_FILE
C. CONTINUE
D. SKIP_FILE_3

✅ Answer: A

Why A: For bulk loading with COPY INTO <table>, the documented default is ABORT_STATEMENT. It aborts the entire COPY operation on the first error encountered in any file. Nothing loads when an error is hit.

Why not B: SKIP_FILE is the default for Snowpipe, not for bulk loading. The question specifies bulk COPY INTO, so this is the classic load-type trap.

Why not C: CONTINUE is a valid value but not a default. It must be set explicitly to load good rows and report the bad ones.

Why not D: SKIP_FILE_3 is a valid value, skipping a file once it hits three error rows. It is never a default.

Q2: What parameter or option can be used in the COPY INTO <table> command to delete files from a stage after the files have been successfully loaded? ▸

Options:

A. FORCE
B. FILES
C. PURGE
D. OVERWRITE

✅ Answer: C

Why C: PURGE = TRUE instructs Snowflake to delete the staged files after they load successfully into the target table. It is the standard way to clean up source files post-load.

Why not A: FORCE reloads files regardless of load history. It never deletes anything.

Why not B: FILES specifies an explicit list of files to load. It controls input selection, not cleanup.

Why not D: OVERWRITE applies to COPY INTO <location> for unloading. It does not delete files after a load.

Q3: How can files in a stage be loaded after the file metadata has expired? (Select TWO) ▸

Options:

A. Set the LOAD_UNCERTAIN_FILES option to TRUE.
B. Remove the files from the stage and reload them.
C. Set the FORCE parameter to TRUE.
D. Truncate the target table and reload the files.
E. Use the ALTER TABLE...REFRESH command.

✅ Answer: A and C

Why A and C: Snowflake keeps per-file load history for 64 days, after which the metadata expires. Two copy options reload such files. LOAD_UNCERTAIN_FILES = TRUE loads files whose status can no longer be determined. FORCE = TRUE loads files regardless of whether they were loaded before.

Why not B: Removing and re-staging the file does not reset the file-level load history that COPY checks. The status stays as it was.

Why not D: Truncating the target table clears the rows, not the file-load metadata. The COPY still skips the expired file.

Why not E: ALTER TABLE...REFRESH applies to external tables, not to stage-load metadata. It has no effect here.

Q4: For bulk loading with COPY INTO <table>, where is the load history stored, and for how long? ▸

Options:

A. In the metadata of the target table for 64 days.
B. In the metadata of the pipe for 14 days.
C. In the metadata of the target table for 14 days.
D. In the ACCOUNT_USAGE schema for 365 days.

✅ Answer: A

Why A: Bulk load history lives in the target table’s metadata, where it is retained for 64 days. Within that window, COPY can tell whether a file was already loaded and skip the duplicate. After 64 days the load status becomes uncertain.

Why not B: The pipe and the 14-day figure describe Snowpipe, not bulk loading. This is the standard bulk-versus-Snowpipe distractor pair.

Why not C: The location is right but the duration is wrong. 14 days is the Snowpipe number, not the 64-day bulk number.

Why not D: File-level load history is not the same as the ACCOUNT_USAGE.COPY_HISTORY view (Day 27). The dedup metadata is table-level and capped at 64 days.

Q5: Which operations can be performed as a transformation during a COPY INTO <table> load? (Select TWO) ▸

Options:

A. Reorder and omit columns using a SELECT.
B. Filter rows with a WHERE clause.
C. Cast a column to a different data type.
D. Join two staged files together.
E. Aggregate rows with GROUP BY.

✅ Answer: A and C

Why A and C: COPY transformations support a narrow set of reshaping: reordering columns, omitting columns, casting types, truncating strings, and scalar functions. Reading from a SELECT over the stage is how you apply them. The source file need not match the table’s column count or order.

Why not B: A WHERE filter is not supported during a load. The same applies to ORDER BY, LIMIT, and TOP.

Why not D: JOIN syntax is not supported in a COPY transformation. Combine files after loading instead.

Why not E: GROUP BY and any aggregate are blocked, as is FLATTEN. Load first, then aggregate in SQL.

📝 Recap

Today you learned: COPY INTO <table> reads staged files into a table using a file format. Snowflake supports six file types, with CSV as the default. ON_ERROR defaults to ABORT_STATEMENT for bulk loading and SKIP_FILE for Snowpipe. Load history sits in the target table for 64 days and prevents duplicate loads. After 64 days a file’s status becomes uncertain. LOAD_UNCERTAIN_FILES = TRUE or FORCE = TRUE reloads it. PURGE = TRUE deletes staged files after a successful load. A load can reorder, omit, and cast columns, but it cannot filter, join, or aggregate.

Key takeaway: Three facts carry most of the Day 30 questions. The ON_ERROR default splits by load type (bulk aborts, Snowpipe skips). The 64-day window drives FORCE and LOAD_UNCERTAIN_FILES. Load transformations are reshaping only, never filtering or joining. Get those three and the rest is option syntax.

Tomorrow (Day 31): Data unloading and Parquet type mapping. We reverse the direction with COPY INTO <location>. The options that matter are single versus multiple output files (SINGLE, MAX_FILE_SIZE) and the HEADER flag. One Parquet fact gets its own exam question: fixed-point columns unload to DECIMAL, not DOUBLE. The day10_orders table becomes the source we unload from.

← Day 29 All 50 Days Day 31 →

About the Author

Abhay Krishnan

Senior Data & AI Consultant

Connect on LinkedIn

With over five years of data engineering experience at EY and Infosys, Abhay Krishnan specializes in building scalable data pipelines and cloud warehousing solutions. He is a certified SnowPro Core professional, alongside credentials in AWS and Azure. Abhay created this 50-day track to solve a problem he faced firsthand: the lack of a structured, free resource for Snowflake certification prep. Follow him on LinkedIn for more data engineering insights.

Snowflake File Formats & COPY INTO for Bulk Loading

Today’s Concept

Micro-Concept 1: File Formats and the FILE FORMAT Object

Micro-Concept 2: The File Format Options That Get Tested

Micro-Concept 3: COPY INTO, the Load Command

Micro-Concept 4: ON_ERROR and Its Two Defaults

Micro-Concept 5: Load History and the 64-Day Window

Micro-Concept 6: PURGE and FORCE

Micro-Concept 7: Transforming Data During the Load

Cheat Sheet

Exam Tip

Hands-On Lab

Snowflake Docs

External References

Practice Questions

Topics