Snowflake File Formats & COPY INTO for Bulk Loading
Day 29 left the stages full of files. Today reads those files into tables. The command is COPY INTO <table>. One default behind it catches more candidates than any other fact in Domain 3. Miss the ON_ERROR default and you miss the most-repeated loading question on the exam.
Loading introduces six new terms. Each one shows up in exam stems with a meaning narrower than the everyday word suggests. Read this table before the concepts. Two of the rows (load metadata, PURGE) are tested directly.
| Word you know | What it usually means | What it means in Snowflake |
|---|---|---|
| File format | A file type like .docx or .pdf | A named object that stores parsing rules: delimiter, header rows to skip, compression, encoding. You attach it to a load so Snowflake knows how to read the file. |
| COPY INTO | Paste something into a place | The bulk load command. It reads files from a stage, parses them with a file format, then writes rows into a table. The same command unloads in reverse: table to stage. |
| ON_ERROR | A generic “what to do on failure” toggle | The copy option that decides what happens when a row fails to parse. Five values exist. The default is different for bulk loading than for Snowpipe. |
| Load metadata | Notes about a file | Per-table records of which files were already loaded. Snowflake reads it to skip duplicates. It expires after 64 days. |
| PURGE | Delete everything in a sweep | A copy option that deletes the staged files after a successful load. Off by default. One-way: once the file is gone, your replay options shrink. |
| FORCE | Make something happen anyway | A copy option that loads files even when the load metadata says they were already loaded. It can duplicate rows in the target table. |
Why the load-metadata row matters today: the 64-day window is the engine behind three separate exam questions. It explains why a reload returns zero rows. It explains why FORCE exists. It explains why LOAD_UNCERTAIN_FILES exists. Get the window and those three facts fall out of it.
Today’s Concept
Micro-Concept 1: File Formats and the FILE FORMAT Object
Snowflake reads six file types: CSV, JSON, Avro, ORC, Parquet, and XML. CSV is the default. Run COPY INTO with no format specified and Snowflake assumes comma-delimited CSV in UTF-8.
A file format is the set of parsing rules for one of those types. You can write the rules inline on the COPY command. You can also store them once as a named object with CREATE FILE FORMAT. A named format is reusable across many loads. That is the recommended pattern for any load you run more than once.
-- A reusable named format
CREATE FILE FORMAT csv_format
TYPE = CSV
FIELD_DELIMITER = ','
SKIP_HEADER = 1;
-- Same rules written inline, no stored object
COPY INTO my_table
FROM @my_stage
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1);The exam tests recognition of the six types and the two ways to supply a format. A common stem hands you a JSON file and a CSV format object. The correct answer flags the type mismatch. Type and options must match the actual file.
Micro-Concept 2: The File Format Options That Get Tested
A handful of options carry almost all the exam weight. Learn these defaults. The default is what the stem assumes when an option is absent.
| Option | Default | What it controls |
|---|---|---|
FIELD_DELIMITER | , (comma) | The character between fields. Pipe and tab files set this explicitly. |
SKIP_HEADER | 0 | Number of leading lines to skip. Set to 1 for a single header row. |
COMPRESSION | AUTO | Snowflake detects gzip, bzip2, and others by file extension. Set NONE for plain text. |
ERROR_ON_COLUMN_COUNT_MISMATCH | TRUE | Errors when a row’s column count differs from the table. The CSV trap below. |
FIELD_OPTIONALLY_ENCLOSED_BY | none | The quote character around values that contain the delimiter, often '"'. |
The option worth memorising is ERROR_ON_COLUMN_COUNT_MISMATCH. It defaults to TRUE for CSV. A file with the wrong delimiter collapses every row into one column. The column count no longer matches the table. The load fails before a single row lands. Setting the option to FALSE tolerates the mismatch instead of failing.
Micro-Concept 3: COPY INTO, the Load Command
The load direction of COPY INTO takes three things: a target table, a source stage, and a file format. The stage was today’s Day 29 subject. It now becomes the source side of every example.
-- Load every file in the stage path into the table
COPY INTO sales
FROM @my_stage/inbox/
FILE_FORMAT = (FORMAT_NAME = csv_format);
-- Load a named subset of files
COPY INTO sales
FROM @my_stage
FILES = ('jan.csv', 'feb.csv')
FILE_FORMAT = (FORMAT_NAME = csv_format);Two filters narrow which files load. FILES takes an explicit list, up to 1000 names. PATTERN takes a regular expression. The documentation advises against using both together. When both appear, only the FILES list loads.
Micro-Concept 4: ON_ERROR and Its Two Defaults
ON_ERROR decides what happens when a row fails to parse. Five values are valid:
→ ABORT_STATEMENT stops the whole load on the first bad row.
→ CONTINUE loads the good rows and reports the bad ones.
→ SKIP_FILE skips the entire file containing any error.
→ SKIP_FILE_n skips the file once its error count reaches n.
→ SKIP_FILE_n% skips the file once its error percentage reaches n.
The tested fact is the default. For bulk COPY INTO <table> the default is ABORT_STATEMENT. For Snowpipe (Day 32) the default flips to SKIP_FILE. Same option, two defaults, depending on how the load runs. In training sessions I have run, this single distinction trips more candidates than any other copy option.
The reasoning behind the split is practical. A bulk load is a deliberate, one-shot operation where a silent partial load would corrupt downstream tables. A Snowpipe load runs continuously on files you do not inspect first. Skipping one bad file keeps the pipe flowing. The exam states the rule plainly: bulk defaults to abort, Snowpipe defaults to skip.
Micro-Concept 5: Load History and the 64-Day Window
Snowflake records which files it has already loaded into each table. This load metadata lives in the target table’s metadata. It is what stops a second COPY INTO from loading the same file twice. Re-run an identical load and Snowflake reports zero files processed. The dedup is automatic.
The metadata expires after 64 days. Past that window the load status of an old file becomes uncertain. Snowflake’s response is to skip the file rather than risk a duplicate. Two conditions both have to be true for a file to count as uncertain. The file’s last-modified date is older than 64 days. The table’s first load also happened more than 64 days ago.
Two copy options override the skip. LOAD_UNCERTAIN_FILES = TRUE loads files whose status can no longer be determined. FORCE = TRUE loads every file regardless of status. The narrower tool is LOAD_UNCERTAIN_FILES. It touches only the expired files. FORCE reloads everything, including files that are clearly already loaded. Practice Q3 below tests both as the answer to a single multi-select.
One number worth tucking away for the Snowpipe day: Snowpipe stores its load history in the pipe object for 14 days, not 64. The 64-day figure is bulk loading only. The exam writes both numbers as distractors in the same question.
Micro-Concept 6: PURGE and FORCE
Two copy options manage the files after a load. They are easy to confuse. They do opposite jobs.
| Option | Default | Effect |
|---|---|---|
PURGE = TRUE | FALSE | Deletes the staged files after a successful load. If the delete itself fails, no error is raised. |
FORCE = TRUE | FALSE | Loads files even when load metadata marks them as already loaded. Can duplicate rows. |
PURGE is the stage-cleanup option. It keeps an internal stage from filling up over months of loads. The catch is silence: a failed purge returns no error. The documented safety check is to LIST the stage afterward and confirm the files are gone. In production projects I have worked on, FORCE = TRUE is the fastest way to silently duplicate millions of rows. It bypasses the exact dedup that load metadata provides.
Micro-Concept 7: Transforming Data During the Load
A COPY INTO can run a limited transformation by reading from a SELECT instead of a plain stage. This avoids a staging table for simple reshaping. The supported set is narrow.
Supported during a load: reordering columns, omitting columns, casting data types, truncating long strings, and applying scalar functions or scalar SQL UDFs. The source file and the target table need not share column count or order.
Not supported, and a frequent trap: WHERE filtering, JOIN, GROUP BY or any aggregate, FLATTEN, and the ORDER BY, LIMIT, TOP keywords. The DISTINCT keyword is not fully supported either. VALIDATION_MODE also stops working once a transformation is present.
-- Allowed: reorder, cast, and omit columns from a SELECT
COPY INTO trimmed_orders (order_id, status, amount)
FROM (
SELECT $1::INT, $3::STRING, $4::FLOAT
FROM @my_stage/orders/
)
FILE_FORMAT = (FORMAT_NAME = csv_format);
-- Rejected: a WHERE filter during load raises an error
-- COPY INTO trimmed_orders FROM (SELECT $1 FROM @my_stage WHERE $1 > 100) ...The exam phrases this as “which transformation can be applied during a load.” The wrong answers list JOIN, GROUP BY, or a WHERE filter. The right answers are reorder, cast, and omit. If a stem needs a join or an aggregate, the answer is to load first, then transform in SQL.
Cheat Sheet
| Concept | What to remember | Exam keyword |
|---|---|---|
| File types | CSV, JSON, Avro, ORC, Parquet, XML. CSV is the default | “six file formats” |
| Format object | CREATE FILE FORMAT for reuse, or inline TYPE = ... on COPY | “FORMAT_NAME vs inline” |
| ERROR_ON_COLUMN_COUNT_MISMATCH | Default TRUE for CSV. Wrong delimiter fails the load | “column count mismatch” |
| ON_ERROR values | CONTINUE, SKIP_FILE, SKIP_FILE_n, SKIP_FILE_n%, ABORT_STATEMENT | “five ON_ERROR options” |
| ON_ERROR default (bulk) | ABORT_STATEMENT. Aborts on the first error | “bulk = ABORT_STATEMENT” |
| ON_ERROR default (Snowpipe) | SKIP_FILE. Keeps the pipe flowing | “Snowpipe = SKIP_FILE” |
| Load metadata window | 64 days in the target table. Reload of a known file = 0 rows | “64 days” |
| Snowpipe load history | 14 days in the pipe object, not 64 | “pipe = 14 days” |
| LOAD_UNCERTAIN_FILES | Loads only files whose status expired. Default FALSE | “uncertain files” |
| FORCE | Loads all files regardless of history. Can duplicate rows | “FORCE = TRUE” |
| PURGE | Deletes staged files after a successful load. Failed purge is silent | “PURGE = TRUE” |
| Transform allowed | Reorder, omit, cast, truncate, scalar functions | “reorder, cast, omit” |
| Transform blocked | WHERE, JOIN, GROUP BY, FLATTEN, ORDER BY, LIMIT, DISTINCT | “no JOIN or aggregate” |
Exam Tip
Domain 3 loading questions cluster around three patterns. Each maps to one fact you can state in a sentence.
First: “What is the default ON_ERROR?” The trap is that the answer depends on the load type. Bulk COPY INTO defaults to ABORT_STATEMENT. Snowpipe defaults to SKIP_FILE. A stem that says “bulk loading” wants ABORT_STATEMENT. A stem that says “continuous” or “pipe” wants SKIP_FILE. Read for that word.
Second: “How do I reload a file after 64 days?” Two options work: LOAD_UNCERTAIN_FILES = TRUE and FORCE = TRUE. This is usually a multi-select expecting both. The distractors are “re-stage the file” and “truncate the table,” neither of which resets the file-level load history.
Third: “Which transformation runs during a load?” The answer set is reorder, omit, cast, truncate. The wrong answers are JOIN, GROUP BY, and a WHERE filter. None of those run inside a COPY. If the question needs them, the load happens first and the transform happens after.
Hands-On Lab
lab_xs warehouse (Day 1) and day10_orders table (Day 10). Both should still be present. This lab only reads day10_orders; it is never modified.Set up the format, target table, and a stage. This lab is self-contained. It generates its own source files, so no manual upload is needed.
USE WAREHOUSE lab_xs;
CREATE OR REPLACE FILE FORMAT csv_format
TYPE = CSV
FIELD_DELIMITER = ','
SKIP_HEADER = 1
COMPRESSION = NONE;
CREATE OR REPLACE TABLE load_test (id INT, name STRING, amount FLOAT);
CREATE OR REPLACE STAGE day30_stage
ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');COMPRESSION = NONE keeps the generated files as plain text so you can inspect them. The SNOWFLAKE_SSE setting on the stage is the Day 29 habit carried forward.Generate real source files by unloading from day10_orders. This produces an actual CSV in the stage. It reads the persistent table without changing it.
-- Unload 500 rows as a single CSV with a header line
COPY INTO @day30_stage/load/
FROM (SELECT o_orderkey, o_orderstatus, o_totalprice
FROM day10_orders LIMIT 500)
FILE_FORMAT = (FORMAT_NAME = csv_format)
HEADER = TRUE
SINGLE = TRUE
OVERWRITE = TRUE;
LIST @day30_stage/load/;LIST shows one file under the load/ path. HEADER = TRUE wrote a header row. The format’s SKIP_HEADER = 1 skips it on the way back in. The three unloaded columns map to load_test as id, name, and amount.Run the first load and confirm the row count.
COPY INTO load_test
FROM @day30_stage/load/
FILE_FORMAT = (FORMAT_NAME = csv_format);
SELECT COUNT(*) AS rows_after_first_load FROM load_test;rows_after_first_load returns 500. The status column reads LOADED.Re-run the identical load and watch load metadata block it. This is the 64-day dedup in action.
COPY INTO load_test
FROM @day30_stage/load/
FILE_FORMAT = (FORMAT_NAME = csv_format);
SELECT COUNT(*) AS rows_after_reload FROM load_test;rows_after_reload is still 500. The file was already loaded, so the load metadata skipped it. No duplicate landed.Force a reload and watch the rows duplicate. This is exactly the production risk from Micro-Concept 6.
COPY INTO load_test
FROM @day30_stage/load/
FILE_FORMAT = (FORMAT_NAME = csv_format)
FORCE = TRUE;
SELECT COUNT(*) AS rows_after_force FROM load_test;FORCE = TRUE bypasses the dedup. The same 500 rows load a second time. rows_after_force returns 1000. This is the silent-duplication trap made concrete.Trigger the ON_ERROR default with a deliberately wrong delimiter. A pipe format on a comma file collapses each row into one column.
CREATE OR REPLACE FILE FORMAT pipe_format
TYPE = CSV
FIELD_DELIMITER = '|'
SKIP_HEADER = 1;
-- Default ON_ERROR = ABORT_STATEMENT: the whole load aborts
COPY INTO load_test
FROM @day30_stage/load/
FILE_FORMAT = (FORMAT_NAME = pipe_format)
FORCE = TRUE;
-- Error: column count mismatch, statement aborted
-- ON_ERROR = CONTINUE: bad rows reported, statement does not abort
COPY INTO load_test
FROM @day30_stage/load/
FILE_FORMAT = (FORMAT_NAME = pipe_format)
ON_ERROR = CONTINUE
FORCE = TRUE;ABORT_STATEMENT, the bulk default, with nothing set. The second statement runs to completion and reports the errors instead. The only change between them is the explicit ON_ERROR = CONTINUE.Demonstrate PURGE, then clean up. A forced load with PURGE deletes the staged file after it succeeds.
-- FORCE makes it load again; PURGE deletes the file after success
COPY INTO load_test
FROM @day30_stage/load/
FILE_FORMAT = (FORMAT_NAME = csv_format)
FORCE = TRUE
PURGE = TRUE;
LIST @day30_stage/load/; -- the file is gone
-- Cleanup
DROP TABLE IF EXISTS load_test;
DROP FILE FORMAT IF EXISTS csv_format;
DROP FILE FORMAT IF EXISTS pipe_format;
DROP STAGE IF EXISTS day30_stage;
-- Confirm the persistent table is untouched (Day 31, 32, 35 use it)
SELECT COUNT(*) AS row_count FROM day10_orders;PURGE load, LIST returns no files. The stage path is empty. The final SELECT COUNT(*) on day10_orders returns roughly 1.5 million rows, confirming the persistent Day 10 table is intact for tomorrow.Snowflake Docs
Authoritative references for every fact in today’s post. The COPY INTO <table> page and the load-metadata page are the two worth re-reading the day before the exam.
External References
Two Snowflake guides that go deeper on load transformations and file preparation.
Practice Questions
Options:
A. ABORT_STATEMENT
B. SKIP_FILE
C. CONTINUE
D. SKIP_FILE_3
Why A: For bulk loading with COPY INTO <table>, the documented default is ABORT_STATEMENT. It aborts the entire COPY operation on the first error encountered in any file. Nothing loads when an error is hit.
Why not B: SKIP_FILE is the default for Snowpipe, not for bulk loading. The question specifies bulk COPY INTO, so this is the classic load-type trap.
Why not C: CONTINUE is a valid value but not a default. It must be set explicitly to load good rows and report the bad ones.
Why not D: SKIP_FILE_3 is a valid value, skipping a file once it hits three error rows. It is never a default.
Options:
A. FORCE
B. FILES
C. PURGE
D. OVERWRITE
Why C: PURGE = TRUE instructs Snowflake to delete the staged files after they load successfully into the target table. It is the standard way to clean up source files post-load.
Why not A: FORCE reloads files regardless of load history. It never deletes anything.
Why not B: FILES specifies an explicit list of files to load. It controls input selection, not cleanup.
Why not D: OVERWRITE applies to COPY INTO <location> for unloading. It does not delete files after a load.
Options:
A. Set the LOAD_UNCERTAIN_FILES option to TRUE.
B. Remove the files from the stage and reload them.
C. Set the FORCE parameter to TRUE.
D. Truncate the target table and reload the files.
E. Use the ALTER TABLE...REFRESH command.
Why A and C: Snowflake keeps per-file load history for 64 days, after which the metadata expires. Two copy options reload such files. LOAD_UNCERTAIN_FILES = TRUE loads files whose status can no longer be determined. FORCE = TRUE loads files regardless of whether they were loaded before.
Why not B: Removing and re-staging the file does not reset the file-level load history that COPY checks. The status stays as it was.
Why not D: Truncating the target table clears the rows, not the file-load metadata. The COPY still skips the expired file.
Why not E: ALTER TABLE...REFRESH applies to external tables, not to stage-load metadata. It has no effect here.
Options:
A. In the metadata of the target table for 64 days.
B. In the metadata of the pipe for 14 days.
C. In the metadata of the target table for 14 days.
D. In the ACCOUNT_USAGE schema for 365 days.
Why A: Bulk load history lives in the target table’s metadata, where it is retained for 64 days. Within that window, COPY can tell whether a file was already loaded and skip the duplicate. After 64 days the load status becomes uncertain.
Why not B: The pipe and the 14-day figure describe Snowpipe, not bulk loading. This is the standard bulk-versus-Snowpipe distractor pair.
Why not C: The location is right but the duration is wrong. 14 days is the Snowpipe number, not the 64-day bulk number.
Why not D: File-level load history is not the same as the ACCOUNT_USAGE.COPY_HISTORY view (Day 27). The dedup metadata is table-level and capped at 64 days.
Options:
A. Reorder and omit columns using a SELECT.
B. Filter rows with a WHERE clause.
C. Cast a column to a different data type.
D. Join two staged files together.
E. Aggregate rows with GROUP BY.
Why A and C: COPY transformations support a narrow set of reshaping: reordering columns, omitting columns, casting types, truncating strings, and scalar functions. Reading from a SELECT over the stage is how you apply them. The source file need not match the table’s column count or order.
Why not B: A WHERE filter is not supported during a load. The same applies to ORDER BY, LIMIT, and TOP.
Why not D: JOIN syntax is not supported in a COPY transformation. Combine files after loading instead.
Why not E: GROUP BY and any aggregate are blocked, as is FLATTEN. Load first, then aggregate in SQL.
Today you learned: COPY INTO <table> reads staged files into a table using a file format. Snowflake supports six file types, with CSV as the default. ON_ERROR defaults to ABORT_STATEMENT for bulk loading and SKIP_FILE for Snowpipe. Load history sits in the target table for 64 days and prevents duplicate loads. After 64 days a file’s status becomes uncertain. LOAD_UNCERTAIN_FILES = TRUE or FORCE = TRUE reloads it. PURGE = TRUE deletes staged files after a successful load. A load can reorder, omit, and cast columns, but it cannot filter, join, or aggregate.
Key takeaway: Three facts carry most of the Day 30 questions. The ON_ERROR default splits by load type (bulk aborts, Snowpipe skips). The 64-day window drives FORCE and LOAD_UNCERTAIN_FILES. Load transformations are reshaping only, never filtering or joining. Get those three and the rest is option syntax.
Tomorrow (Day 31): Data unloading and Parquet type mapping. We reverse the direction with COPY INTO <location>. The options that matter are single versus multiple output files (SINGLE, MAX_FILE_SIZE) and the HEADER flag. One Parquet fact gets its own exam question: fixed-point columns unload to DECIMAL, not DOUBLE. The day10_orders table becomes the source we unload from.
