Snowflake Semi-Structured Data – VARIANT, FLATTEN & JSON Paths
Day 41 closed with one rule to carry forward. A JSON column name is case-insensitive, but a JSON key is case-sensitive. Today that rule earns its place. Semi-structured data is JSON, XML, Avro, Parquet, and ORC stored without a fixed schema. Snowflake holds it in the VARIANT data type. You load JSON with PARSE_JSON, read nested keys with colon and dot notation, then cast values with the double-colon operator. After that comes FLATTEN, the table function that turns an array into one row per element. LATERAL FLATTEN is the pattern the exam tests most. The post ends with the line between semi-structured data and unstructured files, which the exam keeps in separate answers.
This sub-objective brings JSON vocabulary into SQL. Each term sounds informal but has a precise meaning the exam holds you to.
| Term | What it sounds like | What it means in Snowflake |
|---|---|---|
| VARIANT | A flexible variable | A single column type that stores JSON, XML, Avro, Parquet, or ORC natively, keeping its nested shape. |
| PARSE_JSON | Read some JSON | A function that turns a JSON string into a VARIANT value you can query. |
| Colon operator | A list separator | Inside a VARIANT, it reaches a first-level key. data:name reads the name key. |
| Double colon | Two colons | Casts a VARIANT value to a real SQL type. data:age::INT returns an integer, not a VARIANT. |
| FLATTEN | Make something flat | A table function that explodes an array or object into one row per element. |
| LATERAL | Sideways | A join keyword that lets FLATTEN read the VARIANT column from each row to its left. |
The VARIANT Column and Path Notation
What VARIANT stores
A VARIANT is one column type that holds semi-structured data natively. JSON, XML, Avro, Parquet, and ORC all go into it. The data keeps its keys, arrays, and nesting. You do not define a column for each key in advance.
That flexibility is the point of the type. A single VARIANT cell can hold an object, and inside it an array, and inside that more objects. The schema lives in the data, not in the table definition.
Load JSON with PARSE_JSON
PARSE_JSON reads a string and returns a VARIANT. The text goes in as a literal and comes back as queryable structure. PARSE_JSON raises an error on invalid JSON. TRY_PARSE_JSON returns NULL instead, so one bad row does not stop a load.
One detail gets tested. PARSE_JSON of an empty string returns NULL, not an error. A string of literal null becomes a JSON null value, which is not the same as SQL NULL. These two NULL kinds are a known exam trap in Domain 4.
Read keys with colon, dot, and bracket
The colon reaches a first-level key: data:name. A dot or a bracket goes one level deeper: data:address.city or data:address['city']. All three notations read the same VARIANT. The result of a path is itself a VARIANT until you cast it.
The case rule decides path-equivalence questions. The column name is case-insensitive, so data:name and DATA:name are the same path. The key is case-sensitive, so data:Name is a different key. This is the most-tested point in the whole sub-objective.
Cast with the double colon
A path result is not a number or a string until you cast it. The double-colon operator casts: data:age::INT. Without the cast, the value stays VARIANT, and string values keep their JSON quotes in the output. Comparisons and joins behave correctly only on the cast value.
| Goal | Syntax | Result type |
|---|---|---|
| Reach a first-level key | data:name | VARIANT |
| Reach a nested key with a dot | data:address.city | VARIANT |
| Reach a nested key with a bracket | data:address['city'] | VARIANT |
| Cast a value to a SQL type | data:age::INT | INTEGER |
FLATTEN: Arrays Into Rows
Why FLATTEN exists
A VARIANT can hold an array inside one row. SQL reports work in rows, not in arrays held inside a cell. FLATTEN is the table function that splits an array into one row per element. Pair it with LATERAL and it reads the array from each source row.
The six output columns
FLATTEN returns a fixed set of columns. VALUE is the one you select most of the time. It holds each element. The others describe where the element sits.
| Column | What it holds |
|---|---|
| SEQ | A sequence number for the input row, not guaranteed gap-free |
| KEY | The key, for objects. NULL when the input is an array |
| PATH | The path to this element inside the structure |
| INDEX | The array position. NULL when the input is not an array |
| VALUE | The element itself. This is the column you select |
| THIS | The element being flattened, used in recursive flattening |
LATERAL and the OUTER default
LATERAL lets FLATTEN see the VARIANT column from each row to its left. Without LATERAL, a function in the FROM clause cannot reach a column from an earlier entry. By default OUTER is FALSE. A row whose array is empty or missing produces no output row. Set OUTER => TRUE to keep one row with NULL values instead.
SELECT data:name::STRING AS name,
f.value::STRING AS skill
FROM json_lab,
LATERAL FLATTEN(input => data:skills) f;Nested arrays need a second LATERAL FLATTEN
An array inside an array needs two FLATTEN calls. The second reads the output of the first. LATERAL is required, because the inner FLATTEN references o.value:items from the outer one. Chain the calls in the FROM clause, each reading the previous result.
SELECT o.value:id::INT AS order_id,
i.value:p::STRING AS product
FROM json_lab,
LATERAL FLATTEN(input => data:orders) o,
LATERAL FLATTEN(input => o.value:items) i;When to flatten on load
Querying a VARIANT casts values at run time. That cost is small for mostly regular data. For data with dates, timestamps, or arrays, flatten the keys into typed columns when you load. Typed columns prune better and store smaller. The exam phrases this as the best choice for query performance on date-and-array JSON.
Semi-Structured Is Not Unstructured
Semi-structured data has internal structure: keys, arrays, and nesting. It lives in a VARIANT column and you read it with paths and FLATTEN. Unstructured data is files with no queryable structure, such as PDFs, images, and audio. Snowflake does not parse those into a VARIANT.
Unstructured files sit on a stage. A directory table catalogues them and serves file URLs. A directory table stores file metadata such as the relative path, the file size, and the last-modified time. The exam keeps semi-structured and unstructured in separate answer choices, so read which one the stem describes.
| Aspect | Semi-structured | Unstructured |
|---|---|---|
| Examples | JSON, XML, Avro, Parquet, ORC | PDF, image, audio, video |
| Stored as | A VARIANT column | Files on a stage |
| Catalogued by | Table columns | A directory table |
| Read with | PARSE_JSON, paths, FLATTEN | File URLs and the FILE type |
Cheat Sheet
| Concept | What to remember | Exam keyword |
|---|---|---|
| VARIANT | One column type for JSON, XML, Avro, Parquet, and ORC. Keeps nesting | “stores semi-structured” |
| PARSE_JSON vs TRY | PARSE_JSON errors on bad JSON. TRY_PARSE_JSON returns NULL instead | “TRY returns NULL” |
| Empty string | PARSE_JSON of an empty string returns NULL, not an error | “empty is NULL” |
| Colon operator | Reaches a first-level key inside a VARIANT, as in data:key | “colon for path” |
| Dot vs bracket | data:a.b and data:a[‘b’] reach the same nested key | “same path” |
| Double colon | Casts a VARIANT value to a SQL type, as in data:age::INT | “cast with ::” |
| Case rule | Column name case-insensitive. JSON key case-sensitive | “key is case-sensitive” |
| FLATTEN VALUE | VALUE holds each element. INDEX holds the array position | “select VALUE” |
| LATERAL | Lets FLATTEN read the VARIANT from each left-side row | “LATERAL FLATTEN” |
| OUTER default | OUTER is FALSE. An empty or missing array drops the row | “OUTER FALSE drops” |
Exam Tip
Path equivalence turns on key case, never column case. A question shows several colon paths and asks which match. Ignore the case of the column. Check the case of every key. So data:customer.email equals SRC:customer.email, while data:Customer.Email is a different path.
For FLATTEN, the column that returns each array element is VALUE. A choice naming INDEX is the trap, because INDEX returns the position, not the element. KEY is NULL for arrays.
A nested-array question needs two FLATTEN calls with LATERAL. An option that drops LATERAL fails, because the inner FLATTEN cannot see the outer column. Same idea as the lateral join you saw building the lab.
A load that must survive a few malformed rows uses TRY_PARSE_JSON. PARSE_JSON aborts on the first invalid record. A stem about PDFs, images, or audio is unstructured data, so the answer involves a stage and a directory table, not a VARIANT column.
Hands-On Lab
lab_xs warehouse. This lab creates one table, json_lab, and drops it at the end. It does not touch any shared sample data.Create the table and load one JSON object. PARSE_JSON turns the string into a VARIANT.
USE WAREHOUSE lab_xs;
CREATE OR REPLACE TABLE json_lab (data VARIANT);
INSERT INTO json_lab
SELECT PARSE_JSON('{"name":"Alice","age":30,"skills":["SQL","Python"],"address":{"city":"NYC"}}');Read scalar keys and cast them. The colon reaches the key, the double colon casts the value.
SELECT data:name::STRING AS name,
data:age::INT AS age,
data:address.city::STRING AS city
FROM json_lab;::STRING from name and rerun. The value keeps its JSON quotes, because an uncast path stays VARIANT.Explode the skills array with LATERAL FLATTEN. One row becomes one row per skill.
SELECT data:name::STRING AS name,
f.value::STRING AS skill
FROM json_lab,
LATERAL FLATTEN(input => data:skills) f;Load a nested document and flatten twice. The inner FLATTEN reads the output of the outer one.
INSERT INTO json_lab
SELECT PARSE_JSON('{"orders":[{"id":1,"items":[{"p":"A","q":2},{"p":"B","q":1}]}]}');
SELECT o.value:id::INT AS order_id,
i.value:p::STRING AS product
FROM json_lab,
LATERAL FLATTEN(input => data:orders) o,
LATERAL FLATTEN(input => o.value:items) i;o.value:items, which is why LATERAL is required.See TRY_PARSE_JSON return NULL for a bad string. The second string is missing a closing brace.
SELECT PARSE_JSON('{"ok":1}') AS good,
TRY_PARSE_JSON('{"bad":1') AS safe;Cleanup. Drop the lab table and suspend the shared warehouse.
DROP TABLE json_lab;
ALTER WAREHOUSE lab_xs SUSPEND;
-- Keep lab_xs itself. It is the shared warehouse reused across days.Practice Questions
Options:
A. src[‘customer’][‘EMAIL’]
B. src[‘CUSTOMER’][‘Email’]
C. SRC:Customer.Email
D. src:customer.email
E. SRC:customer.email
Why D and E: The column name is case-insensitive, so src and SRC point at the same column. Both use the keys customer and email in identical case, so they resolve to one path.
Why not A: EMAIL differs in case from email. JSON keys are case-sensitive, so this is a different element.
Why not B: CUSTOMER and Email differ in case from customer and email, which makes a separate path.
Why not C: Customer and Email again differ in case from the lowercase keys, so this is a third distinct path.
Options:
A. KEY
B. INDEX
C. VALUE
D. PATH
Why C: VALUE holds the element produced for each output row. For an array, that is each item in turn.
Why not A: KEY holds the key name for objects. For an array element it is NULL.
Why not B: INDEX holds the array position, such as 0 or 1, not the element itself.
Why not D: PATH holds the location of the element inside the structure, not its content.
Options:
A. LATERAL sorts the rows before flattening
B. The second FLATTEN references the output of the first, which only LATERAL exposes
C. LATERAL casts the VARIANT to a string automatically
D. Without LATERAL, FLATTEN cannot produce the VALUE column
Why B: The inner FLATTEN reads o.value:items from the first call. LATERAL is what lets a later entry in the FROM clause reference an earlier one.
Why not A: LATERAL does not sort anything. Order comes from ORDER BY.
Why not C: No automatic cast happens. You cast a value yourself with the double colon.
Why not D: FLATTEN produces VALUE with or without LATERAL. LATERAL governs cross-references, not the output columns.
Options:
A. data:age
B. data.age
C. data:age::INT
D. data[‘age’]
Why C: The colon reaches the age key, and the double colon casts the value to INTEGER. Without the cast the value stays VARIANT.
Why not A: data:age returns a VARIANT, not an INTEGER. A numeric comparison may not behave as intended.
Why not B: A leading dot does not reach a top-level key. The colon does that job.
Why not D: data[‘age’] also returns a VARIANT. It reaches the key but performs no cast.
Options:
A. PARSE_JSON, because it ignores malformed input
B. TRY_PARSE_JSON, because it returns NULL on a parse error
C. CHECK_JSON, because it parses the valid rows
D. TO_JSON, because it converts strings to VARIANT
Why B: TRY_PARSE_JSON returns NULL when parsing fails, so a malformed row does not stop the statement.
Why not A: PARSE_JSON raises an error on invalid JSON. One bad row aborts the load.
Why not C: CHECK_JSON validates input and returns an error message for bad JSON. It does not return the parsed VARIANT.
Why not D: TO_JSON works the other way. It turns a VARIANT into a string.
Snowflake Documentation
Official docs for today’s topics. The exam pulls directly from these.
Today you learned: how Snowflake stores semi-structured data in a VARIANT column. PARSE_JSON loads a JSON string, and TRY_PARSE_JSON returns NULL on bad input. The colon reaches a first-level key, a dot or bracket goes deeper, and the double colon casts a value to a SQL type. The column name is case-insensitive, but a JSON key is case-sensitive. FLATTEN splits an array into one row per element, with VALUE holding each element. LATERAL FLATTEN reads the array from each source row, and a nested array needs a second LATERAL FLATTEN. By default OUTER is FALSE, so an empty or missing array drops the row.
Key takeaway: for a path-equivalence question, ignore the column case and check the key case. For top performance on JSON full of dates and arrays, flatten the keys into typed columns on load. Semi-structured data lives in a VARIANT column. Unstructured files live on a stage with a directory table. The exam treats those as separate answers.
Tomorrow (Day 43): Week 6 Recap with a mixed Domain 4 quiz. You revisit Query Profile, the three caches, Search Optimization, clustering, materialized views, window functions, QUALIFY, and FLATTEN. A set of ten mixed questions then tests the whole domain.
