D4: Performance & Querying Week 6

DAY 42

Snowflake Semi-Structured Data – VARIANT, FLATTEN & JSON Paths

Day 41 closed with one rule to carry forward. A JSON column name is case-insensitive, but a JSON key is case-sensitive. Today that rule earns its place. Semi-structured data is JSON, XML, Avro, Parquet, and ORC stored without a fixed schema. Snowflake holds it in the VARIANT data type. You load JSON with PARSE_JSON, read nested keys with colon and dot notation, then cast values with the double-colon operator. After that comes FLATTEN, the table function that turns an array into one row per element. LATERAL FLATTEN is the pattern the exam tests most. The post ends with the line between semi-structured data and unstructured files, which the exam keeps in separate answers.

🗣️ Plain-English First

This sub-objective brings JSON vocabulary into SQL. Each term sounds informal but has a precise meaning the exam holds you to.

Term	What it sounds like	What it means in Snowflake
VARIANT	A flexible variable	A single column type that stores JSON, XML, Avro, Parquet, or ORC natively, keeping its nested shape.
PARSE_JSON	Read some JSON	A function that turns a JSON string into a VARIANT value you can query.
Colon operator	A list separator	Inside a VARIANT, it reaches a first-level key. `data:name` reads the name key.
Double colon	Two colons	Casts a VARIANT value to a real SQL type. `data:age::INT` returns an integer, not a VARIANT.
FLATTEN	Make something flat	A table function that explodes an array or object into one row per element.
LATERAL	Sideways	A join keyword that lets FLATTEN read the VARIANT column from each row to its left.

📘

The VARIANT Column and Path Notation

What VARIANT stores

A VARIANT is one column type that holds semi-structured data natively. JSON, XML, Avro, Parquet, and ORC all go into it. The data keeps its keys, arrays, and nesting. You do not define a column for each key in advance.

That flexibility is the point of the type. A single VARIANT cell can hold an object, and inside it an array, and inside that more objects. The schema lives in the data, not in the table definition.

Load JSON with PARSE_JSON

PARSE_JSON reads a string and returns a VARIANT. The text goes in as a literal and comes back as queryable structure. PARSE_JSON raises an error on invalid JSON. TRY_PARSE_JSON returns NULL instead, so one bad row does not stop a load.

One detail gets tested. PARSE_JSON of an empty string returns NULL, not an error. A string of literal null becomes a JSON null value, which is not the same as SQL NULL. These two NULL kinds are a known exam trap in Domain 4.

Read keys with colon, dot, and bracket

The colon reaches a first-level key: data:name. A dot or a bracket goes one level deeper: data:address.city or data:address['city']. All three notations read the same VARIANT. The result of a path is itself a VARIANT until you cast it.

The case rule decides path-equivalence questions. The column name is case-insensitive, so data:name and DATA:name are the same path. The key is case-sensitive, so data:Name is a different key. This is the most-tested point in the whole sub-objective.

Cast with the double colon

A path result is not a number or a string until you cast it. The double-colon operator casts: data:age::INT. Without the cast, the value stays VARIANT, and string values keep their JSON quotes in the output. Comparisons and joins behave correctly only on the cast value.

Goal	Syntax	Result type
Reach a first-level key	`data:name`	VARIANT
Reach a nested key with a dot	`data:address.city`	VARIANT
Reach a nested key with a bracket	`data:address['city']`	VARIANT
Cast a value to a SQL type	`data:age::INT`	INTEGER

🧩

FLATTEN: Arrays Into Rows

Why FLATTEN exists

A VARIANT can hold an array inside one row. SQL reports work in rows, not in arrays held inside a cell. FLATTEN is the table function that splits an array into one row per element. Pair it with LATERAL and it reads the array from each source row.

The six output columns

FLATTEN returns a fixed set of columns. VALUE is the one you select most of the time. It holds each element. The others describe where the element sits.

Column	What it holds
SEQ	A sequence number for the input row, not guaranteed gap-free
KEY	The key, for objects. NULL when the input is an array
PATH	The path to this element inside the structure
INDEX	The array position. NULL when the input is not an array
VALUE	The element itself. This is the column you select
THIS	The element being flattened, used in recursive flattening

LATERAL and the OUTER default

LATERAL lets FLATTEN see the VARIANT column from each row to its left. Without LATERAL, a function in the FROM clause cannot reach a column from an earlier entry. By default OUTER is FALSE. A row whose array is empty or missing produces no output row. Set OUTER => TRUE to keep one row with NULL values instead.

SQL

SELECT data:name::STRING AS name,
       f.value::STRING  AS skill
FROM json_lab,
     LATERAL FLATTEN(input => data:skills) f;

Nested arrays need a second LATERAL FLATTEN

An array inside an array needs two FLATTEN calls. The second reads the output of the first. LATERAL is required, because the inner FLATTEN references o.value:items from the outer one. Chain the calls in the FROM clause, each reading the previous result.

SQL

SELECT o.value:id::INT     AS order_id,
       i.value:p::STRING   AS product
FROM json_lab,
     LATERAL FLATTEN(input => data:orders) o,
     LATERAL FLATTEN(input => o.value:items) i;

When to flatten on load

Querying a VARIANT casts values at run time. That cost is small for mostly regular data. For data with dates, timestamps, or arrays, flatten the keys into typed columns when you load. Typed columns prune better and store smaller. The exam phrases this as the best choice for query performance on date-and-array JSON.

🗂️

Semi-Structured Is Not Unstructured

Semi-structured data has internal structure: keys, arrays, and nesting. It lives in a VARIANT column and you read it with paths and FLATTEN. Unstructured data is files with no queryable structure, such as PDFs, images, and audio. Snowflake does not parse those into a VARIANT.

Unstructured files sit on a stage. A directory table catalogues them and serves file URLs. A directory table stores file metadata such as the relative path, the file size, and the last-modified time. The exam keeps semi-structured and unstructured in separate answer choices, so read which one the stem describes.

Aspect	Semi-structured	Unstructured
Examples	JSON, XML, Avro, Parquet, ORC	PDF, image, audio, video
Stored as	A VARIANT column	Files on a stage
Catalogued by	Table columns	A directory table
Read with	PARSE_JSON, paths, FLATTEN	File URLs and the FILE type

⚡

Cheat Sheet

Concept	What to remember	Exam keyword
VARIANT	One column type for JSON, XML, Avro, Parquet, and ORC. Keeps nesting	“stores semi-structured”
PARSE_JSON vs TRY	PARSE_JSON errors on bad JSON. TRY_PARSE_JSON returns NULL instead	“TRY returns NULL”
Empty string	PARSE_JSON of an empty string returns NULL, not an error	“empty is NULL”
Colon operator	Reaches a first-level key inside a VARIANT, as in data:key	“colon for path”
Dot vs bracket	data:a.b and data:a[‘b’] reach the same nested key	“same path”
Double colon	Casts a VARIANT value to a SQL type, as in data:age::INT	“cast with ::”
Case rule	Column name case-insensitive. JSON key case-sensitive	“key is case-sensitive”
FLATTEN VALUE	VALUE holds each element. INDEX holds the array position	“select VALUE”
LATERAL	Lets FLATTEN read the VARIANT from each left-side row	“LATERAL FLATTEN”
OUTER default	OUTER is FALSE. An empty or missing array drops the row	“OUTER FALSE drops”

🎯

Exam Tip

🎯 Exam Tip

Path equivalence turns on key case, never column case. A question shows several colon paths and asks which match. Ignore the case of the column. Check the case of every key. So data:customer.email equals SRC:customer.email, while data:Customer.Email is a different path.

For FLATTEN, the column that returns each array element is VALUE. A choice naming INDEX is the trap, because INDEX returns the position, not the element. KEY is NULL for arrays.

A nested-array question needs two FLATTEN calls with LATERAL. An option that drops LATERAL fails, because the inner FLATTEN cannot see the outer column. Same idea as the lateral join you saw building the lab.

A load that must survive a few malformed rows uses TRY_PARSE_JSON. PARSE_JSON aborts on the first invalid record. A stem about PDFs, images, or audio is unstructured data, so the answer involves a stage and a directory table, not a VARIANT column.

🛠️

Hands-On Lab

Type: LAB (creates and drops one small VARIANT table) | Time: ~10 minutes | Credits: <0.2 | Prerequisite: any edition trial and the shared lab_xs warehouse. This lab creates one table, json_lab, and drops it at the end. It does not touch any shared sample data.

Create the table and load one JSON object. PARSE_JSON turns the string into a VARIANT.

SQL

USE WAREHOUSE lab_xs;

CREATE OR REPLACE TABLE json_lab (data VARIANT);

INSERT INTO json_lab
SELECT PARSE_JSON('{"name":"Alice","age":30,"skills":["SQL","Python"],"address":{"city":"NYC"}}');

👀 Observe: One row now holds a full JSON object as a VARIANT. The skills array and the nested address object both sit inside one cell.

Read scalar keys and cast them. The colon reaches the key, the double colon casts the value.

SQL

SELECT data:name::STRING        AS name,
       data:age::INT            AS age,
       data:address.city::STRING AS city
FROM json_lab;

👀 Observe: name, age, and city return as real SQL types. Remove ::STRING from name and rerun. The value keeps its JSON quotes, because an uncast path stays VARIANT.

Explode the skills array with LATERAL FLATTEN. One row becomes one row per skill.

SQL

SELECT data:name::STRING AS name,
       f.value::STRING   AS skill
FROM json_lab,
     LATERAL FLATTEN(input => data:skills) f;

👀 Observe: Alice’s single row becomes two, one per skill. The VALUE column holds each skill. The name repeats on both rows, since it comes from the unflattened parent.

Load a nested document and flatten twice. The inner FLATTEN reads the output of the outer one.

SQL

INSERT INTO json_lab
SELECT PARSE_JSON('{"orders":[{"id":1,"items":[{"p":"A","q":2},{"p":"B","q":1}]}]}');

SELECT o.value:id::INT    AS order_id,
       i.value:p::STRING  AS product
FROM json_lab,
     LATERAL FLATTEN(input => data:orders) o,
     LATERAL FLATTEN(input => o.value:items) i;

👀 Observe: Two product rows return. The Alice row has no orders key, so it drops out under the default OUTER FALSE. The second FLATTEN reads o.value:items, which is why LATERAL is required.

See TRY_PARSE_JSON return NULL for a bad string. The second string is missing a closing brace.

SQL

SELECT PARSE_JSON('{"ok":1}')      AS good,
       TRY_PARSE_JSON('{"bad":1')  AS safe;

👀 Observe: The first value parses. The second returns NULL rather than an error. Swap TRY_PARSE_JSON for PARSE_JSON on that string and the statement fails on the malformed input.

✓

Cleanup. Drop the lab table and suspend the shared warehouse.

SQL

DROP TABLE json_lab;
ALTER WAREHOUSE lab_xs SUSPEND;
-- Keep lab_xs itself. It is the shared warehouse reused across days.

❓

Practice Questions

Q1: For a VARIANT column named src, which TWO JSON paths are equivalent in Snowflake? (Select TWO.) ▸

Options:

A. src[‘customer’][‘EMAIL’]
B. src[‘CUSTOMER’][‘Email’]
C. SRC:Customer.Email
D. src:customer.email
E. SRC:customer.email

✅ Answer: D, E

Why D and E: The column name is case-insensitive, so src and SRC point at the same column. Both use the keys customer and email in identical case, so they resolve to one path.

Why not A: EMAIL differs in case from email. JSON keys are case-sensitive, so this is a different element.

Why not B: CUSTOMER and Email differ in case from customer and email, which makes a separate path.

Why not C: Customer and Email again differ in case from the lowercase keys, so this is a third distinct path.

Q2: A query runs LATERAL FLATTEN over a JSON array of skills. Which FLATTEN output column returns each individual skill? ▸

Options:

A. KEY
B. INDEX
C. VALUE
D. PATH

✅ Answer: C

Why C: VALUE holds the element produced for each output row. For an array, that is each item in turn.

Why not A: KEY holds the key name for objects. For an array element it is NULL.

Why not B: INDEX holds the array position, such as 0 or 1, not the element itself.

Why not D: PATH holds the location of the element inside the structure, not its content.

Q3: A document holds an array of orders, and each order holds an array of items. A query flattens orders, then flattens each order’s items. Why must the second FLATTEN use LATERAL? ▸

Options:

A. LATERAL sorts the rows before flattening
B. The second FLATTEN references the output of the first, which only LATERAL exposes
C. LATERAL casts the VARIANT to a string automatically
D. Without LATERAL, FLATTEN cannot produce the VALUE column

✅ Answer: B

Why B: The inner FLATTEN reads o.value:items from the first call. LATERAL is what lets a later entry in the FROM clause reference an earlier one.

Why not A: LATERAL does not sort anything. Order comes from ORDER BY.

Why not C: No automatic cast happens. You cast a value yourself with the double colon.

Why not D: FLATTEN produces VALUE with or without LATERAL. LATERAL governs cross-references, not the output columns.

Q4: A VARIANT column data holds {“age”:30}. A query needs age as an INTEGER for a numeric comparison. Which expression returns an INTEGER? ▸

Options:

A. data:age
B. data.age
C. data:age::INT
D. data[‘age’]

✅ Answer: C

Why C: The colon reaches the age key, and the double colon casts the value to INTEGER. Without the cast the value stays VARIANT.

Why not A: data:age returns a VARIANT, not an INTEGER. A numeric comparison may not behave as intended.

Why not B: A leading dot does not reach a top-level key. The colon does that job.

Why not D: data[‘age’] also returns a VARIANT. It reaches the key but performs no cast.

Q5: A load reads thousands of JSON strings, and a few are malformed. The load must skip bad rows as NULL instead of failing. Which function fits? ▸

Options:

A. PARSE_JSON, because it ignores malformed input
B. TRY_PARSE_JSON, because it returns NULL on a parse error
C. CHECK_JSON, because it parses the valid rows
D. TO_JSON, because it converts strings to VARIANT

✅ Answer: B

Why B: TRY_PARSE_JSON returns NULL when parsing fails, so a malformed row does not stop the statement.

Why not A: PARSE_JSON raises an error on invalid JSON. One bad row aborts the load.

Why not C: CHECK_JSON validates input and returns an error message for bad JSON. It does not return the parsed VARIANT.

Why not D: TO_JSON works the other way. It turns a VARIANT into a string.

❄️

Snowflake Documentation

Official docs for today’s topics. The exam pulls directly from these.

❄️

Querying Semi-Structured Data

docs.snowflake.com/en/user-guide/querying-semistructured

→

❄️

Semi-Structured Data Types (VARIANT)

docs.snowflake.com/en/sql-reference/data-types-semistructured

→

❄️

FLATTEN

docs.snowflake.com/en/sql-reference/functions/flatten

→

❄️

PARSE_JSON and TRY_PARSE_JSON

docs.snowflake.com/en/sql-reference/functions/parse_json

→

📝 Recap

Today you learned: how Snowflake stores semi-structured data in a VARIANT column. PARSE_JSON loads a JSON string, and TRY_PARSE_JSON returns NULL on bad input. The colon reaches a first-level key, a dot or bracket goes deeper, and the double colon casts a value to a SQL type. The column name is case-insensitive, but a JSON key is case-sensitive. FLATTEN splits an array into one row per element, with VALUE holding each element. LATERAL FLATTEN reads the array from each source row, and a nested array needs a second LATERAL FLATTEN. By default OUTER is FALSE, so an empty or missing array drops the row.

Key takeaway: for a path-equivalence question, ignore the column case and check the key case. For top performance on JSON full of dates and arrays, flatten the keys into typed columns on load. Semi-structured data lives in a VARIANT column. Unstructured files live on a stage with a directory table. The exam treats those as separate answers.

Tomorrow (Day 43): Week 6 Recap with a mixed Domain 4 quiz. You revisit Query Profile, the three caches, Search Optimization, clustering, materialized views, window functions, QUALIFY, and FLATTEN. A set of ten mixed questions then tests the whole domain.

← Day 41 All 50 Days Day 43 →

About the Author

Abhay Krishnan

Senior Data & AI Consultant

Connect on LinkedIn

With over five years of data engineering experience at EY and Infosys, Abhay Krishnan specializes in building scalable data pipelines and cloud warehousing solutions. He is a certified SnowPro Core professional, alongside credentials in AWS and Azure. Abhay created this 50-day track to solve a problem he faced firsthand: the lack of a structured, free resource for Snowflake certification prep. Follow him on LinkedIn for more data engineering insights.

Snowflake Semi-Structured Data – VARIANT, FLATTEN & JSON Paths

The VARIANT Column and Path Notation

What VARIANT stores

Load JSON with PARSE_JSON

Read keys with colon, dot, and bracket

Cast with the double colon

FLATTEN: Arrays Into Rows

Why FLATTEN exists

The six output columns

LATERAL and the OUTER default

Nested arrays need a second LATERAL FLATTEN

When to flatten on load

Semi-Structured Is Not Unstructured

Cheat Sheet

Exam Tip

Hands-On Lab

Practice Questions

Snowflake Documentation

Topics