
Conversation

@cyclux (Contributor) commented Dec 5, 2025

This pull request introduces a new, modular data integration layer for loading and preparing the Jaffle Shop dataset in Snowflake for getML Feature Store integration. It provides robust infrastructure bootstrapping, typed configuration, session management, and SQL utilities, all with clear documentation and usage examples. The codebase is organized for maintainability and extensibility, and includes scripts, configuration, and test scaffolding.

The most important changes are:

Infrastructure Bootstrapping and Session Management

  • Added ensure_infrastructure and BootstrapError in data/_bootstrap.py to automatically create Snowflake warehouses and databases if missing, with idempotent operations and clear error handling.
  • Introduced create_session in data/_snowflake_session.py for robust, context-managed Snowflake Snowpark sessions, including error handling for failed connections.
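
For illustration, a minimal sketch of how these pieces are assumed to compose (the names come from the bullets above and the package exports; the exact signatures are assumptions):

from data import SnowflakeSettings, create_session, ensure_infrastructure

settings = SnowflakeSettings()  # reads SNOWFLAKE_* environment variables

# create_session is context-managed and ensure_infrastructure is idempotent,
# so re-running this is safe even when the warehouse and database already exist.
with create_session(settings) as session:
    ensure_infrastructure(session, settings)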

Configuration and Environment Management

  • Added SnowflakeSettings in data/_settings.py for typed configuration loaded from SNOWFLAKE_* environment variables, making authentication and connection setup consistent and secure.
  • Provided a mise.toml file with templated environment variable setup for seamless local development and CI configuration.
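
A sketch of what the typed settings presumably look like, assuming pydantic-settings with an env_prefix (the field names are inferred from the session code quoted in the review below):

from pydantic import SecretStr
from pydantic_settings import BaseSettings, SettingsConfigDict

class SnowflakeSettings(BaseSettings):
    # Each field is read from the matching SNOWFLAKE_* environment variable.
    model_config = SettingsConfigDict(env_prefix="SNOWFLAKE_")

    account: str
    user: str
    password: SecretStr  # SecretStr keeps the value out of reprs and logs
    role: str
    warehouse: str
    database: str
    schema_name: str  # likely named to avoid clashing with BaseModel.schema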

Data Ingestion and SQL Utilities

  • Created data/_sql_loader.py utility for loading and formatting SQL files, and added a suite of parameterized SQL templates for schema, stage, and table creation, as well as data ingestion from Parquet files and cloud storage.
  • Added ingest_jaffle_shop_data.py script to orchestrate end-to-end data ingestion from a public GCS bucket into Snowflake, with automatic infrastructure setup.
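
For illustration, the assumed call shape of the loader (load_sql and the {schema_name}-style placeholders appear in the review discussion further down; the exact signature is an assumption):

from data import SnowflakeSettings, create_session
from data._sql_loader import load_sql

# load_sql reads a template from data/sql/ and substitutes placeholders
# via str.format(), matching the {schema_name} style of the templates.
sql = load_sql("create_schema.sql", schema_name="JAFFLE_SHOP")

with create_session(SnowflakeSettings()) as session:
    session.sql(sql).collect()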

Project Structure and Documentation

  • Established a clear package structure with __init__.py files, public API exports, and comprehensive docstrings for all modules.
  • Added pyproject.toml with dependencies, development tools, and code style/linting configuration for consistent development and testing.

Copilot AI left a comment

Pull request overview

This PR introduces a comprehensive data integration layer for loading the Jaffle Shop dataset from cloud storage (GCS/S3) into Snowflake. The implementation provides typed configuration, automatic infrastructure bootstrapping, robust session management, and SQL utilities with comprehensive test coverage.

Key changes:

  • Infrastructure auto-provisioning with idempotent warehouse/database creation
  • Cloud storage integration supporting both GCS (via HTTPS download/upload) and S3 (via external staging)
  • Comprehensive test suite with unit tests (mocked) and integration tests (real Snowflake connections)

Reviewed changes

Copilot reviewed 25 out of 27 changed files in this pull request and generated no comments.

Summary per file:

  • data/__init__.py: Public API exports for settings, session management, bootstrapping, and ingestion functions
  • data/_settings.py: Typed Snowflake configuration loaded from environment variables using Pydantic
  • data/_snowflake_session.py: Session factory with context manager support and error handling
  • data/_bootstrap.py: Idempotent warehouse and database creation with privilege checks
  • data/_sql_loader.py: Utility for loading and parameterizing SQL templates
  • data/ingestion.py: Main ingestion logic for GCS and S3 with file caching and stage management
  • data/sql/**/*.sql: SQL templates for schema, stage, table creation and data loading
  • ingest_jaffle_shop_data.py: CLI script orchestrating end-to-end GCS ingestion
  • tests/**/*.py: Comprehensive unit and integration test suite with fixtures
  • pyproject.toml: Project configuration with dependencies and linting rules
  • mise.toml: Environment variable configuration using 1Password integration
  • .gitignore: Updated to exclude LLM-related files and fix preparation/ path


Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 27 changed files in this pull request and generated 21 comments.



    List of dropped schema names.
    """
    if database:
        _ = session.sql(f"USE DATABASE {database}").collect()
Copilot AI Dec 14, 2025

SQL injection vulnerability: Using f-string formatting to construct USE DATABASE SQL with user-controlled database parameter without proper escaping or parameterization. An attacker could inject malicious SQL through the database parameter. Use parameterized queries or proper identifier escaping instead.
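
One possible shape of the fix, sketched: validate the value against Snowflake's unquoted-identifier grammar before interpolating it (the helper below is hypothetical, not part of the PR; session and database are the names from the snippet above):

import re

_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")

def validated_identifier(name: str) -> str:
    # Reject anything that is not a plain unquoted Snowflake identifier,
    # so the value is safe to interpolate into DDL text.
    if not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"invalid Snowflake identifier: {name!r}")
    return name

_ = session.sql(f"USE DATABASE {validated_identifier(database)}").collect()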

-- Create internal stage for uploading local parquet files
-- Used when loading data via session.file.put() instead of external storage integration
CREATE OR REPLACE STAGE {schema_name}.{stage_name}
FILE_FORMAT = {schema_name}.PARQUET_FORMAT;
Copilot AI Dec 14, 2025

Missing semicolon at the end of SQL statement. While Snowflake can execute statements without semicolons, it's a best practice to include them for consistency and clarity, especially when statements might be batched together.


for row in result:
    db_name = str(row["DATABASE_NAME"])  # pyright: ignore[reportUnknownArgumentType]
    _ = session.sql(f"DROP DATABASE IF EXISTS {db_name} CASCADE").collect()
Copilot AI Dec 14, 2025

SQL injection vulnerability: Using f-string formatting to construct DROP DATABASE SQL with user-controlled database name from query results without proper escaping. An attacker could potentially inject malicious SQL through crafted database names. Use parameterized queries or proper identifier escaping instead.


# Post-test cleanup: drop the test database
logger.info(f"Dropping test database: {db_name}")
_ = session.sql(f"DROP DATABASE IF EXISTS {db_name} CASCADE").collect()
Copilot AI Dec 14, 2025

SQL injection vulnerability: Using f-string formatting to construct SQL queries with user-controlled database name (db_name) without proper escaping or parameterization. An attacker could inject malicious SQL through the database name. Use parameterized queries or proper identifier escaping instead.

-- No credentials required for public buckets
CREATE OR REPLACE STAGE {schema_name}.{stage_name}
URL = '{bucket_url}'
FILE_FORMAT = {schema_name}.PARQUET_FORMAT
Copilot AI Dec 14, 2025

Missing semicolon at the end of SQL statement. While Snowflake can execute statements without semicolons, it's a best practice to include them for consistency and clarity, especially when statements might be batched together.

Suggested change:
- FILE_FORMAT = {schema_name}.PARQUET_FORMAT
+ FILE_FORMAT = {schema_name}.PARQUET_FORMAT;


# Create the test database
logger.info(f"Creating test database: {db_name}")
_ = session.sql(f"CREATE DATABASE IF NOT EXISTS {db_name}").collect()
Copilot AI Dec 14, 2025

SQL injection vulnerability: Using f-string formatting to construct SQL queries with user-controlled database name (db_name) without proper escaping or parameterization. An attacker could inject malicious SQL through the database name. Use parameterized queries or proper identifier escaping instead.

with create_session(snowflake_settings) as session:
    # Switch to the test database
    logger.info(f"Using test database: {test_database}")
    _ = session.sql(f"USE DATABASE {test_database}").collect()
Copilot AI Dec 14, 2025

SQL injection vulnerability: Using f-string formatting to construct SQL queries with user-controlled database name (test_database) without proper escaping or parameterization. An attacker could inject malicious SQL through the database name. Use parameterized queries or proper identifier escaping instead.

"TC002",
# Allow standard library imports outside type-checking blocks
"TC003",
# Allow TODO comments
Copilot AI Dec 14, 2025

Trailing whitespace after the TODO comment on line 118. This should be removed for consistency with code style guidelines.

Suggested change (removes the trailing whitespace):
- # Allow TODO comments 
+ # Allow TODO comments
@@ -0,0 +1 @@
CREATE SCHEMA IF NOT EXISTS {schema_name}
Copilot AI Dec 14, 2025

Missing semicolon at the end of SQL statement. While Snowflake can execute statements without semicolons, it's a best practice to include them for consistency and clarity, especially when statements might be batched together.

Suggested change:
- CREATE SCHEMA IF NOT EXISTS {schema_name}
+ CREATE SCHEMA IF NOT EXISTS {schema_name};

@srnnkls (Collaborator) left a comment

Summary

The implementation is functional but over-engineered for its scope. Per the proposed structure in #42:

getml-demo/integration/snowflake/
├── data/
│   ├── ingestion.py          # GCS → Snowflake loader
│   ├── preparation.py        # Orchestration module
│   └── sql/
│       ├── create_population.sql
│       ├── calculate_target.sql
│       └── ...
└── tests/
    └── test_data_pipeline.py  # Integration test

We don't need more than this. KISS and YAGNI.

Specific issues:

  • Single-purpose wrapper modules (_sql_loader.py, _snowflake_session.py) add indirection without value
  • Custom exceptions (BootstrapError, DataIngestionError) don't add semantic value over standard exceptions
  • Test classes used purely for namespacing rather than shared fixtures
  • pyproject.toml contains many explicit defaults that could be removed

See inline comments for details.

@@ -0,0 +1,6 @@
CREATE WAREHOUSE IF NOT EXISTS {warehouse_name}
WITH
WAREHOUSE_SIZE = '{warehouse_size}'
Collaborator:

Should be properly parameterized.

@@ -0,0 +1,4 @@
-- Create internal stage for uploading local parquet files
-- Used when loading data via session.file.put() instead of external storage integration
CREATE OR REPLACE STAGE {schema_name}.{stage_name}
Collaborator:

If you need to parameterize schema and stage, you should validate the identifiers. The SQLIdentifier pattern used by @awaismirza92 is the right approach. Make this a shared resource and do the same.

Collaborator:

Snowflake has IDENTIFIER(), which is the right approach here. @awaismirza92's Pydantic approach, while fine, also doesn't feel like a fit for the use case (one-off preparation for notebooks).
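
A sketch of the IDENTIFIER() route, assuming the statement accepts a bind variable here and that Snowpark's session.sql is given a params list:

# The database name travels as a bind variable and is resolved by
# IDENTIFIER(), so it is never spliced into the SQL text itself.
_ = session.sql("USE DATABASE IDENTIFIER(?)", params=[database]).collect()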

@@ -0,0 +1,40 @@
"""SQL file loading utilities.
Collaborator:

This is a 40-line module to wrap two lines of actual logic:

content = (some_path / path).read_text()
return content.format(**kwargs) if kwargs else content

The indirection buys you almost nothing:

  • Fixed directory, now your SQL must live in data/sql/, no flexibility
  • String formatting for SQL, which is the injection problem outlined above
  • No validation — kwargs go straight into .format()
  • Extra cognitive load — reader has to find this module to understand what load_sql does

Compare to just writing it inline where you need it:

sql = (SQL_DIR / "create_schema.sql").read_text().format(schema_name=validated_name)

Same thing, zero indirection, obvious what's happening.

If you were to keep a helper, it should at least earn its keep by adding the identifier validation:

def load_sql(path: str, **identifiers: SqlIdentifier) -> str:
    ...

But even then — do you have enough SQL files to justify this? Or is it one of those "I might need this later" abstractions that just adds grep distance?

@@ -0,0 +1,54 @@
"""Snowflake session management for getML Feature Store integration.
Collaborator:

Same energy. _establish_session is a 5-line function that wraps a one-liner and gets called exactly once.

Session.builder.configs(connection_params).create()

That's the whole thing. The try/except doesn't add much — you're catching Exception, wrapping it in a Snowpark-specific exception, and including... the same info that would already be in the traceback.

The entire module could be:

def create_session(settings: SnowflakeSettings) -> Session:
    return Session.builder.configs({
        "account": settings.account,
        "user": settings.user,
        "password": settings.password.get_secret_value(),
        "role": settings.role,
        "warehouse": settings.warehouse,
        "database": settings.database,
        "schema": settings.schema_name,
    }).create()

Or arguably just inline it at the call site, since there's probably only one or two places that create sessions.

The logger isn't even used. The docstrings are longer than the code. The _establish_session split suggests there was going to be retry logic or connection pooling that never materialized.

It's not wrong, it's just... enterprise-brained. Every concept gets a file, every operation gets a function, every function gets a docstring, and the actual logic is buried under ceremony.

@srnnkls (Collaborator) commented Dec 15, 2025

I think there is something wrong with your LLM guidelines:

  • Docstrings longer than the code they document
  • "Usage example" in every module — even trivial ones
  • Defensive splitting — _establish_session exists so there's a "private implementation" and a "public interface" for a one-liner
  • Over-typed obvious things — logger: logging.Logger = logging.getLogger(__name__)
  • Args/Returns/Raises for simple functions — documenting what's already obvious from the signature

AUTO_SUSPEND_SECONDS = 60


class BootstrapError(Exception):
Collaborator:

Why the custom exception? This is a notebook ingestion script, not a library — no one will catch BootstrapError specifically. YAGNI. It feels wrong to spend so much of your complexity budget on stuff like this when the actual SQL is just f"SHOW WAREHOUSES LIKE '{warehouse_name}'".
Your guidelines seem to be optimizing for the wrong things: docstrings, module structure, custom exceptions — over actual robustness like input validation and keeping the code readable.
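
For comparison, a sketch of the same check with standard exceptions and a simple guard on the name (SHOW statements do not take bind variables, so validate-then-format is about the best available; the regex is illustrative):

import re

if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_$]*", warehouse_name):
    raise ValueError(f"invalid warehouse name: {warehouse_name!r}")

# Only attempt creation when the warehouse is missing; the SHOW check
# avoids needing CREATE WAREHOUSE privileges for existing infrastructure.
if not session.sql(f"SHOW WAREHOUSES LIKE '{warehouse_name}'").collect():
    _ = session.sql(f"CREATE WAREHOUSE IF NOT EXISTS {warehouse_name}").collect()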

Comment on lines +53 to +58
[tool.pytest.ini_options]
pythonpath = ["."]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
Collaborator:

All of those are defaults and will, by convention, never change. Why the noise?

Comment on lines +70 to +71
venvPath = "."
venv = ".venv"
Collaborator:

Just defaults.

Comment on lines +72 to +73
reportMissingTypeStubs = false
reportImplicitStringConcatenation = false
Collaborator:

What's the rationale behind setting those?

Comment on lines +85 to +86
line-length = 88
target-version = "py312"
Collaborator:

Defaults (version is inferred from line 27).

Comment on lines +89 to +91
preview = false
quote-style = "double"
line-ending = "auto"
Collaborator:

Defaults.

-- No credentials required for public buckets
CREATE OR REPLACE STAGE {schema_name}.{stage_name}
URL = '{bucket_url}'
FILE_FORMAT = {schema_name}.PARQUET_FORMAT;
Collaborator:

Use parameters.

))
FROM TABLE(
    INFER_SCHEMA(
        LOCATION => '@{schema_name}.{stage_name}/{table_name}.parquet',
Collaborator:

Use parameters.
