Ingestion of Jaffle Shop Data to Snowflake #58
base: 53-host-jaffle-shop-parquet-files-on-gcs
Conversation
Pull request overview
This PR introduces a comprehensive data integration layer for loading the Jaffle Shop dataset from cloud storage (GCS/S3) into Snowflake. The implementation provides typed configuration, automatic infrastructure bootstrapping, robust session management, and SQL utilities with comprehensive test coverage.
Key changes:
- Infrastructure auto-provisioning with idempotent warehouse/database creation
- Cloud storage integration supporting both GCS (via HTTPS download/upload) and S3 (via external staging)
- Comprehensive test suite with unit tests (mocked) and integration tests (real Snowflake connections)
Reviewed changes
Copilot reviewed 25 out of 27 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `data/__init__.py` | Public API exports for settings, session management, bootstrapping, and ingestion functions |
| `data/_settings.py` | Typed Snowflake configuration loaded from environment variables using Pydantic |
| `data/_snowflake_session.py` | Session factory with context manager support and error handling |
| `data/_bootstrap.py` | Idempotent warehouse and database creation with privilege checks |
| `data/_sql_loader.py` | Utility for loading and parameterizing SQL templates |
| `data/ingestion.py` | Main ingestion logic for GCS and S3 with file caching and stage management |
| `data/sql/**/*.sql` | SQL templates for schema, stage, table creation and data loading |
| `ingest_jaffle_shop_data.py` | CLI script orchestrating end-to-end GCS ingestion |
| `tests/**/*.py` | Comprehensive unit and integration test suite with fixtures |
| `pyproject.toml` | Project configuration with dependencies and linting rules |
| `mise.toml` | Environment variable configuration using 1Password integration |
| `.gitignore` | Updated to exclude LLM-related files and fix `preparation/` path |
…ngestion-module-for-gcss3-resources-to-snowflake
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
Pull request overview
Copilot reviewed 25 out of 27 changed files in this pull request and generated 21 comments.
```python
        List of dropped schema names.
    """
    if database:
        _ = session.sql(f"USE DATABASE {database}").collect()
```
Copilot AI (Dec 14, 2025):
SQL injection vulnerability: Using f-string formatting to construct USE DATABASE SQL with user-controlled database parameter without proper escaping or parameterization. An attacker could inject malicious SQL through the database parameter. Use parameterized queries or proper identifier escaping instead.
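One way to address this, sketched below as an illustration rather than code from this PR, is to validate and quote identifiers before interpolating them. The `quote_identifier` helper and its regex are assumptions, not part of the changeset:

```python
import re

# Unquoted Snowflake identifiers: letters, digits, underscores, and $;
# must not start with a digit. Anything else is rejected outright.
_IDENTIFIER_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")


def quote_identifier(name: str) -> str:
    """Return a double-quoted Snowflake identifier, or raise on bad input."""
    if not _IDENTIFIER_RE.match(name):
        raise ValueError(f"Invalid SQL identifier: {name!r}")
    # Embedded quotes are impossible here because the regex already rejects them.
    return f'"{name}"'


# Hypothetical usage at the call site flagged above:
# session.sql(f"USE DATABASE {quote_identifier(database)}").collect()
```

Note that quoting makes the identifier case-sensitive in Snowflake, so a validate-and-reject approach (keeping the name unquoted) may fit better when all names are already uppercase.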
```sql
-- Create internal stage for uploading local parquet files
-- Used when loading data via session.file.put() instead of external storage integration
CREATE OR REPLACE STAGE {schema_name}.{stage_name}
    FILE_FORMAT = {schema_name}.PARQUET_FORMAT;
```
Copilot AI (Dec 14, 2025):
Missing semicolon at the end of SQL statement. While Snowflake can execute statements without semicolons, it's a best practice to include them for consistency and clarity, especially when statements might be batched together.
```python
for row in result:
    db_name = str(row["DATABASE_NAME"])  # pyright: ignore[reportUnknownArgumentType]
    _ = session.sql(f"DROP DATABASE IF EXISTS {db_name} CASCADE").collect()
```
Copilot AI (Dec 14, 2025):
SQL injection vulnerability: Using f-string formatting to construct DROP DATABASE SQL with user-controlled database name from query results without proper escaping. An attacker could potentially inject malicious SQL through crafted database names. Use parameterized queries or proper identifier escaping instead.
```python
# Post-test cleanup: drop the test database
logger.info(f"Dropping test database: {db_name}")
_ = session.sql(f"DROP DATABASE IF EXISTS {db_name} CASCADE").collect()
```
Copilot AI (Dec 14, 2025):
SQL injection vulnerability: Using f-string formatting to construct SQL queries with user-controlled database name (db_name) without proper escaping or parameterization. An attacker could inject malicious SQL through the database name. Use parameterized queries or proper identifier escaping instead.
```sql
-- No credentials required for public buckets
CREATE OR REPLACE STAGE {schema_name}.{stage_name}
    URL = '{bucket_url}'
    FILE_FORMAT = {schema_name}.PARQUET_FORMAT
```
Copilot AI (Dec 14, 2025):
Missing semicolon at the end of SQL statement. While Snowflake can execute statements without semicolons, it's a best practice to include them for consistency and clarity, especially when statements might be batched together.
```diff
-    FILE_FORMAT = {schema_name}.PARQUET_FORMAT
+    FILE_FORMAT = {schema_name}.PARQUET_FORMAT;
```
```python
# Create the test database
logger.info(f"Creating test database: {db_name}")
_ = session.sql(f"CREATE DATABASE IF NOT EXISTS {db_name}").collect()
```
Copilot AI (Dec 14, 2025):
SQL injection vulnerability: Using f-string formatting to construct SQL queries with user-controlled database name (db_name) without proper escaping or parameterization. An attacker could inject malicious SQL through the database name. Use parameterized queries or proper identifier escaping instead.
```python
with create_session(snowflake_settings) as session:
    # Switch to the test database
    logger.info(f"Using test database: {test_database}")
    _ = session.sql(f"USE DATABASE {test_database}").collect()
```
Copilot AI (Dec 14, 2025):
SQL injection vulnerability: Using f-string formatting to construct SQL queries with user-controlled database name (test_database) without proper escaping or parameterization. An attacker could inject malicious SQL through the database name. Use parameterized queries or proper identifier escaping instead.
integration/snowflake/pyproject.toml (outdated)
```toml
"TC002",
# Allow standard library imports outside type-checking blocks
"TC003",
# Allow TODO comments
```
Copilot AI (Dec 14, 2025):
Trailing whitespace after the TODO comment on line 118. This should be removed for consistency with code style guidelines.
```diff
-# Allow TODO comments 
+# Allow TODO comments
```
```sql
CREATE SCHEMA IF NOT EXISTS {schema_name}
```
Copilot AI (Dec 14, 2025):
Missing semicolon at the end of SQL statement. While Snowflake can execute statements without semicolons, it's a best practice to include them for consistency and clarity, especially when statements might be batched together.
```diff
-CREATE SCHEMA IF NOT EXISTS {schema_name}
+CREATE SCHEMA IF NOT EXISTS {schema_name};
```
…date pyproject.toml to remove readme reference
Summary
The implementation is functional but over-engineered for its scope. Per the proposed structure in #42:
```text
getml-demo/integration/snowflake/
├── data/
│   ├── ingestion.py          # GCS → Snowflake loader
│   ├── preparation.py        # Orchestration module
│   └── sql/
│       ├── create_population.sql
│       ├── calculate_target.sql
│       └── ...
└── tests/
    └── test_data_pipeline.py # Integration test
```
We don't need more than this. KISS and YAGNI.
Specific issues:
- Single-purpose wrapper modules (`_sql_loader.py`, `_snowflake_session.py`) add indirection without value
- Custom exceptions (`BootstrapError`, `DataIngestionError`) don't add semantic value over standard exceptions
- Test classes used purely for namespacing rather than shared fixtures
- pyproject.toml contains many explicit defaults that could be removed
See inline comments for details.
```sql
CREATE WAREHOUSE IF NOT EXISTS {warehouse_name}
    WITH
        WAREHOUSE_SIZE = '{warehouse_size}'
```
Should be properly parameterized.
```sql
-- Create internal stage for uploading local parquet files
-- Used when loading data via session.file.put() instead of external storage integration
CREATE OR REPLACE STAGE {schema_name}.{stage_name}
```
If you need to parameterize schema and stage, you should validate the identifiers. The SQLIdentifier pattern used by @awaismirza92 is the right approach. Make this a shared resource and do the same.
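For illustration, a minimal sketch of what such a shared `SqlIdentifier` type could look like. The class name follows the pattern mentioned above, but the implementation here is an assumption, not the reviewer's actual code:

```python
import re


class SqlIdentifier(str):
    """A str subclass that only accepts valid unquoted Snowflake identifiers."""

    _PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")

    def __new__(cls, value: str) -> "SqlIdentifier":
        if not cls._PATTERN.match(value):
            raise ValueError(f"Invalid SQL identifier: {value!r}")
        return super().__new__(cls, value)


# Templates can then require SqlIdentifier instead of raw str, e.g. (hypothetical):
# load_sql("create_stage.sql", schema_name=SqlIdentifier("JAFFLE"), stage_name=SqlIdentifier("RAW"))
```

Because it subclasses `str`, validated names drop into `.format()` calls unchanged; invalid input fails at construction time rather than inside a SQL statement.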
Snowflake has `IDENTIFIER`, which is the right approach here. @awaismirza92's Pydantic approach, while fine, also feels like it doesn't fit the use case (one-off preparation for notebooks).
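For reference, `IDENTIFIER(...)` lets a variable stand in for an object name so the name never gets spliced into the statement text. A sketch with illustrative names (the variable values are assumptions, not from this PR):

```sql
-- Session variable holds the name; IDENTIFIER() resolves it as an object reference
SET db_name = 'JAFFLE_SHOP';
USE DATABASE IDENTIFIER($db_name);

-- Works as a DDL target as well
SET schema_name = 'JAFFLE_SHOP.RAW';
CREATE SCHEMA IF NOT EXISTS IDENTIFIER($schema_name);
```

Driver-side bind variables can be used the same way where the connector supports them.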
```python
"""SQL file loading utilities.
```
This is a 40-line module to wrap two lines of actual logic:

```python
content = (some_path / path).read_text()
return content.format(**kwargs) if kwargs else content
```

The indirection buys you almost nothing:

- Fixed directory — your SQL must now live in `data/sql/`, no flexibility
- String formatting for SQL, which is the injection problem outlined above
- No validation — kwargs go straight into `.format()`
- Extra cognitive load — the reader has to find this module to understand what `load_sql` does

Compare to just writing it inline where you need it:

```python
sql = (SQL_DIR / "create_schema.sql").read_text().format(schema_name=validated_name)
```

Same thing, zero indirection, obvious what's happening.

If you were to keep a helper, it should at least earn its keep by adding the identifier validation:

```python
def load_sql(path: str, **identifiers: SqlIdentifier) -> str:
    ...
```

But even then — do you have enough SQL files to justify this? Or is it one of those "I might need this later" abstractions that just adds grep distance?
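If the helper stays, a version that earns its keep might look like the sketch below. The `sql_dir` default and the identifier regex are assumptions for illustration, not the PR's code:

```python
import re
from pathlib import Path

# Unquoted Snowflake identifiers: letters, digits, underscores, $; no leading digit.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")


def load_sql(name: str, sql_dir: Path = Path("data/sql"), **identifiers: str) -> str:
    """Load a SQL template and substitute validated identifiers only."""
    for key, value in identifiers.items():
        if not _IDENT.match(value):
            raise ValueError(f"Invalid identifier for {key}: {value!r}")
    template = (sql_dir / name).read_text()
    return template.format(**identifiers)
```

Same line count as the original module's core, but now every substitution is validated, which is the one thing a shared helper can do that inline `.format()` cannot enforce.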
```python
"""Snowflake session management for getML Feature Store integration.
```
Same energy. `_establish_session` is a 5-line function that wraps a one-liner and gets called exactly once.

```python
Session.builder.configs(connection_params).create()
```

That's the whole thing. The try/except doesn't add much — you're catching `Exception`, wrapping it in a Snowpark-specific exception, and including... the same info that would already be in the traceback.

The entire module could be:

```python
def create_session(settings: SnowflakeSettings) -> Session:
    return Session.builder.configs({
        "account": settings.account,
        "user": settings.user,
        "password": settings.password.get_secret_value(),
        "role": settings.role,
        "warehouse": settings.warehouse,
        "database": settings.database,
        "schema": settings.schema_name,
    }).create()
```

Or arguably just inline it at the call site, since there are probably only one or two places that create sessions.

The logger isn't even used. The docstrings are longer than the code. The `_establish_session` split suggests there was going to be retry logic or connection pooling that never materialized.

It's not wrong, it's just... enterprise-brained. Every concept gets a file, every operation gets a function, every function gets a docstring, and the actual logic is buried under ceremony.
I think there is something wrong with your LLM guidelines:
- Docstrings longer than the code they document
- "Usage example" in every module — even trivial ones
- Defensive splitting — `_establish_session` exists so there's a "private implementation" and a "public interface" for a one-liner
- Over-typed obvious things — `logger: logging.Logger = logging.getLogger(__name__)`
- Args/Returns/Raises for simple functions — documenting what's already obvious from the signature
```python
AUTO_SUSPEND_SECONDS = 60


class BootstrapError(Exception):
```
Why the custom exception? This is a notebook ingestion script, not a library — no one will catch `BootstrapError` specifically. YAGNI. It feels wrong to spend so much of your complexity budget on stuff like this when the actual SQL is just `f"SHOW WAREHOUSES LIKE '{warehouse_name}'"`.
Your guidelines seem to be optimizing for the wrong things: docstrings, module structure, custom exceptions — over actual robustness like input validation and keeping the code readable.
```toml
[tool.pytest.ini_options]
pythonpath = ["."]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
```
All of those are defaults and will, by convention, never change. Why the noise?
```toml
venvPath = "."
venv = ".venv"
```
Just defaults.
```toml
reportMissingTypeStubs = false
reportImplicitStringConcatenation = false
```
What's the rationale behind setting those?
```toml
line-length = 88
target-version = "py312"
```
Defaults (version is inferred from line 27).
```toml
preview = false
quote-style = "double"
line-ending = "auto"
```
Defaults.
```sql
-- No credentials required for public buckets
CREATE OR REPLACE STAGE {schema_name}.{stage_name}
    URL = '{bucket_url}'
    FILE_FORMAT = {schema_name}.PARQUET_FORMAT;
```
Use parameters.
```sql
    ))
FROM TABLE(
    INFER_SCHEMA(
        LOCATION=>'@{schema_name}.{stage_name}/{table_name}.parquet',
```
Use parameters.
This pull request introduces a new, modular data integration layer for loading and preparing the Jaffle Shop dataset in Snowflake for getML Feature Store integration. It provides robust infrastructure bootstrapping, typed configuration, session management, and SQL utilities, all with clear documentation and usage examples. The codebase is organized for maintainability and extensibility, and includes scripts, configuration, and test scaffolding.
The most important changes are:
Infrastructure Bootstrapping and Session Management
- `ensure_infrastructure` and `BootstrapError` in `data/_bootstrap.py` to automatically create Snowflake warehouses and databases if missing, with idempotent operations and clear error handling.
- `create_session` in `data/_snowflake_session.py` for robust, context-managed Snowflake Snowpark sessions, including error handling for failed connections.

Configuration and Environment Management
- `SnowflakeSettings` in `data/_settings.py` for typed configuration loaded from `SNOWFLAKE_*` environment variables, making authentication and connection setup consistent and secure.
- `mise.toml` file with templated environment variable setup for seamless local development and CI configuration.

Data Ingestion and SQL Utilities
- `data/_sql_loader.py` utility for loading and formatting SQL files, and a suite of parameterized SQL templates for schema, stage, and table creation, as well as data ingestion from Parquet files and cloud storage. [1] [2] [3] [4] [5] [6] [7] [8] [9]
- `ingest_jaffle_shop_data.py` script to orchestrate end-to-end data ingestion from a public GCS bucket into Snowflake, with automatic infrastructure setup.

Project Structure and Documentation
- `__init__.py` files, public API exports, and comprehensive docstrings for all modules. [1] [2]
- `pyproject.toml` with dependencies, development tools, and code style/linting configuration for consistent development and testing.