
Conversation

@cyclux (Contributor) commented Dec 5, 2025

This pull request introduces a new, modular data integration layer for loading and preparing the Jaffle Shop dataset in Snowflake for getML Feature Store integration. It provides robust infrastructure bootstrapping, typed configuration, session management, and SQL utilities, all with clear documentation and usage examples. The codebase is organized for maintainability and extensibility, and includes scripts, configuration, and test scaffolding.

The most important changes are:

Infrastructure Bootstrapping and Session Management

  • Added ensure_infrastructure and BootstrapError in data/_bootstrap.py to automatically create Snowflake warehouses and databases if missing, with idempotent operations and clear error handling.
  • Introduced create_session in data/_snowflake_session.py for robust, context-managed Snowflake Snowpark sessions, including error handling for failed connections.
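
For illustration, a minimal sketch of how these pieces are assumed to compose (the names come from the bullets above and the package exports; the exact signatures are assumptions):

from data import SnowflakeSettings, create_session, ensure_infrastructure

settings = SnowflakeSettings()  # reads SNOWFLAKE_* environment variables

# create_session is context-managed and ensure_infrastructure is idempotent,
# so re-running this is safe even when the warehouse and database already exist.
with create_session(settings) as session:
    ensure_infrastructure(session, settings)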

Configuration and Environment Management

  • Added SnowflakeSettings in data/_settings.py for typed configuration loaded from SNOWFLAKE_* environment variables, making authentication and connection setup consistent and secure.
  • Provided a mise.toml file with templated environment variable setup for seamless local development and CI configuration.
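
A sketch of what the typed settings presumably look like, assuming pydantic-settings with an env_prefix (the field names are inferred from the session code quoted in the review below):

from pydantic import SecretStr
from pydantic_settings import BaseSettings, SettingsConfigDict

class SnowflakeSettings(BaseSettings):
    # Each field is read from the matching SNOWFLAKE_* environment variable.
    model_config = SettingsConfigDict(env_prefix="SNOWFLAKE_")

    account: str
    user: str
    password: SecretStr  # SecretStr keeps the value out of reprs and logs
    role: str
    warehouse: str
    database: str
    schema_name: str  # likely named to avoid clashing with BaseModel.schema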

Data Ingestion and SQL Utilities

  • Created data/_sql_loader.py utility for loading and formatting SQL files, and added a suite of parameterized SQL templates for schema, stage, and table creation, as well as data ingestion from Parquet files and cloud storage.
  • Added ingest_jaffle_shop_data.py script to orchestrate end-to-end data ingestion from a public GCS bucket into Snowflake, with automatic infrastructure setup.
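
For illustration, the assumed call shape of the loader (load_sql and the {schema_name}-style placeholders appear in the review discussion further down; the exact signature is an assumption):

from data import SnowflakeSettings, create_session
from data._sql_loader import load_sql

# load_sql reads a template from data/sql/ and substitutes placeholders
# via str.format(), matching the {schema_name} style of the templates.
sql = load_sql("create_schema.sql", schema_name="JAFFLE_SHOP")

with create_session(SnowflakeSettings()) as session:
    session.sql(sql).collect()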

Project Structure and Documentation

  • Established a clear package structure with __init__.py files, public API exports, and comprehensive docstrings for all modules.
  • Added pyproject.toml with dependencies, development tools, and code style/linting configuration for consistent development and testing.

Copilot AI left a comment

Pull request overview

This PR introduces a comprehensive data integration layer for loading the Jaffle Shop dataset from cloud storage (GCS/S3) into Snowflake. The implementation provides typed configuration, automatic infrastructure bootstrapping, robust session management, and SQL utilities with comprehensive test coverage.

Key changes:

  • Infrastructure auto-provisioning with idempotent warehouse/database creation
  • Cloud storage integration supporting both GCS (via HTTPS download/upload) and S3 (via external staging)
  • Comprehensive test suite with unit tests (mocked) and integration tests (real Snowflake connections)

Reviewed changes

Copilot reviewed 25 out of 27 changed files in this pull request and generated no comments.

Summary per file:

  • data/__init__.py: Public API exports for settings, session management, bootstrapping, and ingestion functions
  • data/_settings.py: Typed Snowflake configuration loaded from environment variables using Pydantic
  • data/_snowflake_session.py: Session factory with context manager support and error handling
  • data/_bootstrap.py: Idempotent warehouse and database creation with privilege checks
  • data/_sql_loader.py: Utility for loading and parameterizing SQL templates
  • data/ingestion.py: Main ingestion logic for GCS and S3 with file caching and stage management
  • data/sql/**/*.sql: SQL templates for schema, stage, table creation and data loading
  • ingest_jaffle_shop_data.py: CLI script orchestrating end-to-end GCS ingestion
  • tests/**/*.py: Comprehensive unit and integration test suite with fixtures
  • pyproject.toml: Project configuration with dependencies and linting rules
  • mise.toml: Environment variable configuration using 1Password integration
  • .gitignore: Updated to exclude LLM-related files and fix preparation/ path


Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 27 changed files in this pull request and generated 21 comments.



    List of dropped schema names.
    """
    if database:
        _ = session.sql(f"USE DATABASE {database}").collect()
Copilot AI Dec 14, 2025

SQL injection vulnerability: Using f-string formatting to construct USE DATABASE SQL with user-controlled database parameter without proper escaping or parameterization. An attacker could inject malicious SQL through the database parameter. Use parameterized queries or proper identifier escaping instead.
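
One possible shape of the fix, sketched: validate the value against Snowflake's unquoted-identifier grammar before interpolating it (the helper below is hypothetical, not part of the PR; session and database are the names from the snippet above):

import re

_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")

def validated_identifier(name: str) -> str:
    # Reject anything that is not a plain unquoted Snowflake identifier,
    # so the value is safe to interpolate into DDL text.
    if not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"invalid Snowflake identifier: {name!r}")
    return name

_ = session.sql(f"USE DATABASE {validated_identifier(database)}").collect()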

-- Create internal stage for uploading local parquet files
-- Used when loading data via session.file.put() instead of external storage integration
CREATE OR REPLACE STAGE {schema_name}.{stage_name}
FILE_FORMAT = {schema_name}.PARQUET_FORMAT;
Copilot AI Dec 14, 2025

Missing semicolon at the end of SQL statement. While Snowflake can execute statements without semicolons, it's a best practice to include them for consistency and clarity, especially when statements might be batched together.


for row in result:
    db_name = str(row["DATABASE_NAME"])  # pyright: ignore[reportUnknownArgumentType]
    _ = session.sql(f"DROP DATABASE IF EXISTS {db_name} CASCADE").collect()
Copilot AI Dec 14, 2025

SQL injection vulnerability: Using f-string formatting to construct DROP DATABASE SQL with user-controlled database name from query results without proper escaping. An attacker could potentially inject malicious SQL through crafted database names. Use parameterized queries or proper identifier escaping instead.


# Post-test cleanup: drop the test database
logger.info(f"Dropping test database: {db_name}")
_ = session.sql(f"DROP DATABASE IF EXISTS {db_name} CASCADE").collect()
Copilot AI Dec 14, 2025

SQL injection vulnerability: Using f-string formatting to construct SQL queries with user-controlled database name (db_name) without proper escaping or parameterization. An attacker could inject malicious SQL through the database name. Use parameterized queries or proper identifier escaping instead.

-- No credentials required for public buckets
CREATE OR REPLACE STAGE {schema_name}.{stage_name}
URL = '{bucket_url}'
FILE_FORMAT = {schema_name}.PARQUET_FORMAT
Copilot AI Dec 14, 2025

Missing semicolon at the end of SQL statement. While Snowflake can execute statements without semicolons, it's a best practice to include them for consistency and clarity, especially when statements might be batched together.

Suggested change:
- FILE_FORMAT = {schema_name}.PARQUET_FORMAT
+ FILE_FORMAT = {schema_name}.PARQUET_FORMAT;


# Create the test database
logger.info(f"Creating test database: {db_name}")
_ = session.sql(f"CREATE DATABASE IF NOT EXISTS {db_name}").collect()
Copilot AI Dec 14, 2025

SQL injection vulnerability: Using f-string formatting to construct SQL queries with user-controlled database name (db_name) without proper escaping or parameterization. An attacker could inject malicious SQL through the database name. Use parameterized queries or proper identifier escaping instead.

with create_session(snowflake_settings) as session:
    # Switch to the test database
    logger.info(f"Using test database: {test_database}")
    _ = session.sql(f"USE DATABASE {test_database}").collect()
Copilot AI Dec 14, 2025

SQL injection vulnerability: Using f-string formatting to construct SQL queries with user-controlled database name (test_database) without proper escaping or parameterization. An attacker could inject malicious SQL through the database name. Use parameterized queries or proper identifier escaping instead.

"TC002",
# Allow standard library imports outside type-checking blocks
"TC003",
# Allow TODO comments
Copilot AI Dec 14, 2025

Trailing whitespace after the TODO comment on line 118. This should be removed for consistency with code style guidelines.

Suggested change (removes the trailing whitespace):
- # Allow TODO comments 
+ # Allow TODO comments
@@ -0,0 +1 @@
CREATE SCHEMA IF NOT EXISTS {schema_name}
Copilot AI Dec 14, 2025

Missing semicolon at the end of SQL statement. While Snowflake can execute statements without semicolons, it's a best practice to include them for consistency and clarity, especially when statements might be batched together.

Suggested change:
- CREATE SCHEMA IF NOT EXISTS {schema_name}
+ CREATE SCHEMA IF NOT EXISTS {schema_name};

@srnnkls (Collaborator) left a comment

Summary

The implementation is functional but over-engineered for its scope. Per the proposed structure in #42:

getml-demo/integration/snowflake/
├── data/
│   ├── ingestion.py          # GCS → Snowflake loader
│   ├── preparation.py        # Orchestration module
│   └── sql/
│       ├── create_population.sql
│       ├── calculate_target.sql
│       └── ...
└── tests/
    └── test_data_pipeline.py  # Integration test

We don't need more than this. KISS and YAGNI.

Specific issues:

  • Single-purpose wrapper modules (_sql_loader.py, _snowflake_session.py) add indirection without value
  • Custom exceptions (BootstrapError, DataIngestionError) don't add semantic value over standard exceptions
  • Test classes used purely for namespacing rather than shared fixtures
  • pyproject.toml contains many explicit defaults that could be removed

See inline comments for details.

@@ -0,0 +1,6 @@
CREATE WAREHOUSE IF NOT EXISTS {warehouse_name}
WITH
WAREHOUSE_SIZE = '{warehouse_size}'
Collaborator:

Should be properly parameterized.

@@ -0,0 +1,4 @@
-- Create internal stage for uploading local parquet files
-- Used when loading data via session.file.put() instead of external storage integration
CREATE OR REPLACE STAGE {schema_name}.{stage_name}
Collaborator:

If you need to parameterize schema and stage, you should validate the identifiers. The SQLIdentifier pattern used by @awaismirza92 is the right approach. Make this a shared resource and do the same.

Collaborator:

Snowflake has IDENTIFIER(), which is the right approach here. @awaismirza92's Pydantic approach, while fine, also doesn't feel like a fit for the use case (one-off preparation for notebooks).
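
A sketch of the IDENTIFIER() route, assuming the statement accepts a bind variable here and that Snowpark's session.sql is given a params list:

# The database name travels as a bind variable and is resolved by
# IDENTIFIER(), so it is never spliced into the SQL text itself.
_ = session.sql("USE DATABASE IDENTIFIER(?)", params=[database]).collect()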

@@ -0,0 +1,40 @@
"""SQL file loading utilities.
Collaborator:

This is a 40-line module to wrap two lines of actual logic:

content = (some_path / path).read_text()
return content.format(**kwargs) if kwargs else content

The indirection buys you almost nothing:

  • Fixed directory, now your SQL must live in data/sql/, no flexibility
  • String formatting for SQL, which is the injection problem outlined above
  • No validation — kwargs go straight into .format()
  • Extra cognitive load — reader has to find this module to understand what load_sql does

Compare to just writing it inline where you need it:

sql = (SQL_DIR / "create_schema.sql").read_text().format(schema_name=validated_name)

Same thing, zero indirection, obvious what's happening.

If you were to keep a helper, it should at least earn its keep by adding the identifier validation:

def load_sql(path: str, **identifiers: SqlIdentifier) -> str:
    ...

But even then — do you have enough SQL files to justify this? Or is it one of those "I might need this later" abstractions that just adds grep distance?

@@ -0,0 +1,54 @@
"""Snowflake session management for getML Feature Store integration.
Collaborator:

Same energy. _establish_session is a 5-line function that wraps a one-liner and gets called exactly once.

Session.builder.configs(connection_params).create()

That's the whole thing. The try/except doesn't add much — you're catching Exception, wrapping it in a Snowpark-specific exception, and including... the same info that would already be in the traceback.

The entire module could be:

def create_session(settings: SnowflakeSettings) -> Session:
    return Session.builder.configs({
        "account": settings.account,
        "user": settings.user,
        "password": settings.password.get_secret_value(),
        "role": settings.role,
        "warehouse": settings.warehouse,
        "database": settings.database,
        "schema": settings.schema_name,
    }).create()

Or arguably just inline it at the call site, since there's probably only one or two places that create sessions.

The logger isn't even used. The docstrings are longer than the code. The _establish_session split suggests there was going to be retry logic or connection pooling that never materialized.

It's not wrong, it's just... enterprise-brained. Every concept gets a file, every operation gets a function, every function gets a docstring, and the actual logic is buried under ceremony.

@srnnkls (Collaborator) commented Dec 15, 2025

I think there is something wrong with your LLM guidelines:

  • Docstrings longer than the code they document
  • "Usage example" in every module — even trivial ones
  • Defensive splitting — _establish_session exists so there's a "private implementation" and a "public interface" for a one-liner
  • Over-typed obvious things — logger: logging.Logger = logging.getLogger(__name__)
  • Args/Returns/Raises for simple functions — documenting what's already obvious from the signature

AUTO_SUSPEND_SECONDS = 60


class BootstrapError(Exception):
Collaborator:

Why the custom exception? This is a notebook ingestion script, not a library — no one will catch BootstrapError specifically. YAGNI. It feels wrong to spend so much of your complexity budget on stuff like this when the actual SQL is just f"SHOW WAREHOUSES LIKE '{warehouse_name}'".
Your guidelines seem to be optimizing for the wrong things: docstrings, module structure, custom exceptions — over actual robustness like input validation and keeping the code readable.
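
For comparison, a sketch of the same check with standard exceptions and a simple guard on the name (SHOW statements do not take bind variables, so validate-then-format is about the best available; the regex is illustrative):

import re

if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_$]*", warehouse_name):
    raise ValueError(f"invalid warehouse name: {warehouse_name!r}")

# Only attempt creation when the warehouse is missing; the SHOW check
# avoids needing CREATE WAREHOUSE privileges for existing infrastructure.
if not session.sql(f"SHOW WAREHOUSES LIKE '{warehouse_name}'").collect():
    _ = session.sql(f"CREATE WAREHOUSE IF NOT EXISTS {warehouse_name}").collect()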

Comment on lines +53 to +58
[tool.pytest.ini_options]
pythonpath = ["."]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
Collaborator:

All of those are defaults and will, by convention, never change. Why the noise?

Comment on lines +70 to +71
venvPath = "."
venv = ".venv"
Collaborator:

Just defaults.

Comment on lines +72 to +73
reportMissingTypeStubs = false
reportImplicitStringConcatenation = false
Collaborator:

What's the rationale behind setting those?

Comment on lines +85 to +86
line-length = 88
target-version = "py312"
Collaborator:

Defaults (version is inferred from line 27).

Comment on lines +89 to +91
preview = false
quote-style = "double"
line-ending = "auto"
Collaborator:

Defaults.

-- No credentials required for public buckets
CREATE OR REPLACE STAGE {schema_name}.{stage_name}
URL = '{bucket_url}'
FILE_FORMAT = {schema_name}.PARQUET_FORMAT;
Collaborator:

Use parameters.

))
FROM TABLE(
    INFER_SCHEMA(
        LOCATION => '@{schema_name}.{stage_name}/{table_name}.parquet',
Collaborator:

Use parameters.
