@cyclux cyclux commented Dec 15, 2025

This pull request introduces a Snowflake data integration layer for the getML Feature Store, including configuration, infrastructure bootstrapping, and data ingestion utilities. It provides a modular, environment-variable-driven setup for connecting to Snowflake, ensures required infrastructure (warehouse/database) is present, and supplies SQL templates and scripts for automating data preparation and ingestion. Additionally, it adds tools for converting Jaffle Shop CSV data to Parquet format for efficient loading.

Snowflake Data Integration Layer

  • Core data integration package:

    • Adds a data package with modules for Snowflake settings, session management, infrastructure bootstrapping, and SQL loading utilities, plus top-level imports for streamlined usage. (integration/snowflake/data/__init__.py, integration/snowflake/data/_settings.py, integration/snowflake/data/_snowflake_session.py, integration/snowflake/data/_bootstrap.py, integration/snowflake/data/_sql_loader.py)
    • Introduces environment-based configuration via SnowflakeSettings and context-managed Snowpark session creation.
    • Ensures Snowflake warehouse and database are automatically created if missing, with idempotent SQL and clear error handling.
  • SQL automation and templates:

    • Adds reusable SQL templates for creating warehouses, databases, schemas, file formats, stages, and tables, and for copying and analyzing data. These enable automated, parameterized infrastructure and ingestion workflows. (integration/snowflake/data/sql/...)
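
For reviewers who want a feel for the pattern before opening the diff, here is a minimal sketch of environment-driven settings feeding idempotent bootstrap SQL. It is not the code in this PR: the class name SketchSettings, the variable names SNOWFLAKE_WAREHOUSE and SNOWFLAKE_DATABASE, and the helper bootstrap_statements are assumptions made for illustration. The only Snowflake behavior relied on is that CREATE WAREHOUSE and CREATE DATABASE accept IF NOT EXISTS, which makes them safe to re-run.

```python
import os
import re
from dataclasses import dataclass
from typing import List, Mapping, Optional

# Unquoted Snowflake identifiers: a letter or underscore first,
# then letters, digits, underscores, or dollar signs.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")

# Hypothetical environment variable names; the PR may read different ones.
_ENV_NAMES = {
    "warehouse": "SNOWFLAKE_WAREHOUSE",
    "database": "SNOWFLAKE_DATABASE",
}


@dataclass(frozen=True)
class SketchSettings:
    """Illustrative stand-in for SnowflakeSettings (fields are assumed)."""

    warehouse: str
    database: str

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "SketchSettings":
        # Accept an explicit mapping so the sketch can be exercised without
        # touching the real process environment.
        source = os.environ if env is None else env
        missing = [var for var in _ENV_NAMES.values() if not source.get(var)]
        if missing:
            raise ValueError("missing environment variables: " + ", ".join(missing))
        return cls(**{field: source[var] for field, var in _ENV_NAMES.items()})


def _checked(identifier: str) -> str:
    # Reject anything that is not a plain identifier before it reaches SQL text.
    if not _IDENTIFIER.fullmatch(identifier):
        raise ValueError(f"not a plain Snowflake identifier: {identifier!r}")
    return identifier


def bootstrap_statements(settings: SketchSettings) -> List[str]:
    """Return SQL that can be run repeatedly without failing on existing objects."""
    return [
        f"CREATE WAREHOUSE IF NOT EXISTS {_checked(settings.warehouse)}",
        f"CREATE DATABASE IF NOT EXISTS {_checked(settings.database)}",
    ]
```

In this sketch the statements are only rendered, not executed; in a real session each string would be passed to whatever executes SQL for the open connection. Validating identifiers up front is one way to surface a misconfigured environment as a clear error rather than as a SQL failure.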

Jaffle Shop Data Preparation

  • Parquet conversion and documentation:
    • Adds a script to convert Jaffle Shop CSV files to Parquet format for efficient storage and downstream processing, with documentation for generating, converting, and uploading the data. (integration/jaffle-shop-data/convert_jaffle_csv_to_parquet.py, integration/jaffle-shop-data/GENERATE_JAFFLE_SHOP_PARQUET.md)
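
As a rough illustration of what a CSV-to-Parquet step can look like, the sketch below pairs each CSV file with a same-named Parquet target and converts it with pyarrow. It is not the contents of convert_jaffle_csv_to_parquet.py: the function names plan_conversions and convert_all are hypothetical, and pyarrow is an assumed choice of library that the actual script may or may not use.

```python
from pathlib import Path
from typing import List, Tuple


def plan_conversions(src_dir: Path, dst_dir: Path) -> List[Tuple[Path, Path]]:
    """Pair every CSV in src_dir with a same-named .parquet target in dst_dir."""
    return [
        (csv_path, dst_dir / (csv_path.stem + ".parquet"))
        for csv_path in sorted(src_dir.glob("*.csv"))
    ]


def convert_all(src_dir: Path, dst_dir: Path) -> List[Path]:
    """Convert each planned CSV to Parquet and return the written paths."""
    # pyarrow is imported here so that planning works even where it is
    # not installed; it is an assumed dependency for this sketch.
    import pyarrow.csv as pa_csv
    import pyarrow.parquet as pa_parquet

    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for csv_path, parquet_path in plan_conversions(src_dir, dst_dir):
        table = pa_csv.read_csv(str(csv_path))  # column types are inferred
        pa_parquet.write_table(table, str(parquet_path))
        written.append(parquet_path)
    return written
```

Calling convert_all(Path("csv_in"), Path("parquet_out")) with directories of your choosing would write one Parquet file per CSV; both directory names here are placeholders.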

Project Configuration

  • Python project setup:
    • Introduces a pyproject.toml with dependencies for data engineering, Snowflake integration, development, and linting, ensuring reproducible environments and code quality. (integration/pyproject.toml)

- Replaced specific version constraints with more flexible ones for better compatibility.
- Added 'getml' as a new dependency.
- Adjusted version specifications for existing dependencies to use compatible ranges.
@cyclux cyclux self-assigned this Dec 15, 2025
@cyclux cyclux linked an issue Dec 15, 2025 that may be closed by this pull request
@srnnkls srnnkls changed the base branch from master to 54-add-ingestion-module-for-gcss3-resources-to-snowflake December 15, 2025 11:11
@srnnkls srnnkls changed the base branch from 54-add-ingestion-module-for-gcss3-resources-to-snowflake to 55-add-data-preparation-orchestration-module December 15, 2025 11:12
Development

Successfully merging this pull request may close these issues.

Create initial snowflake notebook (5 sections)