Conversation

awaismirza92 (Collaborator) commented Dec 2, 2025

Closes: #50 & #52

awaismirza92 linked an issue Dec 2, 2025 that may be closed by this pull request
awaismirza92 self-assigned this Dec 2, 2025
awaismirza92 marked this pull request as ready for review December 2, 2025 15:43
awaismirza92 requested a review from srnnkls December 2, 2025 15:43
srnnkls's comment was marked as outdated.

srnnkls (Collaborator) commented Dec 3, 2025

Review Summary

Style Reference: Python Style Guide (getml/code17-northstar#18)


Context: Requirements from #42

This PR implements Databricks ingestion, mirroring the pattern established in #42 (Build data preparation infrastructure for feature store notebooks). Issue #42 defines the expected architecture:

integration/{platform}/
├── data/
│   ├── ingestion.py      # GCS → Platform loader
│   ├── preparation.py    # Orchestration module  
│   └── sql/              # Externalized SQL queries
└── tests/

Expected usage pattern from #42:

from integration.{platform}.data import ingestion, preparation

ingestion.load_from_gcs(
    bucket="gs://static.getml.com/datasets/jaffle-shop/",
    destination_schema="RAW"
)

The key expectation: data warehouses/platforms should use their native capabilities to ingest from GCS rather than downloading through Python.


Critical Deviation: Architecture

The current implementation downloads parquet files into local memory via requests.get() + pandas, then converts them to a Spark DataFrame. This fundamentally misunderstands how Spark/Databricks works.

| Aspect | Expected (per #42 pattern) | Actual Implementation |
| --- | --- | --- |
| Data flow | GCS → Spark → Delta (direct) | GCS → Python memory → pandas → Spark → Delta |
| Scalability | Distributed across cluster | Limited by local memory |
| Dependencies | pyspark, databricks-connect | + pandas, requests, pyarrow |
| Performance | Native Spark parallelism | Single-threaded download |

Correct approach:

# Spark reads parquet directly from URL - no local memory needed
spark.read.parquet(source_url).write.format("delta").saveAsTable(target_table)

Deviations from #42 Structure

| Requirement from #42 | Status in PR |
| --- | --- |
| ingestion.py module | ✓ Present (but wrong approach) |
| preparation.py module | ✗ Missing (but referenced in README) |
| sql/ directory | ✗ Missing |
| Integration tests | ✗ Missing |
| pyproject.toml | ✗ Uses requirements.txt instead |
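
For the missing preparation.py and sql/ pieces, here is a minimal sketch of what the #42 layout appears to expect. The module path follows the tree above; the function name and the one-statement-per-file convention are assumptions, not requirements stated in #42.

```python
# integration/databricks/data/preparation.py (hypothetical sketch)
from pathlib import Path

from pyspark.sql import SparkSession

# Externalized SQL queries live next to this module, per the #42 layout.
_SQL_DIR = Path(__file__).parent / "sql"


def prepare(spark: SparkSession) -> None:
    """Run the externalized SQL scripts in lexical order (one statement per file assumed)."""
    for sql_file in sorted(_SQL_DIR.glob("*.sql")):
        spark.sql(sql_file.read_text())
```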

Issues Summary

Critical (blocking):

  • Architecture fundamentally wrong - must use Spark's native parquet reading

High priority:

  • DEFAULT_PROFILE = "Code17" - hardcoded personal config
  • Relative import in CLI script (from data import ingestion)
  • requirements.txt instead of pyproject.toml (project uses uv)
  • SQL injection potential in schema/catalog interpolation

Medium priority:

  • Cryptic variable names (pdf, sdf)
  • README references non-existent preparation module
  • README includes irrelevant Python version troubleshooting
  • Empty __init__.py without __all__ exports (see the sketch after this list)
  • Broad except Exception handling
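
As a sketch of the __all__ point (the package path is taken from the #42 tree; re-exporting only ingestion is an assumption):

```python
# integration/databricks/data/__init__.py (hypothetical sketch)
from integration.databricks.data import ingestion

__all__ = ["ingestion"]
```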

Recommended Changes

  1. Rewrite ingestion to use Spark native reading:

    def load_table(spark: SparkSession, source_url: str, target_table: str) -> int:
        df = spark.read.parquet(source_url)
        df.write.format("delta").mode("overwrite").saveAsTable(target_table)
        return df.count()
  2. Remove pandas/requests dependencies - they're not needed

  3. Add pyproject.toml with uv-compatible structure

  4. Either add preparation.py or remove references from README

  5. Validate SQL identifiers before interpolation

  6. Use absolute imports throughout (a minimal sketch follows)
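
A minimal sketch of the import fix, assuming the CLI script lives under integration/databricks/, the repository root is importable, and the ingestion module exposes the load_from_gcs signature expected by #42 (the file name cli.py is illustrative):

```python
# integration/databricks/cli.py (hypothetical file name)
# Absolute import instead of the current relative `from data import ingestion`.
from integration.databricks.data import ingestion


def main() -> None:
    ingestion.load_from_gcs(
        bucket="gs://static.getml.com/datasets/jaffle-shop/",
        destination_schema="RAW",
    )


if __name__ == "__main__":
    main()
```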

awaismirza92 (Collaborator, Author) commented

@srnnkls I am done with my responses to your previous comments. Please have a look.

srnnkls (Collaborator) left a comment

Solid code overall. Nothing blocking, mostly style and minor suggestions.

srnnkls (Collaborator) left a comment

One blocking issue: remove the os.environ mutation at line 168. The rest are minor suggestions.

awaismirza92 (Collaborator, Author) commented

> One blocking issue: remove the os.environ mutation at line 168. The rest are minor suggestions.

@srnnkls I have addressed both the blocking issue and the minor suggestions. Have a look again.

Comment on lines 80 to 100
def _validate_sql_identifier(value: str) -> str:
    """
    Validate SQL identifier to prevent injection attacks.

    Args:
        value: Identifier to validate.

    Returns:
        The validated identifier.

    Raises:
        ValueError: If identifier contains invalid characters.
    """
    if not _IDENTIFIER_PATTERN.fullmatch(value):
        msg = (
            f"Invalid SQL identifier {value!r}. "
            f"Must match pattern: {_IDENTIFIER_PATTERN.pattern!r}"
        )
        raise ValueError(msg)

    return value
srnnkls (Collaborator) commented Dec 16, 2025

from typing import Annotated
from sqlglot import exp
from pydantic import AfterValidator

def _quote_identifier(raw_identifier: str, dialect: str = "databricks") -> str:
    return exp.to_identifier(raw_identifier).sql(dialect=dialect)

SqlIdentifier = Annotated[str, AfterValidator(_quote_identifier)]
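
A sketch of how the suggested SqlIdentifier type might be consumed; pydantic's TypeAdapter and the CREATE SCHEMA example are illustrations, not something prescribed in this thread:

```python
from pydantic import TypeAdapter

# Hypothetical usage: quote/validate a schema name before interpolating it into SQL.
schema = TypeAdapter(SqlIdentifier).validate_python("raw schema")
# sqlglot backtick-quotes anything that is not a safe bare identifier in the Databricks
# dialect (e.g. "raw schema" -> `raw schema`), while a plain name like RAW passes through as-is.
query = f"CREATE SCHEMA IF NOT EXISTS {schema}"
```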

awaismirza92 (Collaborator, Author) replied

Added: bcbc393

awaismirza92 (Collaborator, Author) commented

@srnnkls have a look again.

Linked issue that may be closed by this pull request: Build data ingestion infrastructure for Databricks notebook