Skip to content

Conversation

@awaismirza92
Copy link
Collaborator

Closes #62

This is a squashed commit of the following:

commit 00f30c5
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Mon Dec 15 17:55:59 2025 +0100

    Use exact version specifications for databricks dependencies in uv.lock

commit 447d527
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Mon Dec 15 17:55:45 2025 +0100

    Remove comments leaking LLM focus from ingestion module

commit da01415
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Mon Dec 15 17:18:41 2025 +0100

    Use exact matches for databricks group in pyproject.toml

commit e46fe23
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Mon Dec 15 17:15:44 2025 +0100

    Close spark session if the function creates it

commit 962be5b
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Mon Dec 15 16:56:26 2025 +0100

    Remove trailing blank lines at end of pyproject.toml

commit e009302
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Mon Dec 15 16:53:35 2025 +0100

    Remove ignore comment for model_config in TableConfig

commit c4a0955
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Mon Dec 15 16:49:36 2025 +0100

    Move logging suppression to a dedicated function

commit c006495
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Mon Dec 15 16:37:01 2025 +0100

    Add debug logging for successful volume file cleanup

commit 0a19854
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Mon Dec 15 16:15:35 2025 +0100

    Use module level constant for request timeout & increase it to 300s

commit 375adfd
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Mon Dec 15 16:06:06 2025 +0100

    Remove the mutation of `os.environ`

commit b3af7fc
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Fri Dec 12 12:37:34 2025 +0100

    Suppress INFO/WARNING from absl/glog

commit f86c976
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Thu Dec 11 18:28:22 2025 +0100

    Fix bug in string wrapping

commit 1fcbd4f
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Wed Dec 10 17:16:42 2025 +0100

    Rename variable name for clarity

commit 0a780e3
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Wed Dec 10 17:11:10 2025 +0100

    Add parameter markers and pydantic validation to avoid SQL injection

commit 1011f94
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Wed Dec 10 12:00:50 2025 +0100

    Document DEFAULT profile, remove python & preparation

commit dc86916
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Tue Dec 9 22:47:08 2025 +0100

    Remove pandas by streaming files to volume

commit 1eb9a88
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Tue Dec 9 11:39:08 2025 +0100

    Replace relative import with absolute import in Databricks ingestion

commit 6464559
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Wed Dec 3 14:10:58 2025 +0100

    Migrate to pyproject.toml & uv

commit 7c91e7b
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Tue Dec 2 16:39:22 2025 +0100

    Mention import of injection function

commit f329566
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Tue Dec 2 16:06:44 2025 +0100

    Make the injection function callable

commit aa9a410
Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com>
Date:   Tue Dec 2 15:05:55 2025 +0100

    Add Databricks ingestion script and requirements file
@awaismirza92 awaismirza92 self-assigned this Dec 15, 2025
…data-preparation-module-for-databricks-notebook
@awaismirza92 awaismirza92 marked this pull request as ready for review December 15, 2025 21:24
@awaismirza92 awaismirza92 requested a review from srnnkls December 15, 2025 21:24
Copy link
Collaborator

@srnnkls srnnkls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not giving the same review two times. If you have common functionality to extract, please go ahead and do so. It is not my role as a reviewer to teach you the most basic software engineering best practices (DRY). The idea of separating database-specific logic from orchestration was giving you the ability to look for a common interface, not pseudo-applying lookalike enterprise patterns to simple data preparation scripts for notebook showcases.

See my review on #58 for detailed feedback. The same issues apply here.

Per #42, the target structure is:

getml-demo/integration/{platform}/
├── data/
│   ├── ingestion.py
│   ├── preparation.py
│   └── sql/
│       ├── create_weekly_stores.sql # should probably be called create_stores_per_week or something similar
│       └── calculate_target.sql # should probably called create_weekly_total_sales_per_store_per_week or something similar
└── tests/
    └── test_data_pipeline.py

Specific issues:

  • Remove _sql_loader.py (just inline the two lines)
  • Remove models.py (define the data structures inline)
  • Remove custom exceptions (DataPreparationError adds nothing)
  • Consolidate 9 SQL files into 2
  • Remove display/analysis functions from preparation.py (this is a data loader, not a reporting tool)
  • Remove docstrings and ASCII banner logging

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants