-
Notifications
You must be signed in to change notification settings - Fork 8
Add data preparation module for Databricks notebook #63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 50-data-ingestion-for-databricks-notebook
Are you sure you want to change the base?
Add data preparation module for Databricks notebook #63
Conversation
This is a squashed commit of the following: commit 00f30c5 Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Mon Dec 15 17:55:59 2025 +0100 Use exact version specifications for databricks dependencies in uv.lock commit 447d527 Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Mon Dec 15 17:55:45 2025 +0100 Remove comments leaking LLM focus from ingestion module commit da01415 Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Mon Dec 15 17:18:41 2025 +0100 Use exact matches for databricks group in pyproject.toml commit e46fe23 Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Mon Dec 15 17:15:44 2025 +0100 Close spark session if the function creates it commit 962be5b Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Mon Dec 15 16:56:26 2025 +0100 Remove trailing blank lines at end of pyproject.toml commit e009302 Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Mon Dec 15 16:53:35 2025 +0100 Remove ignore comment for model_config in TableConfig commit c4a0955 Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Mon Dec 15 16:49:36 2025 +0100 Move logging suppression to a dedicated function commit c006495 Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Mon Dec 15 16:37:01 2025 +0100 Add debug logging for successful volume file cleanup commit 0a19854 Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Mon Dec 15 16:15:35 2025 +0100 Use module level constant for request timeout & increase it to 300s commit 375adfd Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Mon Dec 15 16:06:06 2025 +0100 Remove the mutation of `os.environ` commit b3af7fc Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Fri Dec 12 12:37:34 2025 +0100 Suppress INFO/WARNING from absl/glog commit f86c976 Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Thu Dec 11 18:28:22 2025 +0100 Fix bug in string wrapping commit 1fcbd4f Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Wed Dec 10 17:16:42 2025 +0100 Rename variable name for clarity commit 0a780e3 Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Wed Dec 10 17:11:10 2025 +0100 Add parameter markers and pydantic validation to avoid SQL injection commit 1011f94 Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Wed Dec 10 12:00:50 2025 +0100 Document DEFAULT profile, remove python & preparation commit dc86916 Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Tue Dec 9 22:47:08 2025 +0100 Remove pandas by streaming files to volume commit 1eb9a88 Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Tue Dec 9 11:39:08 2025 +0100 Replace relative import with absolute import in Databricks ingestion commit 6464559 Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Wed Dec 3 14:10:58 2025 +0100 Migrate to pyproject.toml & uv commit 7c91e7b Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Tue Dec 2 16:39:22 2025 +0100 Mention import of injection function commit f329566 Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Tue Dec 2 16:06:44 2025 +0100 Make the injection function callable commit aa9a410 Author: Awais Mirza <47096947+awaismirza92@users.noreply.github.com> Date: Tue Dec 2 15:05:55 2025 +0100 Add Databricks ingestion script and requirements file
…data-preparation-module-for-databricks-notebook
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not giving the same review two times. If you have common functionality to extract, please go ahead and do so. It is not my role as a reviewer to teach you the most basic software engineering best practices (DRY). The idea of separating database-specific logic from orchestration was giving you the ability to look for a common interface, not pseudo-applying lookalike enterprise patterns to simple data preparation scripts for notebook showcases.
See my review on #58 for detailed feedback. The same issues apply here.
Per #42, the target structure is:
getml-demo/integration/{platform}/
├── data/
│ ├── ingestion.py
│ ├── preparation.py
│ └── sql/
│ ├── create_weekly_stores.sql # should probably be called create_stores_per_week or something similar
│ └── calculate_target.sql # should probably called create_weekly_total_sales_per_store_per_week or something similar
└── tests/
└── test_data_pipeline.py
Specific issues:
- Remove _sql_loader.py (just inline the two lines)
- Remove models.py (define the data structures inline)
- Remove custom exceptions (DataPreparationError adds nothing)
- Consolidate 9 SQL files into 2
- Remove display/analysis functions from preparation.py (this is a data loader, not a reporting tool)
- Remove docstrings and ASCII banner logging
Closes #62