22 commits
- `aa9a410` Add Databricks ingestion script and requirements file (awaismirza92, Dec 2, 2025)
- `f329566` Make the injection function callable (awaismirza92, Dec 2, 2025)
- `7c91e7b` Mention import of injection function (awaismirza92, Dec 2, 2025)
- `6464559` Migrate to pyproject.toml & uv (awaismirza92, Dec 3, 2025)
- `1eb9a88` Replace relative import with absolute import in Databricks ingestion (awaismirza92, Dec 9, 2025)
- `dc86916` Remove pandas by streaming files to volume (awaismirza92, Dec 9, 2025)
- `1011f94` Document DEFAULT profile, remove python & preparation (awaismirza92, Dec 10, 2025)
- `0a780e3` Add parameter markers and pydantic validation to avoid SQL injection (awaismirza92, Dec 10, 2025)
- `1fcbd4f` Rename variable name for clarity (awaismirza92, Dec 10, 2025)
- `f86c976` Fix bug in string wrapping (awaismirza92, Dec 11, 2025)
- `b3af7fc` Suppress INFO/WARNING from absl/glog (awaismirza92, Dec 12, 2025)
- `375adfd` Remove the mutation of `os.environ` (awaismirza92, Dec 15, 2025)
- `0a19854` Use module level constant for request timeout & increase it to 300s (awaismirza92, Dec 15, 2025)
- `c006495` Add debug logging for successful volume file cleanup (awaismirza92, Dec 15, 2025)
- `c4a0955` Move logging suppression to a dedicated function (awaismirza92, Dec 15, 2025)
- `e009302` Remove ignore comment for model_config in TableConfig (awaismirza92, Dec 15, 2025)
- `962be5b` Remove trailing blank lines at end of pyproject.toml (awaismirza92, Dec 15, 2025)
- `e46fe23` Close spark session if the function creates it (awaismirza92, Dec 15, 2025)
- `da01415` Use exact matches for databricks group in pyproject.toml (awaismirza92, Dec 15, 2025)
- `447d527` Remove comments leaking LLM focus from ingestion module (awaismirza92, Dec 15, 2025)
- `00f30c5` Use exact version specifications for databricks dependencies in uv.lock (awaismirza92, Dec 15, 2025)
- `bcbc393` Replace SQL identifier validation with quoting using sqlglot (awaismirza92, Dec 17, 2025)
9 changes: 9 additions & 0 deletions .gitignore
@@ -130,6 +130,9 @@ ENV/
env.bak/
venv.bak/

# Environment variables file
.mise.local.toml

# Spyder project settings
.spyderproject
.spyproject
@@ -150,5 +153,11 @@

# generated
*_spark/
metastore_db/
spark-warehouse/
*_pipeline/
.vscode/

# LLM files
AGENTS.md
.github/copilot-instructions.md
27 changes: 22 additions & 5 deletions README.md
@@ -31,6 +31,7 @@ This repository contains different [Jupyter Notebooks](https://jupyter.org) to d
- [Experimenting Locally](#experimenting-locally)
- [Using Docker](#using-docker)
- [On the Machine (Linux/x64 \& arm64)](#on-the-machine-linuxx64--arm64)
- [Optional: Spark and Databricks Support](#optional-spark-and-databricks-support)
- [Notebooks](#notebooks)
- [Overview](#overview)
- [Descriptions](#descriptions)
@@ -93,13 +94,29 @@ The following commands will set up a Python environment with necessary Python li
```
$ git clone https://github.com/getml/getml-demo.git
$ cd getml-demo
-$ pipx install hatch
-$ hatch env create
-$ hatch shell
-$ pip install -r requirements.txt
-$ jupyter-lab
+$ pipx install uv
+$ uv run jupyter-lab
```

#### Optional: Spark and Databricks Support

Some notebooks (e.g., `imdb.ipynb`, `online_retail.ipynb`) demonstrate exporting features to Spark SQL. For these, you need to install additional dependencies:

> [!IMPORTANT]
> The `spark` and `databricks` dependency groups are **mutually exclusive** and cannot be installed together. The `--isolated` flag runs the command in a temporary environment without affecting your main installation.

**For local Spark execution** (running Spark locally on your machine):
```
$ uv run --group spark --isolated jupyter-lab
```

**For Databricks integration** (connecting to Databricks compute):
```
$ uv run --group databricks jupyter-lab
```

See [integration/databricks/README.md](integration/databricks/README.md) for Databricks setup instructions.

> [!TIP]
> Install the [Enterprise trial version](https://getml.com/latest/enterprise/request-trial) via the [Install getML on Linux guide](https://getml.com/latest/install/packages/linux#install-getml-on-linux) to try the Enterprise features.

1 change: 1 addition & 0 deletions integration/__init__.py
@@ -0,0 +1 @@
"""Integration modules for connecting getML with external platforms."""
124 changes: 124 additions & 0 deletions integration/databricks/README.md
@@ -0,0 +1,124 @@
# Databricks Data Integration

This directory contains modules for ingesting data from Google Cloud Storage (GCS) into Databricks Delta Lake and preparing population tables for getML feature engineering.

## Prerequisites

- **Databricks Free Edition account** (or higher tier)
- **Databricks CLI** installed

## Setup

### 1. Install Databricks CLI

```bash
# macOS
brew install databricks/tap/databricks

# Linux, macOS, and Windows
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
```

More: https://docs.databricks.com/gcp/en/dev-tools/cli/install

### 2. Install Dependencies with uv

> [!IMPORTANT]
> The `databricks` dependency group uses `databricks-connect`, which **cannot be installed alongside `pyspark`**. These packages are mutually exclusive. If you need local Spark execution (e.g., for notebooks like `imdb.ipynb`), use `uv run --group spark --isolated` instead to run in a temporary isolated environment.

```bash
# From the repository root
cd getml-demo

# Install uv if not already installed
pipx install uv

# Run JupyterLab after installing the dependencies in the databricks group
uv run --group databricks jupyter-lab
```

### 3. Authenticate with Databricks

```bash
# Get your workspace URL from your Databricks Free Edition account
# It looks like: https://<workspace-id>.cloud.databricks.com

databricks auth login --host https://<your-workspace>.cloud.databricks.com --profile DEFAULT
```

This will open a browser for OAuth authentication. After a successful login, the `DEFAULT` profile is stored in `~/.databrickscfg`.
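
For reference, the stored profile looks roughly like the following; the exact fields depend on your auth method, and `<workspace-id>` is a placeholder:

```ini
# Illustrative ~/.databrickscfg entry after OAuth login; fields may vary
[DEFAULT]
host      = https://<workspace-id>.cloud.databricks.com
auth_type = databricks-cli
```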

### 4. Verify Authentication

```bash
databricks auth profiles
```

You should see your workspace listed.
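
You can also verify connectivity from Python. Here is a minimal sketch assuming the `databricks-sdk` package is available in your environment (the `databricks` dependency group should pull it in):

```python
# Minimal connectivity check; assumes databricks-sdk is installed and the
# DEFAULT profile exists in ~/.databrickscfg.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient(profile="DEFAULT")  # reads the profile from ~/.databrickscfg
print(w.current_user.me().user_name)    # prints the authenticated user's name
```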

## Usage

### Python API (Recommended)

Use the modules directly in notebooks or scripts:

```python
from integration.databricks.data import ingestion

# Load raw data from GCS to Databricks
loaded_tables = ingestion.load_from_gcs(
    bucket="https://static.getml.com/datasets/jaffle_shop/",
    destination_schema="jaffle_shop"
)
print(f"Loaded {len(loaded_tables)} tables")
```

### Load Specific Tables

```python
from integration.databricks.data import ingestion

# Load only the tables you need
ingestion.load_from_gcs(
    destination_schema="RAW",
    tables=["raw_customers", "raw_orders", "raw_items", "raw_products"]
)
```

### Configure the Databricks profile (optional)

The steps above created the `DEFAULT` profile for Databricks authentication, and the
ingestion module also defaults to the `DEFAULT` profile, so authentication should work
out of the box with a single profile.

If you have multiple profiles (e.g., for different Databricks hosts), you can set the
`DATABRICKS_CONFIG_PROFILE` environment variable in `.mise.local.toml` (gitignored) to
pin a specific profile for this project:

```toml
[env]
DATABRICKS_CONFIG_PROFILE = "Code17"
```

In this example, the `Code17` profile will be used instead of the `DEFAULT` one.
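
If you do not use mise, exporting the variable in your shell session achieves the same effect:

```bash
# Alternative without mise: pin the profile for the current shell session
export DATABRICKS_CONFIG_PROFILE="Code17"
```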

## Troubleshooting

### Authentication Errors

```bash
# Re-authenticate
databricks auth login --host https://<your-workspace>.cloud.databricks.com

# Check your profile
databricks auth env
```

### Connection Timeout

Free Edition has limited compute resources. If you see timeouts:
- Wait a few minutes and retry (a serverless cold start can take anywhere from a few seconds to several minutes); a simple retry sketch follows this list
- Check your quota in the Databricks workspace
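
The retry can be automated. The following is a rough sketch, not part of the ingestion module; the broad `except Exception` is a deliberate placeholder that you should narrow to the timeout error your connector actually raises:

```python
# Hypothetical retry loop for serverless cold starts -- not part of the
# ingestion module. Narrow the except clause to your connector's timeout error.
import time

from integration.databricks.data import ingestion

for attempt in range(1, 4):
    try:
        ingestion.load_from_gcs(destination_schema="jaffle_shop")
        break  # success, stop retrying
    except Exception as exc:  # placeholder for the connector's timeout error
        if attempt == 3:
            raise  # give up after the final attempt
        wait_seconds = 60 * attempt  # linear backoff: 60s, then 120s
        print(f"Attempt {attempt} failed ({exc}); retrying in {wait_seconds}s")
        time.sleep(wait_seconds)
```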


1 change: 1 addition & 0 deletions integration/databricks/__init__.py
@@ -0,0 +1 @@
"""Databricks integration for getML demos."""
Empty file.