Modern Open Lakehouse Portfolio Project on Google Cloud

Goal: Build a cost-effective (~$10–30/month), serverless-first Lakehouse for a meat distribution platform, demonstrating modern data engineering practices (Iceberg, Dataplex, Data Vault 2.0 + Kimball, OpenTofu IaC, CI/CD).

Key Technologies:

  • Data Source: A synthetic data generator that simulates a stream of meat processing data.
  • Ingestion: Cloud Run service (Python container) triggered by Cloud Scheduler.
  • Bronze Layer: Raw JSON/Parquet files in GCS.
  • Silver Layer: Data Vault 2.0 modeled Iceberg tables in GCS.
  • Gold Layer: Kimball star schema views or materialized tables queried via BigQuery (over Iceberg/BigLake).
  • Transformations: Dataproc Serverless Spark (PySpark) batches for Iceberg support.
  • Catalog & Governance: Dataplex Universal Catalog (auto-discovery, lineage).
  • BI: Looker Studio (free) public dashboards.
  • IaC: OpenTofu for everything, configured using HCL.
  • CI/CD & Testing: GitHub Actions (lint, plan, tests, apply on merge).

Project Structure

repo-root/
├── .github/workflows/        # GitHub Actions CI/CD pipelines
│   ├── deploy.yml
│   └── ingestion.yml
├── infra/                    # Core infrastructure (WIF, deploy SA, permissions)
│   ├── main.tf
│   └── ...
├── ingestion/
│   └── synthetic-meat/       # Source for the Cloud Run ingestion service
│       ├── src/
│       ├── tests/
│       ├── Dockerfile
│       └── pyproject.toml
├── warehouse/                # Data platform infrastructure (GCS, Dataplex, etc.)
│   ├── main.tf
│   └── ...
└── README.md

Phase 1: Project Setup (1–2 hours)

  1. Create a new GCP project, enable billing.
  2. Enable required APIs:
    • Cloud Scheduler API
    • Cloud Run Admin API
    • Cloud Build API
    • Dataproc API
    • BigQuery API
    • Dataplex API
    • Cloud Storage API
  3. Install locally: gcloud CLI, OpenTofu, Git.
  4. Create GitHub repo and clone locally.

Phase 2: Infrastructure with OpenTofu

Deploy in this order:

  • GCS buckets:
    • ${project_id}-bronze
    • ${project_id}-silver
    • ${project_id}-deps (for Spark jars/temp)
  • Dataplex lake with zones:
    • Lake: meat-market-lake
    • Zones: raw (bronze), curated (silver)
    • Assets linking buckets to zones
  • BigQuery dataset: gold_meat_market
  • Service accounts & IAM:
    • One for Dataproc (BigQuery, Storage, Dataplex roles)
  • BigLake connection (if needed for Iceberg catalog)

Use community modules where possible (e.g., GoogleCloudPlatform/cloud-foundation-fabric).

Validate locally: tofu init → fmt → validate → plan → apply.

Deployment Process

This repository uses a two-part deployment strategy:

  1. Core Infrastructure (infra/): This configuration sets up the foundational components for CI/CD, including the Workload Identity Federation, the deployment service account, and its project-level IAM permissions. Because it grants powerful permissions, it is designed to be applied manually from a local machine after careful review. Any changes to IAM roles in infra/main.tf must be applied locally before they will take effect in the CI/CD pipeline.

    # From your local machine, inside the infra/ directory
    tofu apply -var-file="prod.tfvars"
  2. Warehouse Infrastructure (warehouse/): This configuration defines the application-specific infrastructure, such as GCS buckets and Dataplex assets. It is deployed automatically by the GitHub Actions workflow (.github/workflows/deploy.yml) whenever changes are pushed to the warehouse/ directory.

Phase 3: Data Generation & Ingestion to Bronze

  • Data Source: A synthetic data generator script (Python) that simulates a stream of meat processing data.
  • Methodology (a minimal Python sketch follows this list):
    • Use aggregated public data (e.g., from MLA) as a baseline for realistic distributions of weights (e.g., 250–400 kg HSCW), grades, and prices.
    • Use a library like polars to generate thousands of "fake" individual animal/carcass records.
    • Sample attributes like weight from normal distributions based on grade and animal class.
    • Assign pseudo-random identifiers (e.g., RFID-style tags) for traceability.
    • Calculate prices based on grid formulas, applying premiums/discounts for factors like marbling, fat depth, and yield.
    • Include additional fields for rich analytics, such as slaughter date, processing plant ID, breed, and quality scores.
  • Container: The data generation logic is packaged as a Docker container and deployed as a serverless Cloud Run service.
  • Execution: A Cloud Scheduler job triggers the Cloud Run service via an HTTP request on a daily schedule.
    • The service generates a new batch of data upon each invocation.
    • It converts the generated data to Parquet format.
    • It writes the partitioned data to the bronze GCS bucket, e.g., gs://${project_id}-bronze/carcasses/year=2025/month=12/day=27/plant_id=P01/batch_12345.parquet
  • Discovery: Dataplex automatically discovers the new Parquet files as they land, making them available for querying via BigLake.
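
The actual generator lives under ingestion/synthetic-meat/src/. As a rough illustration only, a polars/numpy sketch of the batch-generation step could look like the following; the grade grid values, column names, and value ranges are assumptions, not the project's actual schema:

    # Illustrative sketch only; grade grid, column names, and ranges are assumptions.
    import uuid
    from datetime import date

    import numpy as np
    import polars as pl

    # Assumed grade -> (mean HSCW kg, std dev kg, base $/kg) lookup.
    GRADE_GRID = {
        "MSA": (330.0, 35.0, 7.20),
        "PRIME": (310.0, 30.0, 6.80),
        "STANDARD": (290.0, 40.0, 6.10),
    }

    def generate_batch(n_records: int, plant_id: str, rng: np.random.Generator) -> pl.DataFrame:
        """Generate one synthetic batch of carcass records."""
        grades = rng.choice(list(GRADE_GRID), size=n_records)
        means = np.array([GRADE_GRID[g][0] for g in grades])
        stds = np.array([GRADE_GRID[g][1] for g in grades])
        base = np.array([GRADE_GRID[g][2] for g in grades])

        hscw_kg = np.clip(rng.normal(means, stds), 250.0, 400.0)  # keep within the assumed HSCW range
        marbling = rng.integers(1, 7, size=n_records)             # toy 1-6 marbling score
        price_kg = base + 0.10 * (marbling - 3)                   # toy premium/discount grid

        return pl.DataFrame({
            "rfid_tag": [uuid.uuid4().hex[:12].upper() for _ in range(n_records)],
            "plant_id": [plant_id] * n_records,
            "slaughter_date": [date.today()] * n_records,
            "grade": grades.tolist(),
            "hscw_kg": hscw_kg.round(1),
            "marbling_score": marbling,
            "price_per_kg": price_kg.round(2),
        })

    if __name__ == "__main__":
        # The Cloud Run handler would call this once per invocation and write the
        # result to the bronze bucket as Parquet (path shown in the Execution bullet above).
        df = generate_batch(1000, plant_id="P01", rng=np.random.default_rng(42))
        df.write_parquet("batch_local_example.parquet")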

Phase 4: Transformations

Bronze → Silver (Data Vault 2.0 with Iceberg)

Use Dataproc Serverless PySpark batch:

  • Catalog: BigLakeCatalog (integrated with Dataplex).
  • Read bronze Parquet.
  • Build DV2 entities (see the PySpark sketch after this list):
    • Hub_Carcass (business key: carcass_id/rfid_tag)
    • Hub_Processor (business key: plant_id)
    • Sat_Carcass_Details (quality scores, weights, grades)
    • Link_Carcass_Processing (linking carcasses to processing events)
  • Write as Iceberg tables in silver bucket, partitioned appropriately.
  • Trigger manually at first (gcloud dataproc batches submit), later via Cloud Scheduler or Pub/Sub on new bronze files.
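
A hedged sketch of the Bronze → Silver batch is shown below; the catalog name (silver_catalog), table names, and columns carry over from the generator sketch above and are assumptions, not the project's actual code:

    # dv2_bronze_to_silver.py -- illustrative sketch; catalog, table, and column names are assumptions.
    from pyspark.sql import SparkSession, functions as F

    # Iceberg/BigLake catalog settings are assumed to be passed via --properties
    # (spark.sql.catalog.*) on `gcloud dataproc batches submit pyspark`.
    spark = SparkSession.builder.appName("bronze-to-silver-dv2").getOrCreate()

    bronze = spark.read.parquet("gs://<project>-bronze/carcasses/")

    # Hub: one row per business key, with a deterministic hash key and load metadata.
    hub_carcass = (
        bronze.select("rfid_tag")
        .dropDuplicates(["rfid_tag"])
        .withColumn("hub_carcass_hk", F.sha2(F.col("rfid_tag"), 256))
        .withColumn("load_ts", F.current_timestamp())
        .withColumn("record_source", F.lit("synthetic-meat"))
    )

    # Satellite: descriptive attributes keyed by the hub hash, with a hashdiff for change detection.
    attrs = ["grade", "hscw_kg", "marbling_score", "price_per_kg"]
    sat_carcass_details = (
        bronze.withColumn("hub_carcass_hk", F.sha2(F.col("rfid_tag"), 256))
        .withColumn("hashdiff", F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in attrs]), 256))
        .withColumn("load_ts", F.current_timestamp())
        .select("hub_carcass_hk", "hashdiff", *attrs, "load_ts")
    )

    # Append into Iceberg tables in the silver zone (a real load would first
    # de-duplicate against existing hub keys and satellite hashdiffs).
    hub_carcass.writeTo("silver_catalog.dv.hub_carcass").append()
    sat_carcass_details.writeTo("silver_catalog.dv.sat_carcass_details").append()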

Silver → Gold (Kimball Star Schema)

  • Create dimension/fact tables (e.g., dim_product, dim_date, fact_trades); a sketch follows this list.
  • Materialize as:
    • BigQuery native tables (recommended), or
    • Iceberg tables queried via BigLake.
  • Use views in BigQuery for final Kimball schema.
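
As a hedged sketch, a Dataproc job could flatten the vault tables from the previous example into a small star schema and materialize it as BigQuery native tables via the Spark BigQuery connector; the dimension/fact names and the temporaryGcsBucket value are assumptions:

    # silver_to_gold.py -- illustrative sketch; table, column, and bucket names are assumptions.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("silver-to-gold-kimball").getOrCreate()

    hub = spark.read.table("silver_catalog.dv.hub_carcass")
    sat = spark.read.table("silver_catalog.dv.sat_carcass_details")

    # dim_carcass: the latest satellite row per hub key, joined back to the business key.
    latest = Window.partitionBy("hub_carcass_hk").orderBy(F.col("load_ts").desc())
    dim_carcass = (
        sat.withColumn("rn", F.row_number().over(latest))
        .filter("rn = 1")
        .join(hub.select("hub_carcass_hk", "rfid_tag"), "hub_carcass_hk")
        .select("hub_carcass_hk", "rfid_tag", "grade", "marbling_score")
    )

    # fact_carcass_pricing: one row per carcass load with additive measures.
    fact_carcass_pricing = (
        sat.select("hub_carcass_hk", "hscw_kg", "price_per_kg", "load_ts")
        .withColumn("total_value", F.col("hscw_kg") * F.col("price_per_kg"))
    )

    # Materialize as BigQuery native tables (indirect write via a temporary GCS bucket).
    for name, df in [("dim_carcass", dim_carcass), ("fact_carcass_pricing", fact_carcass_pricing)]:
        (df.write.format("bigquery")
           .option("table", f"gold_meat_market.{name}")
           .option("temporaryGcsBucket", "<project>-deps")
           .mode("overwrite")
           .save())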

Phase 5: BI Layer

  • Connect Looker Studio to BigQuery gold_meat_market dataset.
  • Build dashboards:
    • Carcass weight distribution by grade
    • Average price per kg over time
    • Yield analysis by processing plant
  • Make dashboards public (share link) for portfolio demo.

Phase 6: CI/CD and Testing

GitHub Actions Workflow (on push/PR and merge):

  1. OpenTofu fmt/validate/plan
  2. Unit tests (pytest) for the data generator and transformation logic.
  3. Integration tests (optional, in a separate test project; see the sketch after this list):
    • Deploy infra
    • Trigger ingestion
    • Assert files in GCS
    • Run Spark job
    • Query BigQuery for expected rows
  4. On main merge (with approval): tofu apply
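
As a hedged illustration of the GCS and BigQuery assertions from step 3 (the project ID, bucket, and table names are placeholders), an integration test might look like this:

    # tests/integration/test_pipeline.py -- illustrative sketch; project, bucket, and table names are placeholders.
    from google.cloud import bigquery, storage

    PROJECT_ID = "my-test-project"          # assumed: dedicated test project
    BRONZE_BUCKET = f"{PROJECT_ID}-bronze"

    def test_bronze_files_landed():
        # Assert that the triggered ingestion wrote at least one Parquet file.
        client = storage.Client(project=PROJECT_ID)
        blobs = list(client.list_blobs(BRONZE_BUCKET, prefix="carcasses/"))
        assert any(b.name.endswith(".parquet") for b in blobs)

    def test_gold_fact_table_has_rows():
        # Assert that the Spark jobs produced queryable rows in the gold dataset.
        client = bigquery.Client(project=PROJECT_ID)
        query = f"SELECT COUNT(*) AS n FROM `{PROJECT_ID}.gold_meat_market.fact_carcass_pricing`"
        rows = list(client.query(query).result())
        assert rows[0].n > 0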

Testing Tips (a minimal pytest sketch follows this list):

  • Mock API calls in unit tests.
  • Use local Spark for DV logic testing.
  • Keep tests fast and idempotent.
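
A matching unit-test sketch for the generator shown earlier; the module path synthetic_meat.generator is hypothetical, and the assertions are only examples:

    # tests/test_generator.py -- illustrative sketch; the import path is hypothetical.
    import numpy as np

    from synthetic_meat.generator import generate_batch  # hypothetical module for the sketch above

    def test_batch_shape_and_schema():
        df = generate_batch(500, plant_id="P01", rng=np.random.default_rng(0))
        assert df.height == 500
        assert {"rfid_tag", "plant_id", "grade", "hscw_kg", "price_per_kg"} <= set(df.columns)

    def test_weights_stay_within_expected_hscw_range():
        df = generate_batch(500, plant_id="P01", rng=np.random.default_rng(0))
        assert df["hscw_kg"].min() >= 250.0
        assert df["hscw_kg"].max() <= 400.0

    def test_deterministic_with_fixed_seed():
        # Fast and idempotent: the same seed yields the same frame (rfid_tag uses uuid, so exclude it).
        a = generate_batch(100, plant_id="P01", rng=np.random.default_rng(7))
        b = generate_batch(100, plant_id="P01", rng=np.random.default_rng(7))
        assert a.drop("rfid_tag").equals(b.drop("rfid_tag"))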

Validation Milestones (Quick Wins)

  1. OpenTofu apply → Dataplex lake + buckets visible.
  2. Run data generator → Files land in bronze → BigLake table auto-created.
  3. Run Dataproc Spark job → Iceberg tables in silver → Queryable in BigQuery.
  4. Build Looker Studio dashboard → Data visualized.
  5. CI/CD pipeline runs successfully on a commit.
