Goal: Build a cost-effective (~$10–30/month), serverless-first Lakehouse for a meat distribution platform, demonstrating modern data engineering practices (Iceberg, DataPlex, Data Vault 2.0 + Kimball, Terraform IaC, CI/CD).
Key Technologies:
- Data Source: A synthetic data generator that simulates a stream of meat processing data.
- Ingestion: Cloud Run service (Python container) triggered by Cloud Scheduler.
- Bronze Layer: Raw JSON/Parquet files in GCS.
- Silver Layer: Data Vault 2.0 modeled Iceberg tables in GCS.
- Gold Layer: Kimball star schema views or materialized tables queried via BigQuery (over Iceberg/BigLake).
- Transformations: Dataproc Serverless Spark (PySpark) batches for Iceberg support.
- Catalog & Governance: DataPlex Universal Catalog (auto-discovery, lineage).
- BI: Looker Studio (free) public dashboards.
- IaC: OpenTofu for everything, configured using HCL.
- CI/CD & Testing: GitHub Actions (lint, plan, tests, apply on merge).
repo-root/
├── .github/workflows/ # GitHub Actions CI/CD pipelines
│ ├── deploy.yml
│ └── ingestion.yml
├── infra/ # Core infrastructure (WIF, deploy SA, permissions)
│ ├── main.tf
│ └── ...
├── ingestion/
│ └── synthetic-meat/ # Source for the Cloud Run ingestion service
│ ├── src/
│ ├── tests/
│ ├── Dockerfile
│ └── pyproject.toml
├── warehouse/ # Data platform infrastructure (GCS, Dataplex, etc.)
│ ├── main.tf
│ └── ...
└── README.md
- Create a new GCP project, enable billing.
- Enable required APIs:
- Cloud Scheduler API
- Cloud Build API
- Dataproc API
- BigQuery API
- DataPlex API
- Cloud Storage API
- Install locally: gcloud CLI, OpenTofu, Git.
- Create GitHub repo and clone locally.
Deploy in this order:
- GCS buckets: `${project_id}-bronze`, `${project_id}-silver`, `${project_id}-deps` (for Spark jars/temp)
- DataPlex Lake with zones:
  - Lake: `meat-market-lake`
  - Zones: `raw` (bronze), `curated` (silver)
  - Assets linking buckets to zones
- BigQuery dataset: `gold_meat_market`
- Service accounts & IAM:
  - One for Dataproc (BigQuery, Storage, DataPlex roles)
  - BigLake connection (if needed for Iceberg catalog)
Use community modules where possible (e.g., GoogleCloudPlatform/cloud-foundation-fabric).
Validate locally: tofu init → fmt → validate → plan → apply.
This repository uses a two-part deployment strategy:
- Core Infrastructure (`infra/`): This configuration sets up the foundational components for CI/CD, including Workload Identity Federation, the deployment service account, and its project-level IAM permissions. Because it grants powerful permissions, it is designed to be applied manually from a local machine after careful review. Any changes to IAM roles in `infra/main.tf` must be applied locally before they will take effect in the CI/CD pipeline.

  ```bash
  # From your local machine, inside the infra/ directory
  tofu apply -var-file="prod.tfvars"
  ```

- Warehouse Infrastructure (`warehouse/`): This configuration defines the application-specific infrastructure, such as GCS buckets and Dataplex assets. It is deployed automatically by the GitHub Actions workflow (`.github/workflows/deploy.yml`) whenever changes are pushed to the `warehouse/` directory.
- Data Source: A synthetic data generator script (Python) that simulates a stream of meat processing data.
- Methodology:
- Use aggregated public data (e.g., from MLA) as a baseline for realistic distributions of weights (e.g., 250-400kg HSCW), grades, and prices.
- Use a library like `polars` to generate thousands of "fake" individual animal/carcass records (see the sketch after this list).
- Sample attributes like weight from normal distributions based on grade and animal class.
- Assign pseudo-random identifiers (e.g., RFID-style tags) for traceability.
- Calculate prices based on grid formulas, applying premiums/discounts for factors like marbling, fat depth, and yield.
- Include additional fields for rich analytics, such as slaughter date, processing plant ID, breed, and quality scores.
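As a rough illustration of this approach, here is a minimal generator sketch using `numpy` and `polars`. The field names, distribution parameters, and grid-pricing premiums are illustrative placeholders rather than values derived from MLA data, and `generate_batch` is a hypothetical helper, not the project's actual module.

```python
import uuid
from datetime import date

import numpy as np
import polars as pl

GRADES = ["MSA", "Prime", "Choice", "Standard"]  # illustrative grade labels
GRADE_WEIGHT_MEAN = {"MSA": 340.0, "Prime": 320.0, "Choice": 300.0, "Standard": 280.0}
BASE_PRICE_PER_KG = 6.50  # hypothetical grid base price


def generate_batch(n: int, plant_id: str, slaughter_date: date) -> pl.DataFrame:
    """Generate n synthetic carcass records for one plant and one slaughter date."""
    rng = np.random.default_rng()
    grades = rng.choice(GRADES, size=n, p=[0.35, 0.30, 0.25, 0.10])

    # Sample HSCW from a normal distribution centred on the grade mean, clipped to 250-400 kg.
    weights = np.clip(rng.normal([GRADE_WEIGHT_MEAN[g] for g in grades], 25.0), 250.0, 400.0)
    marbling = rng.integers(1, 10, size=n)
    fat_depth_mm = np.clip(rng.normal(8.0, 3.0, size=n), 1.0, 25.0)

    # Grid pricing: base price plus illustrative premiums/discounts for marbling and fat depth.
    price_per_kg = (
        BASE_PRICE_PER_KG
        + 0.05 * (marbling - 3)
        - 0.02 * np.abs(fat_depth_mm - 8.0)
    )

    return pl.DataFrame(
        {
            "carcass_id": [str(uuid.uuid4()) for _ in range(n)],
            "rfid_tag": [f"982{rng.integers(10**12, 10**13)}" for _ in range(n)],  # RFID-style tags
            "plant_id": [plant_id] * n,
            "slaughter_date": [slaughter_date] * n,
            "grade": grades.tolist(),
            "hscw_kg": np.round(weights, 1),
            "marbling_score": marbling,
            "fat_depth_mm": np.round(fat_depth_mm, 1),
            "price_per_kg": np.round(price_per_kg, 2),
        }
    )
```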
- Container: The data generation logic is packaged as a Docker container and deployed as a serverless Cloud Run service.
- Execution: A Cloud Scheduler job triggers the Cloud Run service via an HTTP request on a daily schedule.
- The service generates a new batch of data upon each invocation.
- It converts the generated data to Parquet format.
- It writes the partitioned data to the bronze GCS bucket, e.g., `gs://bronze/carcasses/year=2025/month=12/day=27/plant_id=P01/batch_12345.parquet` (see the service sketch after this list).
- Discovery: DataPlex automatically discovers the new Parquet files as they land, making them available for querying via BigLake.
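The service entrypoint might look roughly like the sketch below, assuming a Flask app, the hypothetical `generate_batch` helper from the generator sketch, and a `BRONZE_BUCKET` environment variable set on the Cloud Run service; the module layout and names are assumptions, not the repository's actual code.

```python
import io
import os
from datetime import date, datetime, timezone

from flask import Flask, jsonify
from google.cloud import storage

from generator import generate_batch  # hypothetical module from the generator sketch

app = Flask(__name__)
BRONZE_BUCKET = os.environ["BRONZE_BUCKET"]  # e.g. "<project_id>-bronze"


@app.post("/")
def ingest():
    """Generate one batch and write it to the bronze bucket as partitioned Parquet."""
    today = date.today()
    plant_id = "P01"  # a real service would iterate over configured plants
    df = generate_batch(n=500, plant_id=plant_id, slaughter_date=today)

    # Hive-style partitioned path matching the layout shown above.
    blob_path = (
        f"carcasses/year={today.year}/month={today.month:02d}/day={today.day:02d}/"
        f"plant_id={plant_id}/batch_{datetime.now(timezone.utc):%H%M%S}.parquet"
    )

    # Serialise to Parquet in memory, then upload to GCS.
    buf = io.BytesIO()
    df.write_parquet(buf)
    storage.Client().bucket(BRONZE_BUCKET).blob(blob_path).upload_from_string(buf.getvalue())
    return jsonify({"rows": df.height, "path": f"gs://{BRONZE_BUCKET}/{blob_path}"})
```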
Use Dataproc Serverless PySpark batch:
- Catalog: BigLakeCatalog (integrated with DataPlex).
- Read bronze Parquet.
- Build DV2 entities:
- Hub_Carcass (business key: carcass_id/rfid_tag)
- Hub_Processor (business key: plant_id)
- Sat_Carcass_Details (quality scores, weights, grades)
- Link_Carcass_Processing (linking carcasses to processing events)
- Write as Iceberg tables in silver bucket, partitioned appropriately.
- Trigger: initially manual (`gcloud dataproc batches submit`), later via Cloud Scheduler or Pub/Sub on new bronze files (see the PySpark sketch below).
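A minimal sketch of the hub/satellite load, assuming the batch is submitted with an Iceberg catalog named `silver` configured through Spark properties; the catalog, namespace, and column names are assumptions, and Hub_Processor / Link_Carcass_Processing would follow the same hash-key pattern.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Assumes the Dataproc Serverless batch is submitted with an Iceberg catalog configured,
# e.g. spark.sql.catalog.silver=org.apache.iceberg.spark.SparkCatalog plus a GCS warehouse path.
spark = SparkSession.builder.appName("dv2-silver-load").getOrCreate()

bronze = spark.read.parquet("gs://<project_id>-bronze/carcasses/")  # substitute your project id

# Hub_Carcass: one row per business key, with a deterministic hash key and load metadata.
hub_carcass = (
    bronze.select("carcass_id", "rfid_tag")
    .dropDuplicates(["carcass_id"])
    .withColumn("hub_carcass_hk", F.sha2(F.col("carcass_id").cast("string"), 256))
    .withColumn("load_ts", F.current_timestamp())
    .withColumn("record_source", F.lit("synthetic-meat"))
)

# Sat_Carcass_Details: descriptive attributes hung off the hub, with a hashdiff for change detection.
sat_carcass_details = (
    bronze.withColumn("hub_carcass_hk", F.sha2(F.col("carcass_id").cast("string"), 256))
    .withColumn(
        "hashdiff",
        F.sha2(F.concat_ws("||", "grade", "hscw_kg", "marbling_score", "fat_depth_mm", "price_per_kg"), 256),
    )
    .withColumn("load_ts", F.current_timestamp())
    .select("hub_carcass_hk", "hashdiff", "grade", "hscw_kg", "marbling_score",
            "fat_depth_mm", "price_per_kg", "load_ts")
)

# First run: create the Iceberg tables in the silver bucket; incremental runs would use .append().
hub_carcass.writeTo("silver.dv.hub_carcass").createOrReplace()
(
    sat_carcass_details.writeTo("silver.dv.sat_carcass_details")
    .partitionedBy(F.days("load_ts"))
    .createOrReplace()
)
```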
- Create dimension/fact tables (e.g., dim_product, dim_date, fact_trades).
- Materialize as:
- BigQuery native tables (recommended), or
- Iceberg tables queried via BigLake.
- Use views in BigQuery for the final Kimball schema (see the sketch below).
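The gold build could be driven from Python with the `google-cloud-bigquery` client, as in the rough sketch below; the `silver_meat_market` dataset (BigLake tables over the silver Iceberg layer) and all table/column names are assumptions for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Materialise one dimension as a native table, then expose a fact view for the star schema.
# Dataset, table, and column names here are illustrative placeholders.
client.query(
    """
    CREATE OR REPLACE TABLE gold_meat_market.dim_processor AS
    SELECT DISTINCT hub_processor_hk AS processor_key, plant_id
    FROM silver_meat_market.hub_processor
    """
).result()

client.query(
    """
    CREATE OR REPLACE VIEW gold_meat_market.fact_carcass AS
    SELECT
      s.hub_carcass_hk           AS carcass_key,
      l.hub_processor_hk         AS processor_key,
      s.grade,
      s.hscw_kg,
      s.price_per_kg,
      s.hscw_kg * s.price_per_kg AS carcass_value,
      DATE(s.load_ts)            AS load_date
    FROM silver_meat_market.sat_carcass_details AS s
    JOIN silver_meat_market.link_carcass_processing AS l USING (hub_carcass_hk)
    """
).result()
```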
- Connect Looker Studio to the BigQuery `gold_meat_market` dataset.
- Build dashboards:
- Carcass weight distribution by grade
- Average price per kg over time
- Yield analysis by processing plant
- Make dashboards public (share link) for portfolio demo.
GitHub Actions Workflow (on push/PR and merge):
- OpenTofu fmt/validate/plan
- Unit tests (pytest) for the data generator and transformation logic.
- Integration tests (optional separate test project):
- Deploy infra
- Trigger ingestion
- Assert files in GCS
- Run Spark job
- Query BigQuery for expected rows
- On main merge (with approval): `tofu apply`
Testing Tips:
- Mock API calls in unit tests.
- Use local Spark for DV logic testing.
- Keep tests fast and idempotent (see the pytest sketch below).
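For example, unit tests for the hypothetical `generate_batch` helper could assert on schema, value ranges, and key uniqueness without touching GCP at all (a sketch, assuming the generator shape from earlier):

```python
from datetime import date

from generator import generate_batch  # hypothetical module from the generator sketch


def test_batch_size_and_schema():
    df = generate_batch(n=100, plant_id="P01", slaughter_date=date(2025, 12, 27))
    assert df.height == 100
    assert {"carcass_id", "plant_id", "grade", "hscw_kg", "price_per_kg"} <= set(df.columns)


def test_weights_within_grid_bounds():
    # HSCW should stay within the 250-400 kg range assumed by the generator.
    df = generate_batch(n=1_000, plant_id="P01", slaughter_date=date(2025, 12, 27))
    assert df["hscw_kg"].min() >= 250.0
    assert df["hscw_kg"].max() <= 400.0


def test_carcass_ids_are_unique():
    df = generate_batch(n=500, plant_id="P01", slaughter_date=date(2025, 12, 27))
    assert df["carcass_id"].n_unique() == 500
```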
- OpenTofu apply → DataPlex lake + buckets visible.
- Run data generator → Files land in bronze → BigLake table auto-created.
- Run Dataproc Spark job → Iceberg tables in silver → Queryable in BigQuery.
- Build Looker Studio dashboard → Data visualized.
- CI/CD pipeline runs successfully on a commit.