Enterprise-grade Data Platform for NYC Taxi Analytics. Orchestrated with Airflow (Astro) & dbt, served via FastAPI & Power BI. Features Medallion Architecture, Data Quality Observability (Slack), and Star Schema modeling.

ChahiriAbderrahmane/modern-data-stack-nyc-taxi

🚖 NYC Taxi Data Engineering Platform

End-to-End ELT Pipeline | Data Warehouse | BI & API Microservices

An enterprise-grade Data Engineering project transforming raw NYC Taxi data into actionable insights via a modern stack: Airflow, dbt, PostgreSQL, FastAPI, Power BI, and Slack.

Project Architecture


Table of Contents

  1. Project Overview
  2. Architecture & Data Modeling
  3. Business Intelligence (Dashboards)
  4. Orchestration (Airflow)
  5. Data Products (API)
  6. Observability & Alerting
  7. Performance & Optimization
  8. Installation
  9. Contact

🔭 Project Overview

This project simulates a real-world data platform for a taxi company. It ingests high-volume trip data, cleanses it, models it into a Star Schema, and serves it to different stakeholders (Executives, Operations, Finance) through dashboards and APIs.

Key Features:

  • ELT Pipeline: Ingestion of raw CSVs into Bronze/Silver/Gold layers using dbt and Postgres.
  • Data Quality: Automated testing and "Revenue at Risk" calculation to detect anomalies (negative fares, time travel).
  • Microservice API: A standalone FastAPI container serving Gold data to external apps.
  • Observability: Slack alerting for data quality breaches.
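As a rough illustration of the "Revenue at Risk" idea, the sketch below flags trips with a negative fare or a drop-off earlier than the pickup and sums their amounts. The `Trip` class, the `total_amount` field, the reading of "time travel" as drop-off before pickup, and the choice to sum absolute amounts are all assumptions made for this example; the project's dbt models may define the metric differently.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Trip:
    pickup: datetime
    dropoff: datetime
    total_amount: float  # hypothetical column name


def is_invalid(trip: Trip) -> bool:
    """Flag the two anomalies named above: negative fares and 'time travel'."""
    negative_fare = trip.total_amount < 0
    time_travel = trip.dropoff < trip.pickup  # assumed meaning of "time travel"
    return negative_fare or time_travel


def revenue_at_risk(trips: list[Trip]) -> float:
    """Sum the absolute amounts of invalid trips (one possible definition)."""
    return sum(abs(t.total_amount) for t in trips if is_invalid(t))


# Sample values invented for illustration.
trips = [
    Trip(datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 1, 8, 20), 25.0),   # valid
    Trip(datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 15), -12.5),  # negative fare
    Trip(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 9, 50), 40.0),  # drop-off before pickup
]
print(revenue_at_risk(trips))  # 52.5
```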

🏗️ Architecture & Data Modeling

The project follows the Medallion Architecture (Bronze -> Silver -> Gold).

The Star Schema (Gold Layer)

I transformed the data into a rigorous dimensional model optimized for BI performance.

Star Schema

Entity Relationship Diagram (ERD) generated from the Gold Layer.

Aggregations for BI

To handle millions of rows efficiently in Power BI, dedicated data mart aggregate views were created with dbt. The "_Key Measures" table was created in Power BI to hold the measures written in DAX.

Aggregation Tables

📊 Business Intelligence (Power BI)

The final product is a comprehensive Power BI Report (.pbit) containing 4 specialized views.

1. Executive Pulse (C-Level)

Focus: Year-over-Year growth, Total Revenue, and high-level trends.

Executive Dashboard

2. Operations & Traffic (Fleet Managers)

Focus: Filled map, Borough-to-Borough flow, and RPM (Revenue Per Minute) optimization.

Ops Dashboard

3. Financial Performance (Finance Department)

Focus: Payment method adoption (Cash vs Card), tipping behavior, and fare buckets.

Finance Dashboard

4. Data Quality Monitor (Data Engineering Team)

Focus: Pipeline health, invalid records tracking, and Revenue at Risk ($).

Quality Dashboard

Feature Highlight: Hovering over a data point shows a tooltip with granular details. The tooltip is currently available only on the line chart of the first dashboard.

Tooltip

🌪️ Orchestration (Apache Airflow)

The entire pipeline is orchestrated with Apache Airflow, run through the Astro CLI.

The Main Pipeline

Handles the end-to-end flow: dbt run (Bronze/Silver/Gold), dbt test, and data freshness checks.

Main DAG
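The ordering of the main pipeline can be sketched as a list of dbt commands run one after another, stopping at the first failure. This is a simplified stand-in for the Airflow DAG, not its actual definition: the `tag:bronze`, `tag:silver`, and `tag:gold` selectors assume the models are tagged by layer, and using `dbt source freshness` for the freshness check is likewise an assumption.

```python
import subprocess

# Hypothetical selectors: this assumes dbt models are tagged by layer.
PIPELINE_STEPS = [
    ["dbt", "run", "--select", "tag:bronze"],
    ["dbt", "run", "--select", "tag:silver"],
    ["dbt", "run", "--select", "tag:gold"],
    ["dbt", "test"],
    ["dbt", "source", "freshness"],  # assumed form of the freshness check
]


def run_pipeline(steps=PIPELINE_STEPS, runner=subprocess.run):
    """Run each step in order; check=True makes a failing step raise and stop the run."""
    for cmd in steps:
        runner(cmd, check=True)
```

Calling `run_pipeline()` inside a dbt project would execute the steps for real. In Airflow, each step would more naturally be its own task so that failures and retries are tracked per step.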

Static Dimensions & Utility DAGs

A separate DAG manages static data, so the main pipeline does not rebuild it on every run and finishes faster.

static dimensions

🚀 Data Products: FastAPI Microservice

Beyond dashboards, this project exposes a REST API for application developers. The API runs in an isolated Docker container but communicates with the same Data Warehouse.

  • Endpoint: /metrics/daily (Supports date filtering)
  • Architecture: Dockerized FastAPI service networked with Postgres.
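The date filtering behind /metrics/daily might look roughly like the function below. It is a framework-free sketch: the `start` and `end` parameter names, the row fields, and the sample values are assumptions for illustration, and the in-memory list stands in for a query against the Gold layer in Postgres.

```python
from datetime import date
from typing import Optional

# Stand-in for rows read from a Gold-layer daily aggregate (values invented for illustration).
DAILY_METRICS = [
    {"day": date(2024, 1, 1), "trips": 120, "revenue": 2400.0},
    {"day": date(2024, 1, 2), "trips": 95, "revenue": 1900.0},
    {"day": date(2024, 1, 3), "trips": 130, "revenue": 2750.0},
]


def daily_metrics(start: Optional[date] = None, end: Optional[date] = None) -> list[dict]:
    """Return rows whose day falls in [start, end]; a missing bound is open-ended."""
    return [
        row
        for row in DAILY_METRICS
        if (start is None or row["day"] >= start)
        and (end is None or row["day"] <= end)
    ]


# Roughly what a request filtered from 2024-01-02 onward might return.
print(daily_metrics(start=date(2024, 1, 2)))
```

In a FastAPI handler, the two optional dates would typically be declared as query parameters, and the filter would be pushed into the SQL query rather than applied in Python.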

FastAPI Response

🚨 Observability & Alerting

I implemented Reverse ETL logic that proactively notifies the team when data quality degrades. If the Revenue at Risk exceeds a threshold (e.g., $10k), a Slack alert is triggered automatically.
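A minimal sketch of the threshold check follows, assuming the alert is delivered through a Slack incoming webhook. The function names, the message wording, and the webhook delivery itself are assumptions; the project's alerting DAG may use a different mechanism, such as an Airflow Slack provider.

```python
import json
import urllib.request
from typing import Optional

THRESHOLD_USD = 10_000  # example threshold taken from the text above


def build_alert(revenue_at_risk: float, threshold: float = THRESHOLD_USD) -> Optional[dict]:
    """Return a Slack message payload if the threshold is exceeded, else None."""
    if revenue_at_risk <= threshold:
        return None
    return {
        "text": (
            f":rotating_light: Revenue at Risk is ${revenue_at_risk:,.2f}, "
            f"above the ${threshold:,.2f} threshold."
        )
    }


def send_alert(payload: dict, webhook_url: str) -> None:
    """POST the payload to a Slack incoming webhook (URL supplied by the caller)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)


print(build_alert(12_500.0))  # payload dict, because 12,500 > 10,000
print(build_alert(8_000.0))   # None, because the threshold is not exceeded
```

Keeping the decision (`build_alert`) separate from the delivery (`send_alert`) lets the threshold logic be tested without calling Slack.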

Alerting DAG

Slack Alert Message

⚡ Performance & Optimization

I optimized the pipeline architecture by decoupling static data processing from the daily workflow.

Initially, the DAG was monolithic and rebuilt all dimensions and facts on every run.

Strategy: I extracted static dimensions into a separate DAG (static_dimensions_dag) that runs only on demand, leaving the main pipeline to process only new incoming trip data.

Before Optimization (screenshot: Before)

  • Monolithic DAG: static dimensions and facts were rebuilt on every run (high latency).

After Optimization (screenshot: After)

  • Decoupled architecture: static dimensions are separated and only new data is processed (drastic reduction in runtime).
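One common way to process only new data is to compare each record against a watermark, the timestamp of the last successful load. The sketch below illustrates that idea only; the README does not say how the pipeline identifies new trips, so the `pickup` field, the watermark, and the sample rows are assumptions.

```python
from datetime import datetime


def select_new_trips(trips: list[dict], last_loaded: datetime) -> list[dict]:
    """Keep only trips picked up after the last successfully loaded timestamp."""
    return [t for t in trips if t["pickup"] > last_loaded]


# Sample rows invented for illustration.
trips = [
    {"id": 1, "pickup": datetime(2024, 1, 1, 8, 0)},
    {"id": 2, "pickup": datetime(2024, 1, 2, 9, 30)},
    {"id": 3, "pickup": datetime(2024, 1, 3, 7, 45)},
]
watermark = datetime(2024, 1, 2, 0, 0)
print([t["id"] for t in select_new_trips(trips, watermark)])  # [2, 3]
```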

Airflow UI DAGs

💻 How to Run

Prerequisites

  • Docker & Docker Compose
  • Astro CLI
  • Power BI Desktop (to view .pbit)

Steps

  1. Clone the repository

    git clone https://github.com/ChahiriAbderrahmane/modern-data-stack-nyc-taxi.git
  2. Start the Data Platform (Airflow + Postgres)

    astro dev start
  3. Start the API Microservice

    docker compose -f docker-compose-api.yml up --build
  4. Access the Interfaces

    • Airflow: http://localhost:8080
    • FastAPI Docs: http://localhost:8000/docs
    • Power BI: Open assets/nyc_project_dashboard.pbit

📨 Contact Me

LinkedIn • Gmail

Made with ❤️ by Abderrahmane Chahiri
