Sensor Data ETL and Report API

This project implements an ETL infrastructure for sensor data with SQL-based reporting and a REST API.

Project Structure

.
├── data/                   # Sensor data files
├── etl/                    # ETL pipeline components
│   ├── base.py             # Abstract base classes
│   ├── extractors.py       # CSV data extraction
│   ├── transformers.py     # Data transformation
│   ├── loaders.py          # DuckDB loading
│   └── pipeline.py         # ETL orchestration
├── api/                    # FastAPI application
│   └── main.py             # REST API endpoints
├── reports/                # SQL reports
│   └── summary_report.sql  # Summary report query
├── tests/                  # Unit and integration tests
│   ├── test_etl.py         # ETL tests
│   ├── test_report.py      # Report tests
│   └── test_api.py         # API tests
├── main.py                 # ETL entry point
├── generate_report.py      # Report generation script
└── verify_db.py            # Database verification script

Installation

pip install -r requirements.txt

Usage

Part 1: ETL Pipeline

Run the ETL to load sensor data into DuckDB:

python main.py

This creates sensors.db with the sensor_readings table.

Storage Target: DuckDB database file (sensors.db)

Table Schema:

CREATE TABLE sensor_readings (
    machine_code VARCHAR,
    component_code VARCHAR,
    coordinate VARCHAR,
    sample_time BIGINT,      -- Epoch microseconds
    value DOUBLE,
    inserted_at TIMESTAMP
);
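
For a quick sanity check after the load (in the spirit of verify_db.py, whose exact contents may differ), the database can be opened with the duckdb Python client; DuckDB's make_timestamp() turns the epoch-microsecond sample_time back into a readable timestamp:

import duckdb

# Open the database produced by the ETL run
con = duckdb.connect("sensors.db")

# Row count of the loaded readings
print(con.execute("SELECT COUNT(*) FROM sensor_readings").fetchone())

# Time range, converting epoch microseconds to timestamps
print(con.execute("""
    SELECT make_timestamp(MIN(sample_time)) AS first_sample,
           make_timestamp(MAX(sample_time)) AS last_sample
    FROM sensor_readings
""").fetchone())

con.close()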

Part 2: Generate Report

Generate the summary report:

python generate_report.py

This prints the report to the console and saves it to report.csv.
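
A minimal sketch of what such a script can look like (the actual generate_report.py may differ; the Polars round-trip via Arrow is an assumption based on the stack described below):

import duckdb
import polars as pl

# Load the summary query shipped in reports/
with open("reports/summary_report.sql") as f:
    query = f.read()

con = duckdb.connect("sensors.db")

# Run the report and pull the result into Polars via Arrow
report = pl.from_arrow(con.execute(query).arrow())

print(report)                   # console output
report.write_csv("report.csv")  # persisted copy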

Part 3: API Endpoint

Start the API server:

uvicorn api.main:app --reload

Test the API:

curl -X POST "http://localhost:8000/report" \
  -F "sensor_files=@2024-01-01.csv" \
  -F "sensor_files=@2024-01-02.csv" \
  -F "sensors_metadata=@Sensors.csv" \
  -F "machines_metadata=@Machines.csv"

API Documentation: http://localhost:8000/docs
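
The same request from Python with the requests library (field names mirror the curl call above; the file paths are placeholders):

import requests

# Repeated field names send multiple files under "sensor_files"
files = [
    ("sensor_files", open("2024-01-01.csv", "rb")),
    ("sensor_files", open("2024-01-02.csv", "rb")),
    ("sensors_metadata", open("Sensors.csv", "rb")),
    ("machines_metadata", open("Machines.csv", "rb")),
]

response = requests.post("http://localhost:8000/report", files=files)
response.raise_for_status()
print(response.json())  # assumes the endpoint returns JSON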

Testing

Run all tests:

pytest

Run specific test suites:

pytest tests/test_etl.py      # ETL tests
pytest tests/test_report.py   # Report tests
pytest tests/test_api.py      # API tests
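
As one illustration of how an API test can be written with FastAPI's built-in test client (a sketch; the actual tests in tests/test_api.py may be structured differently):

from fastapi.testclient import TestClient

from api.main import app

client = TestClient(app)

def test_report_requires_files():
    # Posting without the required file fields should fail validation
    response = client.post("/report")
    assert response.status_code == 422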

Design & Extensibility

Abstract Base Classes

The ETL pipeline uses abstract base classes (Extractor, Transformer, Loader; a sketch follows the list below) to enable easy swapping of:

  • Data sources: CSV → JSON, Parquet, API, etc.
  • Processing engines: Polars → Pandas, DuckDB, Spark, etc.
  • Storage targets: DuckDB → PostgreSQL, MySQL, Parquet, etc.
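
A minimal sketch of the shape these base classes take (the real signatures in etl/base.py may differ):

from abc import ABC, abstractmethod
from typing import Any

class Extractor(ABC):
    @abstractmethod
    def extract(self) -> Any:
        """Read raw data from the source (e.g. CSV files)."""

class Transformer(ABC):
    @abstractmethod
    def transform(self, data: Any) -> Any:
        """Clean and reshape the extracted data."""

class Loader(ABC):
    @abstractmethod
    def load(self, data: Any) -> None:
        """Write the transformed data to the storage target."""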

Technology Stack

  • Polars: Fast, memory-efficient data processing
  • DuckDB: Embedded analytical database
  • FastAPI: Modern, high-performance API framework
  • PyArrow: Zero-copy data transfer between Polars and DuckDB
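
The zero-copy hand-off between Polars and DuckDB works roughly like this (a sketch; the pipeline's actual loader lives in etl/loaders.py):

import duckdb
import polars as pl

# Toy frame standing in for a transformed batch of readings
df = pl.DataFrame({"machine_code": ["M1", "M2"], "value": [0.5, 0.7]})

con = duckdb.connect()

# df.to_arrow() is zero-copy; register() lets DuckDB scan it in place
con.register("readings", df.to_arrow())
print(con.execute(
    "SELECT machine_code, AVG(value) FROM readings GROUP BY machine_code"
).fetchall())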

Report Output

The summary report includes:

  • machine_name: User-friendly machine name
  • coordinate: Coordinate with the highest increase in average value between the two days
  • value_avg: Average value on 2024-01-02
  • increase_in_value: Increase in average value from 2024-01-01 to 2024-01-02
  • samples_cnt: Number of samples
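
One possible shape for a query producing these columns (a sketch only; the actual reports/summary_report.sql may differ, and mapping machine_code to the user-friendly machine_name would need a join against the machines metadata, omitted here):

WITH daily AS (
    SELECT machine_code,
           coordinate,
           CAST(make_timestamp(sample_time) AS DATE) AS day,
           AVG(value) AS value_avg,
           COUNT(*)   AS samples_cnt
    FROM sensor_readings
    GROUP BY ALL
)
SELECT d2.machine_code,
       d2.coordinate,
       d2.value_avg,
       d2.value_avg - d1.value_avg AS increase_in_value,
       d2.samples_cnt
FROM daily d1
JOIN daily d2
  ON d1.machine_code = d2.machine_code AND d1.coordinate = d2.coordinate
WHERE d1.day = DATE '2024-01-01'
  AND d2.day = DATE '2024-01-02'
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY d2.machine_code
    ORDER BY d2.value_avg - d1.value_avg DESC
) = 1;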
