MLOps Airflow Pipeline

A comprehensive MLOps platform built with Apache Airflow for orchestrating machine learning workflows, including data preprocessing, model training, inference, and monitoring.

πŸ“‹ Table of Contents

  • Overview
  • Features
  • Architecture
  • Prerequisites
  • Quick Start
  • Project Structure
  • Configuration
  • Running the Pipeline
  • DAGs Overview
  • Development
  • Troubleshooting
  • Cleaning Up
  • Contributing
  • Additional Resources

🎯 Overview

This project provides a complete MLOps solution using Apache Airflow for:

  • Data Ingestion: Automated data extraction from SQL databases
  • Data Preprocessing: Feature engineering and data transformation pipelines
  • Model Training: Automated model training with MLflow integration
  • Model Inference: Production inference pipelines
  • Monitoring: Model performance and data drift monitoring

✨ Features

  • 🐳 Dockerized Environment: Complete containerized setup with Docker Compose
  • πŸ”„ Automated Workflows: End-to-end ML pipelines with dependency management
  • πŸ“Š MLflow Integration: Model versioning and experiment tracking (a minimal usage sketch follows this list)
  • πŸ““ Notebook Execution: Papermill integration for parameterized notebook execution
  • πŸ—„οΈ Database Connectivity: Support for MSSQL and other databases
  • πŸ“ˆ Monitoring: Built-in monitoring with Flower and custom metrics
  • πŸ”§ Flexible Configuration: Environment-based configuration management
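
For the MLflow integration, experiment tracking boils down to a few calls in your training code. A minimal sketch follows; the tracking URI and experiment name are placeholders, not values defined by this repository:

# Minimal MLflow tracking sketch; URI and experiment name are placeholders.
import mlflow

mlflow.set_tracking_uri("http://mlflow:5000")   # hypothetical tracking server URL
mlflow.set_experiment("example-experiment")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.1)      # illustrative values
    mlflow.log_metric("auc", 0.87)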

πŸ—οΈ Architecture

The platform consists of the following components:

  • Airflow Webserver: Web UI for managing workflows (Port 8080)
  • Airflow Scheduler: Orchestrates task execution
  • Airflow Worker: Executes tasks using CeleryExecutor
  • Flower: Monitoring dashboard for Celery workers (Port 5555)
  • PostgreSQL: Metadata database for Airflow
  • Redis: Message broker for task distribution
  • MLflow: Model registry and experiment tracking

πŸ”§ Prerequisites

Before getting started, ensure you have the following installed:

  • Docker: Community Edition (CE) with at least 4GB memory allocation
  • Docker Compose: Version 1.29.1 or newer
  • Git: For version control
  • Python 3.8+: For local development (optional)

System Requirements

  • Memory: Minimum 8GB RAM (4GB allocated to Docker)
  • Storage: At least 10GB free disk space
  • OS: Windows 10/11, macOS, or Linux

πŸš€ Quick Start

1. Clone the Repository

git clone <repository-url>
cd mlops-airflow

2. Set Up Environment

Create the required directories and environment file:

# Create the required directories
mkdir -p ./dags ./logs ./plugins

# Create the .env file (Windows)
echo AIRFLOW_UID=50000 > .env

# Create the .env file (Linux/macOS)
echo "AIRFLOW_UID=$(id -u)" > .env

3. Build and Initialize

# Navigate to docker directory
cd docker

# Build custom images
docker-compose build

# Initialize the database
docker-compose up airflow-init

4. Start the Platform

# Start all services
docker-compose up -d

# Check container health
docker-compose ps

5. Access the Web Interface

Open http://localhost:8080 in your browser and log in with the default credentials airflow / airflow.

πŸ“ Project Structure

mlops-airflow/
β”œβ”€β”€ artifacts/                 # Generated artifacts and outputs
β”œβ”€β”€ docker/                   # Docker configuration
β”‚   β”œβ”€β”€ docker-compose.yml    # Main compose file
β”‚   β”œβ”€β”€ Dockerfile            # Custom Airflow image
β”‚   β”œβ”€β”€ requirements.txt      # Python dependencies
β”‚   β”œβ”€β”€ airflow_worker/       # Worker-specific configuration
β”‚   β”œβ”€β”€ config/               # Airflow configuration files
β”‚   └── mlflow_dockerfile/    # MLflow service configuration
β”œβ”€β”€ mlproject/                # Main project code
β”‚   β”œβ”€β”€ clients/              # Client-specific implementations
β”‚   β”œβ”€β”€ dags/                 # Airflow DAGs
β”‚   β”‚   β”œβ”€β”€ agent_rigor.py    # Data quality validation
β”‚   β”‚   β”œβ”€β”€ geo.py            # Geography processing
β”‚   β”‚   β”œβ”€β”€ inference_dag.py  # Model inference pipeline
β”‚   β”‚   β”œβ”€β”€ populate.py       # Data population
β”‚   β”‚   β”œβ”€β”€ notebooks/        # Jupyter notebooks for processing
β”‚   β”‚   └── statements/       # SQL statements and queries
β”‚   └── engine/               # Core engine modules
β”‚       β”œβ”€β”€ config.py         # Configuration management
β”‚       β”œβ”€β”€ helpers/          # Helper utilities
β”‚       └── scripts/          # Execution scripts
β”œβ”€β”€ prj_requirements/         # Project requirements
β”œβ”€β”€ tables/                   # Database table definitions
└── README.md                # This file

βš™οΈ Configuration

Environment Variables

Key configuration options in your .env file:

# Airflow Configuration
AIRFLOW_UID=50000
AIRFLOW_IMAGE_NAME=apache/airflow:2.5.1

# Database Configuration
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow

# MLflow Configuration
MLFLOW_BACKEND_STORE_URI=sqlite:///mlflow.db
MLFLOW_DEFAULT_ARTIFACT_ROOT=./mlruns
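
Inside the containers these values are exposed as ordinary environment variables. As an illustration only (this is not the project's actual config.py), a task can read them at runtime like this:

# Illustrative only: reading the .env-provided values at runtime.
import os

mlflow_store = os.getenv("MLFLOW_BACKEND_STORE_URI", "sqlite:///mlflow.db")
artifact_root = os.getenv("MLFLOW_DEFAULT_ARTIFACT_ROOT", "./mlruns")
postgres_db = os.getenv("POSTGRES_DB", "airflow")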

Custom Dependencies

The project includes machine learning and data processing libraries:

  • Data Processing: pandas, numpy, xlrd, unidecode
  • ML Libraries: lightgbm, xgboost, scikit-learn, imblearn
  • Database: pymssql for SQL Server connectivity
  • Notebook Execution: papermill, apache-airflow-providers-papermill
  • Geospatial: geopy for location processing
  • Optimization: hyperopt for hyperparameter tuning (see the sketch after this list)
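
As an example of the optimization dependency, hyperopt is typically used to search a parameter space against a validation objective. A minimal, self-contained sketch (the objective and search space are illustrative, not taken from this project's notebooks):

# Illustrative hyperopt search; the objective and space are placeholders.
from hyperopt import Trials, fmin, hp, tpe

def objective(params):
    # A real pipeline would train a model (e.g. LightGBM) with these
    # parameters and return a validation loss.
    return (params["learning_rate"] - 0.1) ** 2

space = {"learning_rate": hp.uniform("learning_rate", 0.01, 0.3)}

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=20, trials=Trials())
print(best)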

πŸƒβ€β™‚οΈ Running the Pipeline

Starting the Platform

cd docker
docker-compose up -d

Accessing Services

  1. Airflow Web UI: Navigate to http://localhost:8080
  2. Login: Use airflow / airflow
  3. Enable DAGs: Toggle the DAGs you want to run
  4. Monitor: Use the Graph View to monitor execution

CLI Operations

Execute Airflow commands:

# Run airflow commands
docker-compose exec airflow-worker airflow info

# Access interactive shell
docker-compose exec airflow-worker bash

# View logs
docker-compose logs airflow-scheduler

πŸ“Š DAGs Overview

Available Workflows

  1. agent_rigor.py: Data quality validation and cleansing
  2. geo.py: Geospatial data processing and enrichment
  3. inference_dag.py: Model inference and prediction pipeline
  4. populate.py: Database population and data ingestion

Notebook Execution

The platform executes Jupyter notebooks as part of the workflow (an example task is sketched after this list):

  • data_split.ipynb: Training/testing data splitting
  • main_data_prep.ipynb: Primary data preprocessing
  • inference_4_prod.ipynb: Production inference pipeline
  • geo.ipynb: Geographic data processing
  • utente.ipynb: User-specific data processing
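
Inside a DAG, these notebooks are typically run through the Papermill provider listed in the requirements. A minimal sketch of such a task (the paths and parameters are illustrative; check the actual DAG files for the real values):

# Sketch of a parameterized notebook task; paths and parameters are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.providers.papermill.operators.papermill import PapermillOperator

with DAG(
    dag_id="notebook_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    main_data_prep = PapermillOperator(
        task_id="main_data_prep",
        input_nb="/opt/airflow/dags/notebooks/main_data_prep.ipynb",     # assumed mount path
        output_nb="/opt/airflow/artifacts/main_data_prep_{{ ds }}.ipynb",
        parameters={"run_date": "{{ ds }}"},
    )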

πŸ› οΈ Development

Adding New DAGs

  1. Create your DAG file in mlproject/dags/ (a minimal skeleton is sketched after this list)
  2. Follow Airflow best practices
  3. Use the provided helper functions from utils.py
  4. Test locally before deployment
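
A minimal skeleton for a new DAG might look like the following; the task body is a placeholder, so swap in real work such as helpers from utils.py:

# DAG skeleton; the callable below is a placeholder.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def my_task():
    print("replace with real work, e.g. a helper from utils.py")

with DAG(
    dag_id="my_new_dag",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
    tags=["example"],
) as dag:
    PythonOperator(task_id="my_task", python_callable=my_task)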

Extending Dependencies

To add new Python packages:

  1. Update docker/requirements.txt
  2. Rebuild the Docker image:
    docker-compose build
    docker-compose up -d

Database Connections

Configure database connections in the Airflow UI:

  • Go to Admin β†’ Connections
  • Add your database connection details
  • Use the connection ID in your DAGs (a lookup sketch follows this list)
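
Once a connection exists, a DAG or helper can look it up by its connection ID. A minimal sketch using pymssql, which ships with the image; the connection ID mssql_default, table name, and query are placeholders:

# Placeholder connection ID and query; use the ID you created in Admin β†’ Connections.
import pymssql
from airflow.hooks.base import BaseHook

def fetch_rows():
    conn_cfg = BaseHook.get_connection("mssql_default")
    with pymssql.connect(server=conn_cfg.host, user=conn_cfg.login,
                         password=conn_cfg.password, database=conn_cfg.schema) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT TOP 10 * FROM some_table")
            return cur.fetchall()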

πŸ” Troubleshooting

Common Issues

Services won't start:

# Check logs
docker-compose logs

# Restart services
docker-compose restart

Permission issues (Linux/macOS):

# Fix ownership
sudo chown -R $(id -u):$(id -g) ./logs ./plugins

Out of memory:

  • Increase Docker memory allocation to 4GB+
  • Monitor container resource usage

Database connection errors:

  • Verify connection settings in Airflow UI
  • Check network connectivity
  • Validate credentials

Health Checks

# Check all container status
docker-compose ps

# View specific service logs
docker-compose logs [service-name]

# Test Airflow scheduler
docker-compose exec airflow-scheduler airflow scheduler --help

🧹 Cleaning Up

Stop Services

docker-compose down

Complete Cleanup (removes all data)

# Stop and remove everything
docker-compose down --volumes --rmi all

# Remove project directory (if needed)
# rm -rf /path/to/mlops-airflow

Restart from Scratch

# Clean up containers, volumes, and images
docker-compose down --volumes --rmi all --remove-orphans

# Start fresh
docker-compose up airflow-init
docker-compose up -d

πŸ“ Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Commit changes: git commit -am 'Add your feature'
  4. Push to branch: git push origin feature/your-feature
  5. Submit a Pull Request

πŸ“š Additional Resources


Note: This setup is optimized for development and testing. For production deployment, additional security configurations and resource optimizations are recommended.