A comprehensive MLOps platform built with Apache Airflow for orchestrating machine learning workflows, including data preprocessing, model training, inference, and monitoring.
- Overview
- Features
- Architecture
- Prerequisites
- Quick Start
- Project Structure
- Configuration
- Running the Pipeline
- DAGs Overview
- Development
- Troubleshooting
- Contributing
## Overview

This project provides a complete MLOps solution using Apache Airflow for:
- Data Ingestion: Automated data extraction from SQL databases
- Data Preprocessing: Feature engineering and data transformation pipelines
- Model Training: Automated model training with MLflow integration
- Model Inference: Production inference pipelines
- Monitoring: Model performance and data drift monitoring
## Features

- Dockerized Environment: Complete containerized setup with Docker Compose
- Automated Workflows: End-to-end ML pipelines with dependency management
- MLflow Integration: Model versioning and experiment tracking
- Notebook Execution: Papermill integration for parameterized notebook execution
- Database Connectivity: Support for MSSQL and other databases
- Monitoring: Built-in monitoring with Flower and custom metrics
- Flexible Configuration: Environment-based configuration management
## Architecture

The platform consists of the following components:
- Airflow Webserver: Web UI for managing workflows (Port 8080)
- Airflow Scheduler: Orchestrates task execution
- Airflow Worker: Executes tasks using CeleryExecutor
- Flower: Monitoring dashboard for Celery workers (Port 5555)
- PostgreSQL: Metadata database for Airflow
- Redis: Message broker for task distribution
- MLflow: Model registry and experiment tracking
## Prerequisites

Before getting started, ensure you have the following installed:
- Docker: Community Edition (CE) with at least 4GB memory allocation
- Docker Compose: Version 1.29.1 or newer
- Git: For version control
- Python 3.8+: For local development (optional)
- Memory: Minimum 8GB RAM (4GB allocated to Docker)
- Storage: At least 10GB free disk space
- OS: Windows 10/11, macOS, or Linux
## Quick Start

Clone the repository:

```bash
git clone <repository-url>
cd mlops-airflow
```

Create the required directories and environment file:

```bash
# Create necessary directories
mkdir -p ./logs ./plugins

# Create .env file (Windows)
echo AIRFLOW_UID=50000 > .env

# For Linux/macOS users
mkdir -p ./dags ./logs ./plugins
echo "AIRFLOW_UID=$(id -u)" > .env
```

Build the custom images and initialize the Airflow metadata database:

```bash
# Navigate to docker directory
cd docker

# Build custom images
docker-compose build

# Initialize the database
docker-compose up airflow-init
```

Start the services:

```bash
# Start all services
docker-compose up -d

# Check container health
docker-compose ps
```

Then access the platform:

- Airflow UI: http://localhost:8080
- Flower Dashboard: http://localhost:5555
- Default Credentials: `airflow` / `airflow`
## Project Structure

```text
mlops-airflow/
├── artifacts/                 # Generated artifacts and outputs
├── docker/                    # Docker configuration
│   ├── docker-compose.yml     # Main compose file
│   ├── Dockerfile             # Custom Airflow image
│   ├── requirements.txt       # Python dependencies
│   ├── airflow_worker/        # Worker-specific configuration
│   ├── config/                # Airflow configuration files
│   └── mlflow_dockerfile/     # MLflow service configuration
├── mlproject/                 # Main project code
│   ├── clients/               # Client-specific implementations
│   ├── dags/                  # Airflow DAGs
│   │   ├── agent_rigor.py     # Data quality validation
│   │   ├── geo.py             # Geography processing
│   │   ├── inference_dag.py   # Model inference pipeline
│   │   ├── populate.py        # Data population
│   │   ├── notebooks/         # Jupyter notebooks for processing
│   │   └── statements/        # SQL statements and queries
│   ├── engine/                # Core engine modules
│   ├── config.py              # Configuration management
│   ├── helpers/               # Helper utilities
│   └── scripts/               # Execution scripts
├── prj_requirements/          # Project requirements
├── tables/                    # Database table definitions
└── README.md                  # This file
```
## Configuration

Key configuration options in your .env file:

```bash
# Airflow Configuration
AIRFLOW_UID=50000
AIRFLOW_IMAGE_NAME=apache/airflow:2.5.1

# Database Configuration
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow

# MLflow Configuration
MLFLOW_BACKEND_STORE_URI=sqlite:///mlflow.db
MLFLOW_DEFAULT_ARTIFACT_ROOT=./mlruns
```

The project includes machine learning and data processing libraries:
- Data Processing: pandas, numpy, xlrd, unidecode
- ML Libraries: lightgbm, xgboost, scikit-learn, imblearn
- Database: pymssql for SQL Server connectivity
- Notebook Execution: papermill, apache-airflow-providers-papermill
- Geospatial: geopy for location processing
- Optimization: hyperopt for hyperparameter tuning
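Model training integrates with the MLflow service configured above. As a hedged sketch (the tracking URI, experiment name, and toy model below are placeholders rather than the project's actual training code, and the `mlflow` client is assumed to be installed in the worker image), a training task can log parameters, metrics, and a model like this:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder tracking URI; point this at the MLflow service in your deployment.
mlflow.set_tracking_uri("http://mlflow:5000")
mlflow.set_experiment("example-training")  # hypothetical experiment name

# Toy dataset standing in for the real preprocessed features.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)

    # Record hyperparameters, a metric, and the fitted model artifact.
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, artifact_path="model")
```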
## Running the Pipeline

Start the stack if it is not already running:

```bash
cd docker
docker-compose up -d
```

- Airflow Web UI: Navigate to http://localhost:8080
- Login: Use `airflow` / `airflow`
- Enable DAGs: Toggle the DAGs you want to run
- Monitor: Use the Graph View to monitor execution
Execute Airflow commands inside the containers:

```bash
# Run airflow commands
docker-compose exec airflow-worker airflow info

# Access interactive shell
docker-compose exec airflow-worker bash

# View logs
docker-compose logs airflow-scheduler
```

## DAGs Overview

- `agent_rigor.py`: Data quality validation and cleansing
- `geo.py`: Geospatial data processing and enrichment
- `inference_dag.py`: Model inference and prediction pipeline
- `populate.py`: Database population and data ingestion
The platform executes Jupyter notebooks as part of the workflow:
- `data_split.ipynb`: Training/testing data splitting
- `main_data_prep.ipynb`: Primary data preprocessing
- `inference_4_prod.ipynb`: Production inference pipeline
- `geo.ipynb`: Geographic data processing
- `utente.ipynb`: User-specific data processing
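These notebooks are wired into DAGs through Papermill. A hedged sketch (the DAG id, paths, and parameters are illustrative, not the project's actual DAG code) using the `PapermillOperator` from `apache-airflow-providers-papermill`:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.papermill.operators.papermill import PapermillOperator

# Illustrative DAG that runs one parameterized notebook via Papermill.
with DAG(
    dag_id="notebook_example",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    prep = PapermillOperator(
        task_id="main_data_prep",
        input_nb="/opt/airflow/dags/notebooks/main_data_prep.ipynb",  # assumed path inside the container
        output_nb="/opt/airflow/artifacts/main_data_prep_{{ ds }}.ipynb",
        parameters={"run_date": "{{ ds }}"},  # injected into the notebook's parameters cell
    )
```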
## Development

To add a new DAG (a minimal skeleton is sketched after this list):

- Create your DAG file in `mlproject/dags/`
- Follow Airflow best practices
- Use the provided helper functions from `utils.py`
- Test locally before deployment
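A minimal sketch of such a DAG, assuming a hypothetical helper `load_raw_data` exposed by `utils.py` (the real helpers, schedule, and task names will differ):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from utils import load_raw_data  # hypothetical helper; use the functions utils.py actually provides

default_args = {
    "owner": "mlops",
    "retries": 1,
}

with DAG(
    dag_id="my_new_pipeline",  # hypothetical DAG id
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_data",
        python_callable=load_raw_data,  # callable imported from utils.py
    )

    validate = PythonOperator(
        task_id="validate_data",
        python_callable=lambda: print("placeholder validation step"),
    )

    # Declare the task order: extract first, then validate.
    extract >> validate
```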
To add new Python packages:
- Update `docker/requirements.txt`
- Rebuild the Docker image and restart the services:

  ```bash
  docker-compose build
  docker-compose up -d
  ```
Configure database connections in the Airflow UI:
- Go to Admin → Connections
- Add your database connection details
- Use the connection ID in your DAGs, as in the sketch below
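A hedged sketch of consuming such a connection from a task, assuming a hypothetical connection ID `mssql_default` and the `pymssql` driver listed above:

```python
import pymssql
from airflow.hooks.base import BaseHook

def fetch_rows(query: str):
    """Run a query against SQL Server using credentials stored in an Airflow connection."""
    # "mssql_default" is a hypothetical connection ID created under Admin -> Connections.
    conn_cfg = BaseHook.get_connection("mssql_default")

    conn = pymssql.connect(
        server=conn_cfg.host,
        port=str(conn_cfg.port or 1433),
        user=conn_cfg.login,
        password=conn_cfg.password,
        database=conn_cfg.schema,
    )
    try:
        cursor = conn.cursor(as_dict=True)
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        conn.close()
```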
## Troubleshooting

Services won't start:

```bash
# Check logs
docker-compose logs

# Restart services
docker-compose restart
```

Permission issues (Linux/macOS):

```bash
# Fix ownership
sudo chown -R $(id -u):$(id -g) ./logs ./plugins
```

Out of memory:
- Increase Docker memory allocation to 4GB+
- Monitor container resource usage
Database connection errors:
- Verify connection settings in Airflow UI
- Check network connectivity
- Validate credentials
Useful health checks:

```bash
# Check all container status
docker-compose ps

# View specific service logs
docker-compose logs [service-name]

# Test Airflow scheduler
docker-compose exec airflow-scheduler airflow scheduler --help
```

To stop the stack:

```bash
docker-compose down
```

To stop and remove everything, including volumes and images:

```bash
# Stop and remove everything
docker-compose down --volumes --rmi all

# Remove project directory (if needed)
# rm -rf /path/to/mlops-airflow
```

To reset the environment and start fresh:

```bash
# Clean up
docker-compose down --volumes --remove-orphans

# Remove images
docker-compose down --rmi all

# Start fresh
docker-compose up airflow-init
docker-compose up -d
```

## Contributing

- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature`
- Commit changes: `git commit -am 'Add your feature'`
- Push to branch: `git push origin feature/your-feature`
- Submit a Pull Request
## Resources

- [Apache Airflow Documentation](https://airflow.apache.org/docs/)
- [MLflow Documentation](https://mlflow.org/docs/latest/)
- [Docker Compose Documentation](https://docs.docker.com/compose/)
- [Papermill Documentation](https://papermill.readthedocs.io/)
Note: This setup is optimized for development and testing. For production deployment, additional security configurations and resource optimizations are recommended.