This repository contains a robust Apache Airflow pipeline for processing weather data. The project provides an end-to-end data engineering workflow, covering data generation, cleaning, statistical analysis, and visualization. It's fully containerized using Docker, making it easy to set up and run.
- Data Generation: The pipeline automatically creates a synthetic CSV dataset with daily weather metrics like temperature, humidity, and wind speed.
- Data Cleaning: A dedicated task handles missing values (NaNs) to ensure data quality.
- Statistical Summary: It calculates key statistics (mean, median, variance) for each numeric column (a minimal sketch follows this list).
- XCom Integration: The statistical summary is passed between tasks using XComs, demonstrating effective inter-task communication.
- Data Visualization: The project generates and saves trend charts for each weather metric using `matplotlib`.
- Dockerized Environment: The entire project is packaged in Docker, ensuring a consistent and reproducible environment.
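The cleaning and summary steps come down to a few pandas calls. The snippet below is a minimal sketch based on the task descriptions further down, not the repository's exact code; the file names follow the DAG description, while the column handling is kept generic.

```python
# Minimal sketch of the cleaning and summary logic, assuming a CSV named
# "weather.csv" with numeric weather columns (names are not copied from the repo).
import pandas as pd

df = pd.read_csv("weather.csv")

# Data cleaning: drop any row containing a missing value (NaN).
cleaned = df.dropna()
cleaned.to_csv("cleaned_weather.csv", index=False)

# Statistical summary: mean, median, and variance for every numeric column.
summary = {
    column: {
        "mean": cleaned[column].mean(),
        "median": cleaned[column].median(),
        "variance": cleaned[column].var(),
    }
    for column in cleaned.select_dtypes("number").columns
}
print(summary)
```

In the actual pipeline this logic is split across the `clean_weather_data` and `generate_summary` tasks described below.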
- Docker installed on your machine.
- Docker Compose (included with Docker Desktop).
- Clone the Repository

  git clone https://github.com/MayanzaGo/Weather-Data-Pipeline-with-Apache-Airflow.git
  cd Weather-Data-Pipeline-with-Apache-Airflow
- Launch the Pipeline: Build the Docker image and start the Airflow container in the background.

  docker-compose up --build -d
- Access Airflow UI: Open your web browser and navigate to http://localhost:8080. Log in with the default credentials: `airflow` for both the username and password, or look up the generated credentials in Docker_airflow4\airflow\simple_auth_manager_passwords.json.generated.
- Run the DAG
  - Find the `weather_pipeline_dag` in the Airflow UI.
  - Unpause the DAG using the toggle button.
  - Manually trigger a run by clicking the "play" button.
- `Dockerfile`: Defines the project's environment and installs the necessary Python libraries.
- `docker-compose.yml`: Configures the Docker container, exposing port 8080 and mounting the local `airflow` directory.
- `airflow/`: The main directory for Airflow files.
  - `dags/`: Contains the `weather_pipeline_dag.py` file.
  - `include/`: A shared folder for data files (input, output, and charts).
- `generated_weather_data.py`: A helper script that generates the initial CSV data with some missing values (a minimal sketch of this kind of generator follows below).
- `weather_pipeline_dag.py`: The core Airflow DAG file that defines the workflow.
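For reference, a generator script like `generated_weather_data.py` typically needs only numpy and pandas. The sketch below illustrates the idea rather than reproducing the script itself; the column names, value ranges, and the number of injected NaNs are assumptions.

```python
# Illustrative generator for a daily weather CSV with a few deliberate NaNs.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
days = pd.date_range("2024-01-01", periods=30, freq="D")

df = pd.DataFrame({
    "date": days,
    "temperature": rng.normal(20, 5, size=len(days)).round(1),  # degrees Celsius
    "humidity": rng.uniform(30, 90, size=len(days)).round(1),   # percent
    "wind_speed": rng.uniform(0, 15, size=len(days)).round(1),  # m/s
})

# Inject a few missing values so the cleaning task has something to remove.
for i, col in enumerate(["temperature", "humidity", "wind_speed"]):
    df.loc[df.sample(3, random_state=i).index, col] = np.nan

df.to_csv("weather.csv", index=False)
```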
The `weather_pipeline_dag` consists of four sequential tasks:

- `clean_weather_data`: Reads `weather.csv`, removes rows with missing data, and saves the cleaned file as `cleaned_weather.csv`.
- `generate_summary`: Computes statistical summaries (mean, median, etc.) for each numeric column and pushes this data to XCom.
- `consume_summary`: Pulls the summary data from XCom and prints it to the task logs, demonstrating how to retrieve data from a previous task.
- `generate_charts`: Creates and saves a trend plot for each weather metric using `matplotlib`.
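To make the task wiring and the XCom hand-off concrete, here is a condensed sketch using Airflow's TaskFlow API. It mirrors the flow above but is not the repository's `weather_pipeline_dag.py`; the `/opt/airflow/include` path, the schedule, and the chart file names are assumptions.

```python
# Condensed sketch of a four-task weather DAG; paths and schedule are assumed.
from datetime import datetime

import pandas as pd
from airflow.decorators import dag, task

DATA_DIR = "/opt/airflow/include"  # assumed mount point for the shared include/ folder


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def weather_pipeline_sketch():

    @task
    def clean_weather_data() -> str:
        df = pd.read_csv(f"{DATA_DIR}/weather.csv")
        cleaned_path = f"{DATA_DIR}/cleaned_weather.csv"
        df.dropna().to_csv(cleaned_path, index=False)
        return cleaned_path  # the return value is pushed to XCom automatically

    @task
    def generate_summary(cleaned_path: str) -> dict:
        df = pd.read_csv(cleaned_path)
        numeric = df.select_dtypes("number")
        # The dict returned here becomes the XCom payload for downstream tasks.
        return {
            col: {
                "mean": float(numeric[col].mean()),
                "median": float(numeric[col].median()),
                "variance": float(numeric[col].var()),
            }
            for col in numeric.columns
        }

    @task
    def consume_summary(summary: dict) -> None:
        # Pulling from XCom is implicit: Airflow resolves the argument at runtime.
        for col, stats in summary.items():
            print(col, stats)

    @task
    def generate_charts(cleaned_path: str) -> None:
        import matplotlib
        matplotlib.use("Agg")  # headless backend inside the container
        import matplotlib.pyplot as plt

        df = pd.read_csv(cleaned_path)
        for col in df.select_dtypes("number").columns:
            plt.figure()
            df[col].plot(title=f"{col} trend")
            plt.savefig(f"{DATA_DIR}/{col}_trend.png")
            plt.close()

    cleaned = clean_weather_data()
    summary = generate_summary(cleaned)
    consumed = consume_summary(summary)
    charts = generate_charts(cleaned)
    consumed >> charts  # keep the four tasks strictly sequential


weather_pipeline_sketch()
```

With TaskFlow, each task's return value is stored in XCom automatically, which is what lets `generate_summary` hand its dictionary to `consume_summary`; the repository's DAG may instead use `PythonOperator` with explicit `xcom_push`/`xcom_pull` calls.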
- Containerizing Airflow with custom Python dependencies
- Data generation, cleaning, and preprocessing using Pandas
- Scheduling and orchestrating tasks in Airflow DAGs
- Producing visual data summaries with Matplotlib
- Managing project structure for reproducible ETL pipelines
Created by Gael Mayanza ouamba