A hands-on project demonstrating a complete ETL (Extract, Transform, Load) pipeline for sales data using Python and pandas.
The repository is designed for practicing data engineering workflows, Git collaboration, and real-world data cleaning tasks.
- Build a reproducible ETL pipeline for messy sales data.
- Practice using Git and GitHub for collaborative workflows (branching, merging, resolving conflicts).
- Showcase data engineering skills with professional project documentation.
graph TD
A[etl-sales-data] --> B[data]
B --> B1[raw_sales.csv]
A --> C[etl]
C --> C1[etl_pipeline.py]
A --> D[README.md]
- Python 3: programming language for ETL logic
- pandas: data manipulation and cleaning
- CSV: raw data source and output format
- Git: version control system for collaboration
- GitHub: remote repository for hosting and teamwork
- Load raw sales data from `data/raw_sales.csv`.
- The dataset includes fields such as:
  - Transaction ID
  - Date
  - Product & Category
  - Quantity & Price (with missing values)
  - Customer ID
  - Region
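
The extract step could look roughly like the sketch below. The sample rows and exact column headers are assumptions made up for illustration (the real `data/raw_sales.csv` may be laid out differently), and `extract` is a hypothetical helper name, not necessarily what `etl_pipeline.py` uses.

```python
import io

import pandas as pd

# Hypothetical sample that mimics the fields listed above; the real file may differ.
SAMPLE_CSV = """Transaction ID,Date,Product,Category,Quantity,Price,Customer ID,Region
T001,2024-01-05,Laptop,Electronics,2,850.0,C101,North
T002,2024-01-06,Mouse,Electronics,5,,C102,South
T003,2024-01-07,Desk,Furniture,,120.0,C103,East
"""


def extract(source):
    """Read raw sales data from a file path or file-like object into a DataFrame."""
    return pd.read_csv(source)


# In the project this would be extract("data/raw_sales.csv").
raw = extract(io.StringIO(SAMPLE_CSV))

# Quick inspection: row/column counts and missing values per column.
print(raw.shape)
print(raw.isna().sum())
```

Checking `isna().sum()` right after loading is a cheap way to confirm which columns actually need cleaning before any transformation runs.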
- Standardize column names (lowercase, underscores instead of spaces).
- Handle missing values:
  - Replace missing `Price` values with the mean of the non-missing prices.
  - Replace missing
- Add derived fields:
  - `total_sales = quantity * price`
- Ensure clean, consistent output ready for analysis.
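
A minimal sketch of the transform steps above, assuming the raw headers look like `Transaction ID`, `Quantity`, and `Price`. The function names `standardize_columns` and `transform` are hypothetical. Only the price rule is implemented, because the rule for other missing fields is not spelled out here.

```python
import pandas as pd


def standardize_columns(df):
    """Return a copy with lowercase column names and underscores instead of spaces."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    )
    return out


def transform(df):
    """Standardize the schema, fill missing prices, and add total_sales."""
    out = standardize_columns(df)
    # Mean imputation: Series.mean() skips NaN by default, so only observed
    # prices contribute to the fill value.
    out["price"] = out["price"].fillna(out["price"].mean())
    # Derived field; rows with a missing quantity stay NaN here.
    out["total_sales"] = out["quantity"] * out["price"]
    return out


# Hypothetical input rows for illustration only.
raw = pd.DataFrame(
    {
        "Transaction ID": ["T001", "T002", "T003"],
        "Quantity": [2, 5, 1],
        "Price": [100.0, None, 50.0],
    }
)
clean = transform(raw)
print(clean)
```

Working on a copy keeps the raw frame untouched, which makes it easier to compare before and after while debugging.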
- Save the cleaned dataset as `data/clean_sales.csv`.
- Future extension: load into SQLite or PostgreSQL for analytics.
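
The load step might be sketched as follows. `load_csv` and `load_sqlite` are hypothetical helper names, the table name `sales` is a placeholder, and the SQLite part only illustrates the possible future extension; it is not something the project currently does. A temporary directory stands in for the repository's `data/` folder so the example can run anywhere.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd


def load_csv(df, output_path):
    """Write the cleaned frame to CSV, creating the parent folder if needed."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(output_path, index=False)  # index=False keeps the row index out of the file
    return output_path


def load_sqlite(df, connection, table="sales"):
    """Possible future extension: write the frame to a SQLite table (placeholder name)."""
    df.to_sql(table, connection, if_exists="replace", index=False)


# Hypothetical cleaned rows for illustration only.
clean = pd.DataFrame(
    {
        "transaction_id": ["T001", "T002"],
        "quantity": [2, 1],
        "price": [100.0, 50.0],
        "total_sales": [200.0, 50.0],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    written = load_csv(clean, Path(tmp) / "data" / "clean_sales.csv")
    round_trip = pd.read_csv(written)  # read back to confirm the file is usable

conn = sqlite3.connect(":memory:")
load_sqlite(clean, conn)
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
conn.close()
```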
- Python 3.8 or later
- Git installed
- `pandas` library installed
git clone https://github.com/DevMuzee/etl-sales-data.git
cd etl-sales-data

graph TD
A[Raw CSV Data] --> B[Extract]
B --> C[Transform]
C -->|Clean column names| C1[Standardize schema]
C -->|Handle missing values| C2[Fill NA in Price]
C -->|Create derived columns| C3[total_sales = qty * price]
C --> D[Load]
D --> E[Clean CSV / Database]
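
The flow above can be read as a single function that chains the three stages. This is a sketch under assumptions, not the contents of `etl_pipeline.py`: `run_pipeline` is a hypothetical name, and in-memory buffers replace `data/raw_sales.csv` and `data/clean_sales.csv` so the example is self-contained.

```python
import io

import pandas as pd


def run_pipeline(source, destination):
    """Run extract, transform, and load in order and return the cleaned frame."""
    # Extract: read the raw CSV.
    df = pd.read_csv(source)

    # Transform: standardize schema, fill missing prices, derive total_sales.
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    df["price"] = df["price"].fillna(df["price"].mean())
    df["total_sales"] = df["quantity"] * df["price"]

    # Load: write the cleaned data (a path or any writable text buffer works).
    df.to_csv(destination, index=False)
    return df


# Hypothetical raw input for illustration only.
raw_csv = io.StringIO(
    "Transaction ID,Quantity,Price\n"
    "T001,2,10.0\n"
    "T002,3,\n"
    "T003,1,20.0\n"
)
output = io.StringIO()
result = run_pipeline(raw_csv, output)
print(output.getvalue())
```

In the repository layout, the call would presumably be `run_pipeline("data/raw_sales.csv", "data/clean_sales.csv")`.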
- Add unit tests for transformations
- Load cleaned data into SQLite/PostgreSQL
- Schedule pipeline runs using Apache Airflow or Prefect
- Extend project with data visualization (Matplotlib/Seaborn)
- Integrate CI/CD pipelines for automated testing
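
As a starting point for the unit-test item, the sketch below tests one transformation in isolation. `add_total_sales` is a hypothetical helper written for this example. The test functions use plain `assert` statements so they can be collected by pytest or called directly.

```python
import pandas as pd


def add_total_sales(df):
    """Return a copy of df with total_sales = quantity * price."""
    out = df.copy()
    out["total_sales"] = out["quantity"] * out["price"]
    return out


def test_total_sales_is_quantity_times_price():
    df = pd.DataFrame({"quantity": [2, 3], "price": [10.0, 1.5]})
    result = add_total_sales(df)
    assert result["total_sales"].tolist() == [20.0, 4.5]


def test_input_frame_is_not_mutated():
    df = pd.DataFrame({"quantity": [1], "price": [5.0]})
    add_total_sales(df)
    assert "total_sales" not in df.columns


if __name__ == "__main__":
    # Allows running the file directly without a test runner.
    test_total_sales_is_quantity_times_price()
    test_input_frame_is_not_mutated()
    print("all tests passed")
```

Small, pure functions like this are easy to cover with tests and later to run automatically in a CI workflow.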
This project is for educational purposes only. The dataset provided is synthetic and does not represent real sales data.
### Acknowledgements
- Inspired by real-world data engineering pipelines
- Designed as a learning project for Git collaboration and ETL practice