This repository contains the solution for a technical case study, organized by exercises and deliverables.
View Technical Documentation on Notion
A detailed description of the technology stack that enables this solution, including project context, data schema, and modeling decisions.
hotmart/
├── data/
│   ├── raw/                               # Raw input data (CSV files)
│   │   ├── product_item.csv
│   │   ├── purchase.csv
│   │   └── purchase_extra_info.csv
│   └── curated/                           # Processed data (Parquet files)
│       └── gmv_daily_snapshot/            # Final dataset generated by ETL
├── notebooks/
│   └── show_gmv_daily_snapshot.ipynb      # Example of populated final dataset
├── sql/
│   ├── ddl/                               # Data Definition Language scripts
│   │   └── create_gmv_daily_snapshot.sql
│   └── queries/                           # SQL queries
│       ├── top_producers_products.sql
│       └── daily_gmv_subsidiary.sql
├── src/
│   └── etl.py                             # ETL script
├── requirements.txt                       # Python dependencies
└── README.md
- File: sql/queries/top_producers_products.sql - Top 50 producers by revenue (2021) and the top 2 products per producer
- ETL Script: src/etl.py - PySpark script that processes the raw CSV files and generates the GMV Daily Snapshot dataset
- DDL Script: sql/ddl/create_gmv_daily_snapshot.sql - Table structure definition for GMV Daily Snapshot
- Populated Dataset Example: notebooks/show_gmv_daily_snapshot.ipynb - Jupyter Notebook showing the final dataset with sample data
- SQL Queries: sql/queries/daily_gmv_subsidiary.sql - Three queries on the final GMV Daily Snapshot dataset
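
The ranking described for sql/queries/top_producers_products.sql (top 50 producers by revenue, then the top 2 products for each) can be illustrated outside SQL. The following is a minimal plain-Python sketch of that idea, not the query shipped in the repository. The function name and the field names `producer_id`, `product_id`, and `revenue` are hypothetical, and the 2021 date filter is assumed to have been applied before the rows reach the function.

```python
from collections import defaultdict


def top_producers_with_products(rows, n_producers=50, n_products=2):
    """Rank producers by total revenue and keep each one's top products.

    `rows` is an iterable of (producer_id, product_id, revenue) tuples.
    These field names are hypothetical, not taken from the repository.
    """
    # Sum revenue per (producer, product) pair.
    by_product = defaultdict(float)
    for producer_id, product_id, revenue in rows:
        by_product[(producer_id, product_id)] += revenue

    # Roll product totals up to producer totals.
    by_producer = defaultdict(float)
    for (producer_id, _), revenue in by_product.items():
        by_producer[producer_id] += revenue

    # Highest-revenue producers first, cut off at n_producers.
    ranked = sorted(by_producer.items(), key=lambda kv: kv[1], reverse=True)
    ranked = ranked[:n_producers]

    result = []
    for producer_id, total in ranked:
        # Products of this producer, highest revenue first.
        products = sorted(
            ((prod, rev) for (p, prod), rev in by_product.items() if p == producer_id),
            key=lambda kv: kv[1],
            reverse=True,
        )[:n_products]
        result.append((producer_id, total, [prod for prod, _ in products]))
    return result


# Made-up sample rows, for illustration only.
sample = [
    ("producer_a", "course_1", 100.0),
    ("producer_a", "course_2", 300.0),
    ("producer_a", "course_3", 50.0),
    ("producer_b", "ebook_1", 500.0),
]
print(top_producers_with_products(sample, n_producers=2, n_products=2))
```

In SQL the same shape is usually expressed with an aggregate plus a window function such as `ROW_NUMBER()` partitioned by producer; whether the repository's query does so is not stated here.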
- Raw Data (data/raw/): Contains three CSV files with source data: product_item.csv, purchase.csv, and purchase_extra_info.csv.
- ETL Process (src/etl.py): Processes the raw CSV files and applies transformations to create the GMV Daily Snapshot.
- Curated Data (data/curated/gmv_daily_snapshot/): Contains Parquet files generated by the ETL script, representing the final processed dataset.
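
To make the raw-to-curated flow above concrete, here is a small standard-library sketch of the core aggregation a daily GMV dataset implies: summing purchase values per day and per subsidiary. It is not the PySpark code in src/etl.py. It assumes GMV is a plain sum of purchase values, and the keys `order_date`, `subsidiary`, and `value` are hypothetical; the real columns are defined by the raw CSVs and sql/ddl/create_gmv_daily_snapshot.sql, and the actual ETL may apply filters or snapshot logic not shown here.

```python
from collections import defaultdict


def daily_gmv(purchases):
    """Sum purchase values per (day, subsidiary).

    Each purchase is a dict with the hypothetical keys `order_date`,
    `subsidiary`, and `value`; they are placeholders, not the real schema.
    """
    totals = defaultdict(float)
    for purchase in purchases:
        key = (purchase["order_date"], purchase["subsidiary"])
        totals[key] += purchase["value"]
    # Sort so the output order is deterministic.
    return sorted((day, sub, gmv) for (day, sub), gmv in totals.items())


# Made-up sample rows, for illustration only.
sample_purchases = [
    {"order_date": "2021-01-01", "subsidiary": "national", "value": 100.0},
    {"order_date": "2021-01-01", "subsidiary": "national", "value": 50.0},
    {"order_date": "2021-01-01", "subsidiary": "international", "value": 20.0},
    {"order_date": "2021-01-02", "subsidiary": "national", "value": 70.0},
]
for row in daily_gmv(sample_purchases):
    print(row)
```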
- Python 3.x
- PySpark
- Jupyter Notebook (for viewing the example notebook)
- Install dependencies:
  pip install -r requirements.txt
- Run the ETL script to generate the curated dataset:
  python src/etl.py
- View the populated dataset example:
  jupyter notebook notebooks/show_gmv_daily_snapshot.ipynb

requirements.txt: Contains all Python dependencies needed to run the project, including PySpark, pandas, and Jupyter-related packages.
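
After running the ETL, a quick way to confirm that it produced output is to look for Parquet files in the curated directory. The helper below is a hypothetical convenience, not part of the repository. It assumes only the output path listed in the project structure and searches recursively, since Spark may write part files inside partition subdirectories.

```python
from pathlib import Path


def list_parquet_files(dataset_dir):
    """Return the Parquet files found under dataset_dir, sorted by path."""
    root = Path(dataset_dir)
    if not root.is_dir():
        return []
    # Search recursively in case the files sit in partition subdirectories.
    return sorted(root.rglob("*.parquet"))


if __name__ == "__main__":
    files = list_parquet_files("data/curated/gmv_daily_snapshot")
    if files:
        print(f"Found {len(files)} Parquet file(s)")
    else:
        print("No Parquet files found; run `python src/etl.py` first")
```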