Hotmart Technical Case

This repository contains the solution for a technical case study, organized by exercises and deliverables.

Tech Stack Documentation

View Technical Documentation on Notion

The Notion page describes the technology stack behind this solution in detail, including project context, data schema, and modeling decisions.

Project Structure

hotmart/
├── data/
│   ├── raw/                              # Raw input data (CSV files)
│   │   ├── product_item.csv
│   │   ├── purchase.csv
│   │   └── purchase_extra_info.csv
│   └── curated/                          # Processed data (Parquet files)
│       └── gmv_daily_snapshot/           # Final dataset generated by ETL
├── notebooks/
│   └── show_gmv_daily_snapshot.ipynb     # Example of populated final dataset
├── sql/
│   ├── ddl/                              # Data Definition Language scripts
│   │   └── create_gmv_daily_snapshot.sql
│   └── queries/                          # SQL queries
│       ├── top_producers_products.sql
│       └── daily_gmv_subsidiary.sql
├── src/
│   └── etl.py                            # ETL script
├── requirements.txt                      # Python dependencies
└── README.md

Exercises and Deliverables

Exercise 1: SQL Queries

File: sql/queries/top_producers_products.sql - Top 50 producers by revenue in 2021 and the top 2 products per producer
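
As a rough illustration of the two query shapes this exercise asks for, here is a minimal sketch that runs against an in-memory SQLite database. The table layout and every column name (producer_id, release_date, product_id, purchase_value) are assumptions made for the example and may not match the real CSV schema. The queries in sql/queries/top_producers_products.sql may also be written differently, for instance in how they filter by year.

```python
import sqlite3

# Hypothetical schema: the real columns in purchase.csv and
# product_item.csv may differ from the ones assumed here.
SCHEMA = """
CREATE TABLE purchase (
    purchase_id  INTEGER PRIMARY KEY,
    producer_id  INTEGER,
    release_date TEXT
);
CREATE TABLE product_item (
    purchase_id    INTEGER,
    product_id     INTEGER,
    purchase_value REAL
);
"""

# Top 50 producers by revenue, restricted to purchases released in 2021.
TOP_PRODUCERS = """
SELECT p.producer_id, SUM(i.purchase_value) AS revenue
FROM purchase AS p
JOIN product_item AS i ON i.purchase_id = p.purchase_id
WHERE p.release_date >= '2021-01-01' AND p.release_date < '2022-01-01'
GROUP BY p.producer_id
ORDER BY revenue DESC, p.producer_id
LIMIT 50
"""

# Top 2 products per producer, ranked with a window function.
# Assumption: no year filter is applied to this second query.
TOP_PRODUCTS = """
WITH product_revenue AS (
    SELECT p.producer_id, i.product_id, SUM(i.purchase_value) AS revenue
    FROM purchase AS p
    JOIN product_item AS i ON i.purchase_id = p.purchase_id
    GROUP BY p.producer_id, i.product_id
),
ranked AS (
    SELECT producer_id, product_id, revenue,
           ROW_NUMBER() OVER (
               PARTITION BY producer_id
               ORDER BY revenue DESC, product_id
           ) AS rn
    FROM product_revenue
)
SELECT producer_id, product_id, revenue
FROM ranked
WHERE rn <= 2
ORDER BY producer_id, rn
"""


def run_demo():
    """Load a few made-up rows and return the results of both queries."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO purchase VALUES (?, ?, ?)", [
        (1, 10, "2021-03-01"),
        (2, 10, "2021-07-15"),
        (3, 20, "2021-05-20"),
        (4, 20, "2020-12-31"),  # outside 2021, ignored by TOP_PRODUCERS
        (5, 10, "2021-08-01"),
    ])
    conn.executemany("INSERT INTO product_item VALUES (?, ?, ?)", [
        (1, 100, 50.0),
        (2, 101, 30.0),
        (3, 200, 40.0),
        (4, 201, 500.0),
        (5, 102, 10.0),
    ])
    producers = conn.execute(TOP_PRODUCERS).fetchall()
    products = conn.execute(TOP_PRODUCTS).fetchall()
    conn.close()
    return producers, products
```

Calling run_demo() returns two lists of tuples: (producer_id, revenue) for the first query and (producer_id, product_id, revenue) for the second. The secondary sort keys (producer_id, product_id) are only there to make ties deterministic in the sketch.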

Exercise 2: Data Modeling and Development

ETL Script: src/etl.py - PySpark script that processes raw CSV files and generates the GMV Daily Snapshot dataset
DDL Script: sql/ddl/create_gmv_daily_snapshot.sql - Table structure definition for GMV Daily Snapshot
Populated Dataset Example: notebooks/show_gmv_daily_snapshot.ipynb - Jupyter Notebook showing the final dataset with sample data
SQL Queries: sql/queries/daily_gmv_subsidiary.sql - Three queries on the final GMV Daily Snapshot dataset

Data Flow

Raw Data (data/raw/): Contains three CSV files with source data:
- product_item.csv
- purchase.csv
- purchase_extra_info.csv
ETL Process (src/etl.py): Processes the raw CSV files and applies transformations to create the GMV Daily Snapshot.
Curated Data (data/curated/gmv_daily_snapshot/): Contains Parquet files generated by the ETL script, representing the final processed dataset.
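
The flow above can be sketched in a few lines. This example deliberately uses only the Python standard library instead of PySpark, and it returns a dictionary rather than writing Parquet, so it shows the general shape of a join-and-aggregate step and not the logic in src/etl.py. All column names (purchase_id, order_date, release_date, purchase_value, subsidiary) and the rule that only released purchases count toward GMV are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def daily_gmv(purchases, items, extras):
    """Aggregate purchase value per (order_date, subsidiary).

    Hypothetical columns:
      purchases: purchase_id, order_date, release_date
      items:     purchase_id, purchase_value
      extras:    purchase_id, subsidiary
    """
    # Sum item values per purchase.
    value_by_purchase = defaultdict(float)
    for row in items:
        value_by_purchase[row["purchase_id"]] += float(row["purchase_value"])

    # Look up the subsidiary for each purchase.
    subsidiary_by_purchase = {
        row["purchase_id"]: row["subsidiary"] for row in extras
    }

    totals = defaultdict(float)
    for row in purchases:
        # Assumption: a purchase with no release date is not counted.
        if not row["release_date"]:
            continue
        pid = row["purchase_id"]
        key = (row["order_date"], subsidiary_by_purchase.get(pid, "unknown"))
        totals[key] += value_by_purchase.get(pid, 0.0)
    return dict(sorted(totals.items()))


# Made-up sample data standing in for the three raw CSV files.
PURCHASES = """purchase_id,order_date,release_date
1,2023-01-20,2023-01-21
2,2023-01-20,
3,2023-01-21,2023-01-22
4,2023-01-20,2023-01-20
"""
ITEMS = """purchase_id,purchase_value
1,100.0
2,50.0
3,25.5
4,10.0
"""
EXTRAS = """purchase_id,subsidiary
1,national
2,national
3,international
4,national
"""

snapshot = daily_gmv(read_rows(PURCHASES), read_rows(ITEMS), read_rows(EXTRAS))
```

With the sample rows, purchase 2 is skipped because its release_date is empty, so snapshot holds one total for 2023-01-20 (national) and one for 2023-01-21 (international).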

Setup

Prerequisites

Python 3.x
PySpark
Jupyter Notebook (for viewing the example notebook)

Installation

Install dependencies:

pip install -r requirements.txt

Run the ETL script to generate the curated dataset:

python src/etl.py

View the populated dataset example:

jupyter notebook notebooks/show_gmv_daily_snapshot.ipynb
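
Before running the ETL script, it can help to confirm that the three raw CSV files are where the project structure expects them. The helper below is a hypothetical convenience, not part of the repository. It assumes it is called with the repository root as its argument.

```python
from pathlib import Path

# File names taken from the project structure described in this README.
RAW_FILES = ("product_item.csv", "purchase.csv", "purchase_extra_info.csv")


def missing_raw_files(repo_root="."):
    """Return the expected raw CSV files that are absent from data/raw/."""
    raw_dir = Path(repo_root) / "data" / "raw"
    return [name for name in RAW_FILES if not (raw_dir / name).is_file()]
```

An empty list means all inputs are present. Any names returned are files that need to be placed in data/raw/ before running python src/etl.py.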

Additional Files

requirements.txt: Contains all Python dependencies needed to run the project, including PySpark, pandas, and Jupyter-related packages.


amaralnt/hotmart-case
