amaralnt/hotmart-case

Hotmart Technical Case

This repository contains the solution for a technical case study, organized by exercises and deliverables.

Tech Stack Documentation

View Technical Documentation on Notion

A detailed description of the technology stack that enables this solution, including project context, data schema, and modeling decisions.

Project Structure

hotmart/
├── data/
│   ├── raw/                              # Raw input data (CSV files)
│   │   ├── product_item.csv
│   │   ├── purchase.csv
│   │   └── purchase_extra_info.csv
│   └── curated/                          # Processed data (Parquet files)
│       └── gmv_daily_snapshot/           # Final dataset generated by ETL
├── notebooks/
│   └── show_gmv_daily_snapshot.ipynb     # Example of populated final dataset
├── sql/
│   ├── ddl/                              # Data Definition Language scripts
│   │   └── create_gmv_daily_snapshot.sql
│   └── queries/                          # SQL queries
│       ├── top_producers_products.sql
│       └── daily_gmv_subsidiary.sql
├── src/
│   └── etl.py                            # ETL script
├── requirements.txt                      # Python dependencies
└── README.md

Exercises and Deliverables

Exercise 1: SQL Queries

  • File: sql/queries/top_producers_products.sql - Top 50 producers by revenue (2021) and the top 2 products per producer
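
The query file itself lives in sql/queries/, but the "top-N per group" logic it relies on can be sketched in plain Python. The producer/product/revenue rows below are made up for illustration and do not reflect the repository's actual schema:

```python
from collections import defaultdict
import heapq

# Hypothetical (producer, product, revenue) rows -- the real query
# computes these from the purchase and product_item tables in SQL.
rows = [
    ("alice", "course-a", 500.0),
    ("alice", "course-b", 300.0),
    ("alice", "ebook-c", 100.0),
    ("bob", "course-d", 900.0),
]

# Total revenue per (producer, product).
revenue = defaultdict(lambda: defaultdict(float))
for producer, product, value in rows:
    revenue[producer][product] += value

# Top 2 products per producer, ranked by revenue -- the same shape a
# SQL window function (ROW_NUMBER() OVER (PARTITION BY producer
# ORDER BY revenue DESC)) would produce.
top2 = {
    producer: heapq.nlargest(2, products.items(), key=lambda kv: kv[1])
    for producer, products in revenue.items()
}
print(top2)
```

In SQL this is typically done with a ranking window function plus a filter on the rank; the Python version above is just the same grouping and ranking spelled out imperatively.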

Exercise 2: Data Modeling and Development

  • ETL Script: src/etl.py - PySpark script that processes raw CSV files and generates the GMV Daily Snapshot dataset
  • DDL Script: sql/ddl/create_gmv_daily_snapshot.sql - Table structure definition for GMV Daily Snapshot
  • Populated Dataset Example: notebooks/show_gmv_daily_snapshot.ipynb - Jupyter Notebook showing the final dataset with sample data
  • SQL Queries: sql/queries/daily_gmv_subsidiary.sql - Three queries on the final GMV Daily Snapshot dataset

Data Flow

  1. Raw Data (data/raw/): Contains three CSV files with source data:

    • product_item.csv
    • purchase.csv
    • purchase_extra_info.csv
  2. ETL Process (src/etl.py): Processes the raw CSV files and applies transformations to create the GMV Daily Snapshot.

  3. Curated Data (data/curated/gmv_daily_snapshot/): Contains Parquet files generated by the ETL script, representing the final processed dataset.
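
The actual ETL in src/etl.py is PySpark, but the core join-and-aggregate step can be sketched with the standard library alone. The column names below (purchase_id, purchase_date, value, subsidiary) are assumptions for illustration, not the repository's actual schema:

```python
import csv
import io
from collections import defaultdict

# Inline stand-ins for data/raw/purchase.csv and
# data/raw/purchase_extra_info.csv -- column names are assumed.
purchase_csv = """purchase_id,purchase_date,value
1,2021-01-01,100.0
2,2021-01-01,50.0
3,2021-01-02,70.0
"""
extra_info_csv = """purchase_id,subsidiary
1,BR
2,US
3,BR
"""

def daily_gmv(purchase_file, extra_file):
    """Join purchases to their extra info on purchase_id, then sum
    value per (purchase_date, subsidiary) -- the GMV daily snapshot idea."""
    subsidiary_of = {
        row["purchase_id"]: row["subsidiary"]
        for row in csv.DictReader(extra_file)
    }
    gmv = defaultdict(float)
    for row in csv.DictReader(purchase_file):
        key = (row["purchase_date"], subsidiary_of[row["purchase_id"]])
        gmv[key] += float(row["value"])
    return dict(gmv)

snapshot = daily_gmv(io.StringIO(purchase_csv), io.StringIO(extra_info_csv))
print(snapshot)
```

The PySpark version would express the same flow as a join followed by a groupBy/sum, writing the result to data/curated/gmv_daily_snapshot/ as Parquet.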

Setup

Prerequisites

  • Python 3.x
  • PySpark
  • Jupyter Notebook (for viewing the example notebook)

Installation

  1. Install dependencies:

     pip install -r requirements.txt

  2. Run the ETL script to generate the curated dataset:

     python src/etl.py

  3. View the populated dataset example:

     jupyter notebook notebooks/show_gmv_daily_snapshot.ipynb

Additional Files

  • requirements.txt: Contains all Python dependencies needed to run the project, including PySpark, pandas, and Jupyter-related packages.
