Welcome to the Apache Spark Studies & Simulations repository! This project was designed entirely for didactic purposes, focused on teaching the fundamentals of the Spark engine in practice: how it processes data under the hood, and how to apply the main optimization techniques.
If you want to move away from theory and see how to deal with Data Skew, analyze Query Plans, and solve Out Of Memory (OOM) problems, you are in the right place!
To make the simulations as close to a production reality as possible, we use real and voluminous financial data.
- Database: Lending Club Loans
- Source: Kaggle - All Lending Club Loan Data
- Storage Path: `data/financial/lending_club/`
- Main Files:
  - `accepted_2007_to_2018Q4.csv`: Accepted loan data, totaling ~151 columns (e.g., `loan_amnt`, `term`, `int_rate`, `grade`, `home_ownership`, `annual_inc`, `loan_status`, etc.).
  - `rejected_2007_to_2018Q4.csv`: Rejected loan data (e.g., `Amount Requested`, `Risk_Score`, `Debt-To-Income Ratio`).
- How to get it: The manager script itself (`manage_studies.sh`) will automatically prompt you to download the dataset by executing `scripts/download_financial_data.py`.
Our "track" is divided into Themes. Each theme has a script focused on teaching an isolated concept using our financial dataset:
| Phase | Theme | What will you learn in practice? | Key Production Nuances Covered |
|---|---|---|---|
| 1. The Engine | Theme 01 | Execution Plans: Understanding Logical/Physical Plans, Pushdown Predicates, and how to read an execution tree. | Reading Stack Traces and initial bottlenecks. |
| | Theme 02 | Classic Error Analysis: What happens when there is an AnalysisException, UDF failure, Driver OOM, Executor OOM, or Serialization error. | Runtime troubleshooting. |
| | Theme 03 | DAGs and Shuffles: The practical difference between Narrow and Wide transformations. | Shuffle cost and partition counting. |
| 2. Infra & Basics | Theme 04 | Executor Tuning: Fat (many cores) vs Thin (few cores) executors. How memory and concurrency behave. | JVM Overhead vs Thread Contention. |
| | Theme 05 | Caching Strategies: The impact of performing operations with cache() versus recalculating data from scratch. | Serialization cost and Spill risk to disk. |
| 3. Data Skew | Theme 06 | The Problem and The Cure (AQE): Effects of skewed data causing Stragglers, and how Adaptive Query Execution (AQE) mitigates them. | Real parallelism (Cores) and Broadcast limits. |
| 4. Advanced | Theme 07 | Dynamic Partition Pruning (DPP): How Spark actively prunes partitions at runtime during joins. | Partitioning strategy and the Small Files problem. |
| | Theme 08 | Bucketing: Pre-shuffling data at write time to optimize future heavy reads. | High write cost vs read gain trade-off. |
| | Theme 09 | Salting: The classic, manual technique to resolve Data Skew when AQE is not enough. | Data Explosion trade-offs. |
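The intuition behind Theme 09 (Salting) can be sketched without a cluster. The snippet below is a minimal, plain-Python illustration (the key names, partition count, and salt-bucket count are invented for the demo, and `zlib.crc32` stands in for Spark's hash partitioner): a skewed "hot" key sends all of its rows to a single partition, while appending a random salt suffix splits it into several synthetic keys that spread across partitions.

```python
import random
import zlib
from collections import Counter

random.seed(42)          # deterministic salts for the demo
NUM_PARTITIONS = 8
SALT_BUCKETS = 8         # number of synthetic variants per hot key

# A skewed dataset: one "hot" key dominates (think one huge customer).
rows = [("hot_customer", i) for i in range(10_000)] + \
       [(f"customer_{i}", i) for i in range(100)]

def partition_of(key: str) -> int:
    """Stand-in for Spark's hash partitioner (crc32 keeps it deterministic)."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Without salting: every row of the hot key lands in the SAME partition,
# creating the straggler task discussed in Theme 06.
plain = Counter(partition_of(key) for key, _ in rows)

# With salting: append a random suffix, splitting the hot key into up to
# SALT_BUCKETS distinct keys. (In a real join, the other side must be
# exploded with every suffix so matches are still found.)
salted = Counter(
    partition_of(f"{key}_{random.randrange(SALT_BUCKETS)}")
    for key, _ in rows
)

print("Max partition load without salting:", max(plain.values()))
print("Max partition load with salting:   ", max(salted.values()))
```

The trade-off the table mentions (Data Explosion) shows up in the comment above: the non-skewed side of the join must be duplicated once per salt bucket, so more buckets means better balance but a bigger intermediate dataset.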
To keep the environment standardized, isolated, and to avoid complex local installations, we use containers:
- Docker and Docker Compose: Essential to run our customized Spark cluster (Master and Workers).
- Python 3: Needed locally only if you want to download the data manually via `scripts/download_financial_data.py`.
- Bash: A Bash environment (Linux/macOS/WSL) to run the interactive project manager.
Forget the manual work of managing containers, opening and closing ports. We created a fully interactive Studies Manager (DevOps Script)!
It creates the right containers for the right scenarios (e.g., Thin vs Fat cluster), requests data download, and executes scripts via spark-submit, all through a guided and friendly menu.
```bash
chmod +x manage_studies.sh
./manage_studies.sh
```

- If the data does not exist, the terminal will ask: "Do you want to download it now? (y/n)". Type `y`.
- Next, choose which Theme you want to study (from 1 to 9).
- The script will start the isolated Spark cluster in Docker.
- The script will ask if you want to run the study's Job. Choose `y`.
- Read the didactic warnings printed directly in the terminal by the running job and observe the learnings happen live!
The script itself will provide the exact link depending on the configuration. You can access it in the browser to investigate:
- Master UI (General Status): http://localhost:8080
- Spark History Server (Completed Application Logs): http://localhost:18080 (or ports 18081/18082, depending on whether the cluster is Fat or Thin).
Study Tip: On the History Server, go to the SQL tab to visually see DAGs and Shuffles. Access the Stages tab to identify Stragglers (long green bars that stall execution)!
Happy studying! 🚀