Welcome to the Apache Spark Studies & Simulations repository! This project was designed entirely for didactic purposes, focused on teaching the fundamentals of the Spark engine in practice: how it processes data under the hood, and how to apply the main optimization techniques.
If you want to move away from theory and see how to deal with Data Skew, analyze Query Plans, and solve Out Of Memory (OOM) problems, you are in the right place!
To make the simulations as close to a production reality as possible, we use real and voluminous financial data.
- Database: Lending Club Loans
- Source: Kaggle - All Lending Club Loan Data
- Storage Path: `data/financial/lending_club/`
- Main Files:
  - `accepted_2007_to_2018Q4.csv`: Accepted loan data, totaling ~151 columns (e.g., `loan_amnt`, `term`, `int_rate`, `grade`, `home_ownership`, `annual_inc`, `loan_status`, etc.).
  - `rejected_2007_to_2018Q4.csv`: Rejected loan data (e.g., `Amount Requested`, `Risk_Score`, `Debt-To-Income Ratio`).
- How to get it: The manager script itself (`manage_studies.sh`) will automatically prompt you to download the dataset by executing `scripts/download_financial_data.py`.
Our "track" is divided into Themes. Each theme has a script focused on teaching an isolated concept using our financial dataset:
| Phase | Theme | What will you learn in practice? | Key Production Nuances Covered |
|---|---|---|---|
| 1. The Engine | Theme 01 | Execution Plans: Understanding Logical/Physical Plans, Pushdown Predicates, and how to read an execution tree. | Reading Stack Traces and initial bottlenecks. |
| | Theme 02 | Classic Error Analysis: What happens when there is an AnalysisException, UDF failure, Driver OOM, Executor OOM, or Serialization error. | Runtime troubleshooting. |
| | Theme 03 | DAGs and Shuffles: The practical difference between Narrow and Wide transformations. | Shuffle cost and partition counting. |
| 2. Infra & Basics | Theme 04 | Executor Tuning: Fat (many cores) vs Thin (few cores) executors. How memory and concurrency behave. | JVM Overhead vs Thread Contention. |
| | Theme 05 | Caching Strategies: The impact of performing operations with cache() versus recalculating data from scratch. | Serialization cost and Spill risk to disk. |
| 3. Data Skew | Theme 06 | The Problem and The Cure (AQE): Effects of skewed data causing Stragglers, and how Adaptive Query Execution (AQE) mitigates them. | Real parallelism (Cores) and Broadcast limits. |
| 4. Advanced | Theme 07 | Dynamic Partition Pruning (DPP): How Spark actively prunes partitions at runtime during joins. | Partitioning strategy and the Small Files problem. |
| | Theme 08 | Bucketing: Pre-shuffling data at write time to optimize future heavy reads. | High write cost vs read gain trade-off. |
| | Theme 09 | Salting: The classic, manual technique to resolve Data Skew when AQE is not enough. | Data Explosion trade-offs. |
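The intuition behind Theme 09 (Salting) can be sketched without a cluster. The snippet below is a minimal, plain-Python illustration (the key names, partition count, and salt-bucket count are invented for the demo, and `zlib.crc32` stands in for Spark's hash partitioner): a skewed "hot" key sends all of its rows to a single partition, while appending a random salt suffix splits it into several synthetic keys that spread across partitions.

```python
import random
import zlib
from collections import Counter

random.seed(42)          # deterministic salts for the demo
NUM_PARTITIONS = 8
SALT_BUCKETS = 8         # number of synthetic variants per hot key

# A skewed dataset: one "hot" key dominates (think one huge customer).
rows = [("hot_customer", i) for i in range(10_000)] + \
       [(f"customer_{i}", i) for i in range(100)]

def partition_of(key: str) -> int:
    """Stand-in for Spark's hash partitioner (crc32 keeps it deterministic)."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Without salting: every row of the hot key lands in the SAME partition,
# creating the straggler task discussed in Theme 06.
plain = Counter(partition_of(key) for key, _ in rows)

# With salting: append a random suffix, splitting the hot key into up to
# SALT_BUCKETS distinct keys. (In a real join, the other side must be
# exploded with every suffix so matches are still found.)
salted = Counter(
    partition_of(f"{key}_{random.randrange(SALT_BUCKETS)}")
    for key, _ in rows
)

print("Max partition load without salting:", max(plain.values()))
print("Max partition load with salting:   ", max(salted.values()))
```

The trade-off the table mentions (Data Explosion) shows up in the comment above: the non-skewed side of the join must be duplicated once per salt bucket, so more buckets means better balance but a bigger intermediate dataset.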
To keep the environment standardized, isolated, and to avoid complex local installations, we use containers:
- Docker and Docker Compose: Essential to run our customized Spark cluster (Master and Workers).
- Python 3: Needed locally only if you want to download the data manually via `scripts/download_financial_data.py`.
- Bash: A Bash environment (Linux/macOS/WSL) to run the interactive project manager.
Forget the manual work of managing containers, opening and closing ports. We created a fully interactive Studies Manager (DevOps Script)!
It creates the right containers for the right scenarios (e.g., Thin vs Fat cluster), requests data download, and executes scripts via spark-submit, all through a guided and friendly menu.
```bash
chmod +x manage_studies.sh
./manage_studies.sh
```

- If the data does not exist, the terminal will ask: "Do you want to download it now? (y/n)". Type `y`.
- Next, choose which Theme you want to study (from 1 to 9).
- The script will start the isolated Spark cluster in Docker.
- The script will ask if you want to run the study's Job. Choose `y`.
- Read the didactic warnings printed directly in the terminal by the running job and observe the learnings happen live!
The script itself will provide the exact link depending on the configuration. You can access it in the browser to investigate:
- Master UI (General Status): http://localhost:8080
- Spark History Server (Completed Application Logs): http://localhost:18080 (or ports 18081/18082, depending on whether the cluster is Fat or Thin).
Study Tip: On the History Server, go to the SQL tab to visually see DAGs and Shuffles. Access the Stages tab to identify Stragglers (long green bars that stall execution)!
Happy studying! 🚀