Didactic simulations to master Apache Spark. Learn Query Plans, DAGs, Data Skew resolution, and performance tuning with hands-on exercises and real financial data.

MarcosWinicyus/apache-spark-studies

🚀 Apache Spark Studies & Simulations

Welcome to the Apache Spark Studies & Simulations repository! This project was built entirely for didactic purposes: it teaches, hands-on, the fundamentals of the Spark engine, how it processes data under the hood, and how to apply the main optimization techniques.

If you want to go beyond theory and see how to deal with Data Skew, analyze Query Plans, and solve Out Of Memory (OOM) problems, you are in the right place!


📊 1. Data Used

To make the simulations as close to production reality as possible, we use a real, sizable financial dataset.

  • Database: Lending Club Loans
  • Source: Kaggle - All Lending Club Loan Data
  • Storage Path: data/financial/lending_club/
  • Main Files:
    • accepted_2007_to_2018Q4.csv: Contains accepted loan data, totaling ~151 columns (e.g., loan_amnt, term, int_rate, grade, home_ownership, annual_inc, loan_status, etc.).
    • rejected_2007_to_2018Q4.csv: Rejected loan data (e.g., Amount Requested, Risk_Score, Debt-To-Income Ratio).
  • How to get it: The manager script (manage_studies.sh) will prompt you to download the dataset and fetch it automatically by running scripts/download_financial_data.py.

🧠 2. Exercises: What You Will Learn

Our "track" is divided into Themes. Each theme has a script focused on teaching an isolated concept using our financial dataset:

| Phase | Theme | What will you learn in practice? | Key Production Nuances Covered |
|---|---|---|---|
| 1. The Engine | Theme 01 | Execution Plans: understanding Logical/Physical Plans, Pushdown Predicates, and how to read an execution tree. | Reading Stack Traces and initial bottlenecks. |
| 1. The Engine | Theme 02 | Classic Error Analysis: what happens on an AnalysisException, UDF failure, Driver OOM, Executor OOM, or Serialization error. | Runtime troubleshooting. |
| 1. The Engine | Theme 03 | DAGs and Shuffles: the practical difference between Narrow and Wide transformations. | Shuffle cost and partition counting. |
| 2. Infra & Basics | Theme 04 | Executor Tuning: Fat (many cores) vs Thin (few cores) executors; how memory and concurrency behave. | JVM Overhead vs Thread Contention. |
| 2. Infra & Basics | Theme 05 | Caching Strategies: the impact of operations with cache() versus recalculating data from scratch. | Serialization cost and Spill risk to disk. |
| 3. Data Skew | Theme 06 | The Problem and the Cure (AQE): how skewed data causes Stragglers and how Adaptive Query Execution (AQE) mitigates it. | Real parallelism (Cores) and Broadcast limits. |
| 4. Advanced | Theme 07 | Dynamic Partition Pruning (DPP): how Spark actively prunes partitions at runtime between joins. | Partitioning strategy and the Small Files problem. |
| 4. Advanced | Theme 08 | Bucketing: pre-shuffling written data to optimize future heavy reads. | High write cost vs read gain trade-off. |
| 4. Advanced | Theme 09 | Salting: the classic, manual technique to resolve Data Skew when AQE is not enough. | Data Explosion trade-offs. |

⚙️ 3. Requirements to Run

To keep the environment standardized, isolated, and to avoid complex local installations, we use containers:

  1. Docker and Docker Compose: Essential to run our customized Spark cluster (Master and Workers).
  2. Python 3: Needed locally only if you want to download the data manually via our script scripts/download_financial_data.py.
  3. Bash: A Bash-compatible shell (Linux/macOS/WSL) to run the interactive project manager.

▶️ 4. How to Run

Forget the manual work of managing containers and juggling ports. We created a fully interactive Studies Manager (a DevOps script)!

It creates the right containers for the right scenarios (e.g., Thin vs Fat cluster), requests data download, and executes scripts via spark-submit, all through a guided and friendly menu.

Step 1: Give Execution Permission (Only once)

```bash
chmod +x manage_studies.sh
```

Step 2: Run the Studies Manager

```bash
./manage_studies.sh
```

Step 3: Follow the Interactive Menu

  1. If the data does not exist, the terminal will ask: "Do you want to download it now? (y/n)". Type y.
  2. Next, choose which Theme you want to study (from 1 to 9).
  3. The script will start the isolated Spark cluster in Docker.
  4. The script will ask if you want to run the study's Job. Choose y.
  5. Read the didactic warnings that the running job prints directly to the terminal and watch the lessons unfold live!

🕵️ Monitoring Spark (UI Interfaces)

The script itself will provide the exact link depending on the configuration. You can access it in the browser to investigate:

Study Tip: On the History Server, go to the SQL tab to visually see DAGs and Shuffles. Access the Stages tab to identify Stragglers (long green bars that stall execution)!

Happy studying! 🚀
