From 8aece7f824f7c5c2010de280f28e79889c627fab Mon Sep 17 00:00:00 2001 From: kashish yadav <51848533+kashishyadav@users.noreply.github.com> Date: Tue, 30 Sep 2025 23:26:48 -0500 Subject: [PATCH] Update README.md --- README.md | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 80 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index dcd483a..f442b67 100644 --- a/README.md +++ b/README.md @@ -1 +1,80 @@ -# Pyspark-With-Python \ No newline at end of file +# Pyspark-With-Python +πŸ”Ή What is PySpark? +PySpark is the Python API for Apache Spark, a distributed computing framework used to process large-scale data efficiently. +If you’re dealing with gigabytes, terabytes, or petabytes of data, Spark makes it possible to process it quickly using clusters of machines instead of a single computer. + +πŸ”Ή Why PySpark? +Speed: Spark processes data in memory, making it faster than traditional MapReduce. +Scalability: Can run on a laptop, a cluster, or the cloud. +Flexibility: APIs available in Python, Scala, Java, and R. +Libraries: Spark SQL, MLlib (Machine Learning), GraphX, Structured Streaming. + +Spark Architecture (Simplified) +Driver: Coordinates the execution of jobs. +Cluster Manager: Manages resources. +Executors: Workers that process the data. +Driver Program β†’ Cluster Manager β†’ Executors + (Coordinator) (Resource Manager) (Workers doing actual tasks) +Imagine you are organizing a wedding πŸŽ‰ + +Driver (Project Manager / Wedding Planner): +Plans all tasks (food, decoration, music). + +Cluster Manager (HR/Coordinator): +Decides how many cooks, decorators, and musicians you can hire based on budget. + +Worker Nodes (Teams): + +Cooks (Executor 1) prepare food. + +Decorators (Executor 2) decorate the hall. + +Musicians (Executor 3) manage music. + +The Driver (wedding planner) coordinates with Cluster Manager (HR) β†’ who assigns tasks to actual Workers (teams) β†’ and results come back as a successful wedding (final program output). + +πŸ”Ή Diagram (Text-based) + + | Driver | + | (SparkContext, DAG) | + +-----------+-----------+ + | + v + +-----------------------+ + | Cluster Manager | <-- Manages resources + +-----------+-----------+ + | + ----------------------------------------------- + | | ++---------------+ +---------------+ +| Worker Node 1| | Worker Node 2 | +| Executors | | Executors | +| (Tasks) | | (Tasks) | ++---------------+ +---------------+ +πŸ”Ή First PySpark Program +from pyspark.sql import SparkSession +# Step 1: Start SparkSession +spark = SparkSession.builder \ + .appName("PySparkIntro") \ + .getOrCreate() +# Step 2: Create a simple DataFrame +df = spark.range(1, 6) +df.show() +Output: + ++---+ +| id| ++---+ +| 1| +| 2| +| 3| +| 4| +| 5| ++---+ +Key Takeaway +PySpark = Spark + Python API. + +Built for big data processing. + +First step in any PySpark job β†’ Create a SparkSession. +