This GitHub repository contains the project files for the NYC Rideshare Analysis, where we applied Apache Spark to analyze a comprehensive dataset of Uber and Lyft rides from January 1, 2023, to May 31, 2023. The analysis focuses on uncovering insights about ride frequencies, driver earnings, passenger waiting times, and route popularity to inform operational strategies for rideshare companies.
The analysis utilizes two primary datasets provided by the NYC Taxi and Limousine Commission (TLC), distributed under the MIT license:
- rideshare_data.csv: Contains detailed information on individual rideshare trips.
- taxi_zone_lookup.csv: Provides mapping information for the taxi zones referenced in the rideshare dataset.
- Source and Licensing: The datasets are sourced from the NYC Taxi and Limousine Commission (TLC) Trip Record Data and are distributed under the MIT license. More details and the raw data are available from the TLC Trip Record Data page.
rideshare_data.csv fields:
- business: Uber or Lyft
- pickup_location: Taxi zone ID where the trip started
- dropoff_location: Taxi zone ID where the trip ended
- trip_length: Distance in miles
- request_to_pickup: Time from ride request to pickup in seconds
- total_ride_time: Total ride duration in seconds
- date: Date of the ride as a UNIX timestamp
taxi_zone_lookup.csv fields:
- LocationID: Numeric ID corresponding to taxi zones
- Borough: NYC borough
- Zone: Specific area within the borough
- service_zone: Service area category
This project consists of several analytical tasks implemented using Spark:
- Task 1: Merging Datasets - Integrating rideshare data with taxi zone information.
- Task 2: Aggregation of Data - Analyzing trip counts, platform profits, and driver earnings.
- Task 3: Top-K Processing - Identifying top pickup and dropoff boroughs and busiest routes.
- Task 4: Average of Data - Calculating average earnings, trip lengths, and earnings per mile by time of day.
- Task 5: Finding Anomalies - Examining anomalies in waiting times during January.
- Task 6: Filtering Data - Filtering data to find specific trip count ranges and routes between boroughs.
- Task 7: Routes Analysis - Determining the most popular routes based on total trip counts.
The project is structured into separate scripts, Jupyter notebooks, and directories for each analytical task, outputs, and CSV files. Below is a detailed breakdown of the files corresponding to each task:
- Merging_Datasets_01.py: Script for merging the rideshare and taxi zone lookup datasets. This script must be run first as it prepares the data necessary for all subsequent analyses.
- Aggregation_of_Data_02.py: Script for aggregating data to calculate trip counts, platform profits, and driver earnings.
- Aggregation_of_Data_Visualisation_02.ipynb: Jupyter notebook used for visualizing the aggregated data.
- Top_K_Processing_03.py: Script for identifying the top pickup and dropoff boroughs and the busiest routes.
- Average_of_Data_04.py: Script for calculating average earnings, trip lengths, and earnings per mile by time of day.
- Finding_anomalies_05.py: Script for detecting anomalies in waiting times during January.
- Finding_anomalies_Visualisation_05.ipynb: Jupyter notebook used for visualizing the anomalies in waiting times.
- Filtering_Data_06.py: Script for filtering data to find specific trip count ranges and routes between boroughs.
- Routes_Analysis_07.py: Script for analyzing the most popular routes based on total trip counts.
Screenshots of outputs are organized by task in the output_screenshot directory.
Organized folders containing CSV outputs from specific tasks:
- Aggregation_of_Data_csv_output_02 (contains the total_earnings/, total_profit/, and trip_count/ subfolders)
- Finding_anomalies_csv_output_05
These folders contain all CSV files generated as outputs from the scripts, which are used for further analysis or visualization.
CSV files stored in the data export folders can be used to replicate the visualizations or to perform further analysis. Screenshots in the output folders can be used to quickly verify and compare the results obtained.
In this project, several Apache Spark functions and additional APIs were utilized to manipulate data, perform analyses, and prepare outputs for visualization. Below are key methods and their applications across different tasks:
- Spark SQL Functions: Used for filtering, aggregating, and transforming datasets. Functions such as groupBy, agg, sum, avg, and orderBy were instrumental in performing complex data manipulations.
- DataFrame Operations: Including filter, select, withColumn, and drop, used extensively to prepare and adjust data for analysis.
- Export to CSV: After processing the data with Spark, the results were exported to CSV files for visualization and sharing. This was achieved using:
  - ccc method bucket ls: Command to list the contents of our S3 bucket, ensuring we targeted the correct dataset for export.
  - ccc method bucket cp -r bkt:task-specific-folder output_folder: Command to copy processed data from our S3 bucket to a designated output location, making it accessible for further analysis in tools like Jupyter Notebook.
- Jupyter Notebook: Used for generating histograms and other visual representations of the data. After exporting the data to CSV, Jupyter Notebooks were employed to visualize trends and anomalies using libraries like Matplotlib and Seaborn.
- S3 Bucket Integration: Integrated AWS S3 buckets for secure and scalable storage of raw and processed data, which was essential for handling large datasets efficiently.
- Market Dominance: Uber consistently outperforms Lyft in trip volume and profitability, especially in Manhattan.
- Earnings Insights: Drivers earn more in the afternoon and on longer trips at night, suggesting optimal times for drivers to work.
- Anomaly Detection: Notable increase in waiting times on New Year’s Day, likely due to increased demand.
- Route Popularity: Specific routes like Brooklyn to Staten Island show unexpectedly high traffic, indicating areas for operational focus.
The analysis provided in this repository illustrates the power of big data in understanding and optimizing the rideshare industry. Insights gained from this project can help rideshare companies refine their strategies, improve driver satisfaction, and enhance customer service.
To replicate this analysis:
- Clone this repository.
- Ensure Apache Spark and required libraries are installed.
- Execute the provided Spark scripts, running Merging_Datasets_01.py first, since it prepares the merged dataset required by all subsequent tasks.