This GitHub repository contains the project files for the NYC Rideshare Analysis, where we applied Apache Spark to analyze a comprehensive dataset of Uber and Lyft rides from January 1, 2023, to May 31, 2023. The analysis focuses on uncovering insights about ride frequencies, driver earnings, passenger waiting times, and route popularity to inform operational strategies for rideshare companies.
The analysis utilizes two primary datasets provided by the NYC Taxi and Limousine Commission (TLC), distributed under the MIT license:
- rideshare_data.csv: Contains detailed information on individual rideshare trips.
- taxi_zone_lookup.csv: Provides mapping information for the taxi zones referenced in the rideshare dataset.
- Source and Licensing: The datasets are sourced from the NYC Taxi and Limousine Commission (TLC) Trip Record Data and are distributed under the MIT license. More details and the raw data are available from the TLC Trip Record Data page.
rideshare_data.csv fields:
- business: Uber or Lyft
- pickup_location: Taxi zone ID where the trip started
- dropoff_location: Taxi zone ID where the trip ended
- trip_length: Distance in miles
- request_to_pickup: Time from ride request to pickup in seconds
- total_ride_time: Total ride duration in seconds
- date: Date of the ride as a UNIX timestamp
taxi_zone_lookup.csv fields:
- LocationID: Numeric ID corresponding to taxi zones
- Borough: NYC borough
- Zone: Specific area within the borough
- service_zone: Service area category
This project consists of several analytical tasks implemented using Spark:
- Task 1: Merging Datasets - Integrating rideshare data with taxi zone information.
- Task 2: Aggregation of Data - Analyzing trip counts, platform profits, and driver earnings.
- Task 3: Top-K Processing - Identifying top pickup and dropoff boroughs and busiest routes.
- Task 4: Average of Data - Calculating average earnings, trip lengths, and earnings per mile by time of day.
- Task 5: Finding Anomalies - Examining anomalies in waiting times during January.
- Task 6: Filtering Data - Filtering data to find specific trip count ranges and routes between boroughs.
- Task 7: Routes Analysis - Determining the most popular routes based on total trip counts.
The project is structured into separate scripts, Jupyter notebooks, and directories for each analytical task, outputs, and CSV files. Below is a detailed breakdown of the files corresponding to each task:
- Merging_Datasets_01.py: Script for merging the rideshare and taxi zone lookup datasets. This script must be run first as it prepares the data necessary for all subsequent analyses.
- Aggregation_of_Data_02.py: Script for aggregating data to calculate trip counts, platform profits, and driver earnings.
- Aggregation_of_Data_Visualisation_02.ipynb: Jupyter notebook used for visualizing the aggregated data.
- Top_K_Processing_03.py: Script for identifying the top pickup and dropoff boroughs and the busiest routes.
- Average_of_Data_04.py: Script for calculating average earnings, trip lengths, and earnings per mile by time of day.
- Finding_anomalies_05.py: Script for detecting anomalies in waiting times during January.
- Finding_anomalies_Visualisation_05.ipynb: Jupyter notebook used for visualizing the anomalies in waiting times.
- Filtering_Data_06.py: Script for filtering data to find specific trip count ranges and routes between boroughs.
- Routes_Analysis_07.py: Script for analyzing the most popular routes based on total trip counts.
Screenshots of outputs are organized by task in the output_screenshot directory.
Organized folders containing CSV outputs from specific tasks:
- Aggregation_of_Data_csv_output_02 (contains the total_earnings/, total_profit/, and trip_count/ subfolders)
- Finding_anomalies_csv_output_05
These folders contain all CSV files generated as outputs from the scripts, which are used for further analysis or visualization.
CSV files stored in the data export folders can be used to replicate the visualizations or to perform further analysis. Screenshots in the output folders can be used to quickly verify and compare the results obtained.
In this project, several Apache Spark functions and additional APIs were utilized to manipulate data, perform analyses, and prepare outputs for visualization. Below are key methods and their applications across different tasks:
- Spark SQL Functions: Used for filtering, aggregating, and transforming datasets. Functions such as groupBy, agg, sum, avg, and orderBy were instrumental in performing complex data manipulations.
- DataFrame Operations: Including filter, select, withColumn, and drop, used extensively to prepare and adjust data for analysis.
- Export to CSV: After processing the data with Spark, the results were exported to CSV files for visualization and sharing. This was achieved using:
  - ccc method bucket ls: Command to list the contents of our S3 bucket, ensuring we targeted the correct dataset for export.
  - ccc method bucket cp -r bkt:task-specific-folder output_folder: Command to copy processed data from our S3 bucket to a designated output location, making it accessible for further analysis in tools like Jupyter Notebook.
- Jupyter Notebook: Used for generating histograms and other visual representations of the data. After exporting the data to CSV, Jupyter Notebooks were employed to visualize trends and anomalies using libraries like Matplotlib and Seaborn.
- S3 Bucket Integration: Integrated AWS S3 buckets for secure and scalable storage of raw and processed data, which was essential for handling large datasets efficiently.
- Market Dominance: Uber consistently outperforms Lyft in trip volume and profitability, especially in Manhattan.
- Earnings Insights: Drivers earn more in the afternoon and on longer trips at night, suggesting optimal times for drivers to work.
- Anomaly Detection: Notable increase in waiting times on New Year’s Day, likely due to increased demand.
- Route Popularity: Specific routes like Brooklyn to Staten Island show unexpectedly high traffic, indicating areas for operational focus.
The analysis provided in this repository illustrates the power of big data in understanding and optimizing the rideshare industry. Insights gained from this project can help rideshare companies refine their strategies, improve driver satisfaction, and enhance customer service.
To replicate this analysis:
- Clone this repository.
- Ensure Apache Spark and required libraries are installed.
- Execute the provided Spark scripts, running Merging_Datasets_01.py first, since it prepares the merged dataset required by all subsequent tasks.