The "spark" in the name represents my exploration of Apache Spark (along with its libraries). This repository captures that. My successful MongoDB to SparkSQL setup is in setup - success.ipynb
- Figure out what kind of data I'd like to scrape (maybe basketball/sports data?)
- Start the data-scraping pipeline, such as a script that runs once a day on AWS Lambda (a sketch follows this list)
- Map out the system design for the scraped data (raw data landing in AWS S3)
- Set up a transformation job in PySpark and use AWS Glue to trigger it when new data lands in S3 (sketched below)
- When the transformation is done, write the pre-processed data into a MongoDB database, likely Cluster0. The pre-existing setup in `setup - success.ipynb` is MongoDB $\rightarrow$ PySpark, so the reverse direction, PySpark $\rightarrow$ MongoDB, still needs to be established (sketched below)
- Establish analytics and create dashboards/websites with the findings. Ideas vary; more complicated ML is possible if working with word- or image-based data
- Deploy the solution on AWS EC2 or Vercel, connecting to the backend through a lightweight API (sketched below)
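
A minimal sketch of the scraping step, assuming a daily-triggered AWS Lambda function that pulls a JSON endpoint and drops the raw payload into S3. The bucket name, URL, and key layout are placeholders, not decisions made yet.

```python
# Hypothetical Lambda handler: fetch one JSON payload per day and land it raw in S3.
# Bucket name and target URL are placeholders; a real scraper would add retries,
# politeness (robots.txt, rate limits), and error handling.
import datetime
import urllib.request

import boto3

BUCKET = "my-scrape-landing-zone"                # assumed bucket name
TARGET_URL = "https://example.com/stats.json"    # assumed data source

def handler(event, context):
    payload = urllib.request.urlopen(TARGET_URL, timeout=30).read()
    key = f"raw/{datetime.date.today().isoformat()}.json"
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=payload)
    return {"status": "ok", "key": key}
```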
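
For the transformation step, a sketch of a plain PySpark script of the kind AWS Glue can run as a job. The S3 paths, columns, and cleaning logic are assumptions, just to show the shape of the job.

```python
# Hypothetical PySpark transformation: read raw JSON from S3, do light cleaning,
# and write the result back to a processed prefix. Paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("scrape-transform").getOrCreate()

raw = spark.read.json("s3://my-scrape-landing-zone/raw/")       # assumed input path

cleaned = (
    raw.dropDuplicates()
       .withColumn("scraped_at", F.current_timestamp())         # tag each row with a load timestamp
)

cleaned.write.mode("overwrite").parquet(
    "s3://my-scrape-landing-zone/processed/"                    # assumed output path
)
```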
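
The existing notebook covers MongoDB $\rightarrow$ PySpark; here is a sketch of the reverse direction using the MongoDB Spark Connector (10.x naming). The connection URI, database, and collection names are placeholders.

```python
# Hypothetical PySpark -> MongoDB write via the MongoDB Spark Connector (v10.x,
# where the format name is "mongodb"). URI, database, and collection are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("write-to-mongo")
    .config("spark.mongodb.write.connection.uri",
            "mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")
    .getOrCreate()
)

df = spark.read.parquet("s3://my-scrape-landing-zone/processed/")   # assumed input

(df.write
   .format("mongodb")                  # older 3.x connectors use "mongo" instead
   .mode("append")
   .option("database", "scraped")      # assumed database name
   .option("collection", "stats")      # assumed collection name
   .save())
```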
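
For the lightweight API in the last step, a minimal read-only sketch using FastAPI and pymongo (one possible choice, not a decision). The connection string, database, and collection names are placeholders.

```python
# Hypothetical read-only API over the MongoDB collection, suitable for a small
# EC2 instance or a serverless host. Framework choice (FastAPI) and names are assumptions.
import os

from fastapi import FastAPI
from pymongo import MongoClient

app = FastAPI()
client = MongoClient(os.environ["MONGODB_URI"])   # e.g. the Cluster0 connection string
collection = client["scraped"]["stats"]           # assumed database/collection

@app.get("/stats")
def latest_stats(limit: int = 20):
    # Return the most recently loaded documents, dropping Mongo's internal _id field.
    docs = collection.find({}, {"_id": 0}).sort("scraped_at", -1).limit(limit)
    return list(docs)
```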
I asked Google Gemini how I could set up the pipeline, and it produced a comprehensive AWS-based solution covering everything from the scraping procedure to deployment. Those details are more in-depth than the outline above and are in `gemini-response.md`.