To reproduce the project, follow the instructions below.
- Install Docker:
  - Download Docker from https://www.docker.com/.
  - Verify the installation: `docker --version`
- Install Docker Compose:
  - Included with Docker Desktop for Mac and Windows.
  - Verify the installation: `docker-compose --version`
- Clone the repository:

  ```bash
  git clone https://github.com/stesilva/ScholAmigo
  cd ScholAmigo
  ```

- It is possible that `chromium:arm64` and `chromium-driver:arm64` in the Dockerfile will not work on Windows; in that case, Chrome and ChromeDriver need to be installed manually. After that, update `scraping_daad.py` with the relevant paths:

  ```python
  options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
  ...
  driver = webdriver.Chrome(executable_path=r"C:\chromedriver\chromedriver.exe", options=options)
  ```
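- If your environment has Selenium 4 installed (where the `executable_path` argument shown above was removed), the equivalent setup would look roughly like the sketch below, reusing the same example Windows paths:

  ```python
  # Sketch for Selenium 4: pass the driver path via a Service object instead of
  # executable_path. The binary and driver paths are the same example paths as above;
  # adjust them to your installation.
  from selenium import webdriver
  from selenium.webdriver.chrome.options import Options
  from selenium.webdriver.chrome.service import Service

  options = Options()
  options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
  service = Service(r"C:\chromedriver\chromedriver.exe")
  driver = webdriver.Chrome(service=service, options=options)
  ```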
- For a Linux-based OS, run the following commands to create the directories for Airflow and set the Airflow user ID:

  ```bash
  mkdir -p ./dags ./logs ./plugins ./aws ./scripts ./sql ./outputs
  echo -e "AIRFLOW_UID=$(id -u)" > .env
  ```

- For other operating systems, you may get a warning that AIRFLOW_UID is not set, but you can safely ignore it. You can also manually create an `.env` file in the same folder as `docker-compose.yaml` with the following content to get rid of the warning:

  ```
  AIRFLOW_UID=50000
  ```

- Copy the DAG files → place them in the `./dags` folder.
- Create a folder called `aws` and place the `config` and `credentials` files (sent by email) in it.
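- For reference, these two files follow the standard AWS CLI `config`/`credentials` layout. The values below are placeholders only; use the emailed files as-is:

  ```ini
  # aws/credentials
  [default]
  aws_access_key_id = YOUR_ACCESS_KEY_ID
  aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

  # aws/config
  [default]
  region = YOUR_AWS_REGION
  output = json
  ```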
- Final structure of the directory should look like this:

```
ScholAmigo/
├── dags/
│   ├── airflow_batch_daad.py
│   ├── ...
├── logs/
├── plugins/
├── aws/
│   ├── config
│   ├── credentials
├── scripts/
│   ├── entrypoint.sh
│   ├── trusted_zone_daad.py
│   ├── example_script.py
├── sql/
│   ├── create_queries.sql
│   ├── insert_queries.sql
├── outputs/
│   ├── ...
├── kafka_consumer.py
├── requirements.txt
├── docker-compose.yaml
├── Dockerfile
├── Dockerfile-consumer
```
Navigate to the main folder (the one with `docker-compose.yaml`) and run these commands to build and start your Airflow environment:

```bash
docker-compose build
docker-compose up airflow-init
docker-compose up
```

- Open your browser and navigate to http://localhost:8080.
- Use the following credentials to log in:
- Username: airflow
- Password: airflow
- Manually trigger the DAGs for demonstration
To view the messages produced by the Kafka producers, open http://localhost:9021.
To visualize the graph that is loaded after triggering the 'load_neo4j_data' DAG, open http://localhost:7474.
- Spark transformations need to be executed outside the container. To replicate them, you will need Spark and the Hadoop JARs installed locally (see the sketch after this list).
- Airflow did not work with Spark in this setup. As a workaround, Mac users can schedule the example script for the DAAD trusted zone (under the `scripts` folder) using a cron job; this demonstrates that scheduling is possible even without Airflow.
- Define API keys for AWS, Pinecone, and Gemini in the appropriate configuration files or environment variables.
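As referenced above, here is a minimal sketch of the kind of landing-to-trusted Spark transformation that runs outside the container. It assumes PySpark plus the hadoop-aws JARs are available locally and that AWS credentials are configured; the bucket names, prefixes, and column name are illustrative placeholders, not the project's actual ones:

```python
# Minimal landing-zone -> trusted-zone batch job sketch (placeholder names throughout).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("daad_trusted_zone_example")  # hypothetical app name
    .getOrCreate()
)

# Read raw scraped records from the landing zone (hypothetical bucket/prefix).
raw = spark.read.json("s3a://scholamigo-landing-zone/daad/")

# Basic validation before promoting the data: drop duplicates and rows missing a key field.
trusted = raw.dropDuplicates().na.drop(subset=["scholarship_name"])  # hypothetical column

# Write the validated data to the trusted zone as Parquet (hypothetical bucket).
trusted.write.mode("overwrite").parquet("s3a://scholamigo-trusted-zone/daad/")

spark.stop()
```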
Ensure sufficient system resources are allocated to Docker (Airflow recommends allocating 10GB of RAM for Docker; see https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#fetching-docker-compose-yaml).
In the 'outputs' folder, we present the files generated by running the pipeline. These files demonstrate the structure of the data after it has been extracted from the data sources; the same files are also stored in the Amazon S3 buckets created for this project.
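If you want to confirm the uploads yourself, a minimal boto3 sketch is shown below. It assumes the credentials from the `aws` folder are available to boto3 (for example under `~/.aws/`); the bucket name and prefix are placeholders, so substitute the ones configured for this project:

```python
# List objects written by the pipeline to an S3 bucket (placeholder bucket/prefix).
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="scholamigo-landing-zone", Prefix="daad/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```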
To execute some of the exploitation zone applications, you can:
- Run the files `user_alumni_recommendation` or `user_analytics` as examples. These scripts demonstrate how to use the processed data for recommendations and analytics.
- Trigger the 'stream' DAG to produce Kafka messages and observe the Kafka consumer's output in the terminal, where you can see messages with scholarship recommendations whenever the consumer receives 'Save' button clicks (see the sketch below if you want to inspect the raw messages directly).
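A minimal, standalone way to peek at the stream, independent of `kafka_consumer.py`, is sketched below using the kafka-python package. The topic name and broker address are assumptions; adjust them to the ones defined in `docker-compose.yaml` and the 'stream' DAG:

```python
# Print every message arriving on a Kafka topic (placeholder topic and broker address).
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user_clicks",                       # hypothetical topic name
    bootstrap_servers="localhost:9092",  # assumed externally exposed broker listener
    auto_offset_reset="earliest",
    value_deserializer=lambda m: m.decode("utf-8"),
)

for message in consumer:
    print(f"{message.topic} @ offset {message.offset}: {message.value}")
```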
The ScholAmigo architecture is designed to efficiently collect, process, and utilize diverse data sources to recommend scholarships to students.
Data is ingested from scholarship sources and simulated LinkedIn/user activity using web scraping and Python scripts, orchestrated by Airflow. These data streams are sent via Kafka for real-time processing and stored in S3 ingestion buckets (Landing Zone).
Batch processes using Apache Spark transform and validate data, moving it to trusted S3 buckets (Trusted Zone). Further processing prepares data for the Exploitation Zone, where various systems power the recommendation engine:
- A PostgreSQL database supports real-time queries.
- Redis enables real-time recommendations by caching recent user clickstream data.
- Neo4j and Pinecone provide graph-based and embedding-based recommendations for peer and alumni insights.
The architecture supports both real-time and batch processing, ensuring up-to-date and relevant scholarship matches for users.
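To make the Redis point above concrete, here is a minimal sketch of clickstream caching with redis-py. The key names, payload fields, and the 50-event window are illustrative assumptions, not the project's actual schema:

```python
# Cache each user's most recent clicks in a capped Redis list (illustrative schema).
import json
import time

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_click(user_id, scholarship_id):
    """Push a click event onto the user's recent-activity list and keep only the last 50."""
    event = json.dumps({"scholarship_id": scholarship_id, "ts": time.time()})
    key = f"clicks:{user_id}"
    r.lpush(key, event)
    r.ltrim(key, 0, 49)

def recent_clicks(user_id):
    """Return the cached recent clicks, newest first, for the recommender to consume."""
    return [json.loads(e) for e in r.lrange(f"clicks:{user_id}", 0, -1)]
```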
If you encounter any issues, feel free to open an issue on the GitHub repository.
