To reproduce the project, follow the instructions below.
- Install Docker:
  - Download Docker from https://www.docker.com/.
  - Verify the installation: `docker --version`
- Install Docker Compose:
  - Included with Docker Desktop for Mac and Windows.
  - Verify the installation: `docker-compose --version`
- Clone the repository:

  ```bash
  git clone https://github.com/stesilva/ScholAmigo
  cd ScholAmigo
  ```

- It is possible that `chromium:arm64` and `chromium-driver:arm64` in the Dockerfile will not work on Windows; in that case, Chrome and ChromeDriver need to be installed manually. After that, update `scraping_daad.py` with the relevant paths:

  ```python
  options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
  ...
  driver = webdriver.Chrome(executable_path=r"C:\chromedriver\chromedriver.exe", options=options)
  ```
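- If your environment has Selenium 4 installed (where the `executable_path` argument shown above was removed), the equivalent setup would look roughly like the sketch below, reusing the same example Windows paths:

  ```python
  # Sketch for Selenium 4: pass the driver path via a Service object instead of
  # executable_path. The binary and driver paths are the same example paths as above;
  # adjust them to your installation.
  from selenium import webdriver
  from selenium.webdriver.chrome.options import Options
  from selenium.webdriver.chrome.service import Service

  options = Options()
  options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
  service = Service(r"C:\chromedriver\chromedriver.exe")
  driver = webdriver.Chrome(service=service, options=options)
  ```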
- For a Linux-based OS, run the following commands to create the directories for Airflow and set the Airflow user ID:

  ```bash
  mkdir -p ./dags ./logs ./plugins ./aws ./scripts ./sql ./outputs
  echo -e "AIRFLOW_UID=$(id -u)" > .env
  ```

- For other operating systems, you may get a warning that AIRFLOW_UID is not set, but you can safely ignore it. You can also manually create an `.env` file in the same folder as `docker-compose.yaml` with the following content to get rid of the warning:

  ```
  AIRFLOW_UID=50000
  ```

- Copy the DAG files → place them in the `./dags` folder.
- Create a folder called `aws` and place the `config` and `credentials` files (sent by email) in it.
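- For reference, these two files follow the standard AWS CLI `config`/`credentials` layout. The values below are placeholders only; use the emailed files as-is:

  ```ini
  # aws/credentials
  [default]
  aws_access_key_id = YOUR_ACCESS_KEY_ID
  aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

  # aws/config
  [default]
  region = YOUR_AWS_REGION
  output = json
  ```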
- Final structure of the directory should look like this:

```
ScholAmigo/
├── dags/
│   ├── airflow_batch_daad.py
│   ├── ...
├── logs/
├── plugins/
├── aws/
│   ├── config
│   ├── credentials
├── scripts/
│   ├── entrypoint.sh
│   ├── trusted_zone_daad.py
│   ├── example_script.py
├── sql/
│   ├── create_queries.sql
│   ├── insert_queries.sql
├── outputs/
│   ├── ...
├── kafka_consumer.py
├── requirements.txt
├── docker-compose.yaml
├── Dockerfile
├── Dockerfile-consumer
```
Navigate to the main folder (the one with `docker-compose.yaml`) and run these commands to build and start your Airflow environment:

```bash
docker-compose build
docker-compose up airflow-init
docker-compose up
```

- Open your browser and navigate to http://localhost:8080.
- Use the following credentials to log in:
- Username: airflow
- Password: airflow
- Manually trigger the DAGs for demonstration
To view the messages produced by the Kafka producers, open http://localhost:9021.
To visualize the graph that is loaded after triggering the 'load_neo4j_data' DAG, open http://localhost:7474.
- Spark transformations need to be executed outside the container. To replicate them, you will need Spark and the Hadoop JARs installed locally (see the sketch after this list).
- Airflow did not work with Spark in this setup. As a workaround, Mac users can schedule the example script for the DAAD trusted zone (under the `scripts` folder) using a cron job; this demonstrates that scheduling is possible even without Airflow.
- Define API keys for AWS, Pinecone, and Gemini in the appropriate configuration files or environment variables.
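As referenced above, here is a minimal sketch of the kind of landing-to-trusted Spark transformation that runs outside the container. It assumes PySpark plus the hadoop-aws JARs are available locally and that AWS credentials are configured; the bucket names, prefixes, and column name are illustrative placeholders, not the project's actual ones:

```python
# Minimal landing-zone -> trusted-zone batch job sketch (placeholder names throughout).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("daad_trusted_zone_example")  # hypothetical app name
    .getOrCreate()
)

# Read raw scraped records from the landing zone (hypothetical bucket/prefix).
raw = spark.read.json("s3a://scholamigo-landing-zone/daad/")

# Basic validation before promoting the data: drop duplicates and rows missing a key field.
trusted = raw.dropDuplicates().na.drop(subset=["scholarship_name"])  # hypothetical column

# Write the validated data to the trusted zone as Parquet (hypothetical bucket).
trusted.write.mode("overwrite").parquet("s3a://scholamigo-trusted-zone/daad/")

spark.stop()
```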
Ensure sufficient system resources are allocated to Docker (Airflow recommends allocating 10GB of RAM for Docker; see https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#fetching-docker-compose-yaml).
In the 'outputs' folder, we present the files generated by running the pipeline. These files demonstrate the structure of the data after it has been extracted from the data sources; the same files are also stored in the Amazon S3 buckets created for this project.
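If you want to confirm the uploads yourself, a minimal boto3 sketch is shown below. It assumes the credentials from the `aws` folder are available to boto3 (for example under `~/.aws/`); the bucket name and prefix are placeholders, so substitute the ones configured for this project:

```python
# List objects written by the pipeline to an S3 bucket (placeholder bucket/prefix).
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="scholamigo-landing-zone", Prefix="daad/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```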
To execute some of the exploitation zone applications, you can:
- Run the files `user_alumni_recommendation` or `user_analytics` as examples. These scripts demonstrate how to use the processed data for recommendations and analytics.
- Trigger the 'stream' DAG to produce Kafka messages and observe the Kafka consumer's output in the terminal, where you can see messages with scholarship recommendations whenever the consumer receives 'Save' button clicks (see the sketch below if you want to inspect the raw messages directly).
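A minimal, standalone way to peek at the stream, independent of `kafka_consumer.py`, is sketched below using the kafka-python package. The topic name and broker address are assumptions; adjust them to the ones defined in `docker-compose.yaml` and the 'stream' DAG:

```python
# Print every message arriving on a Kafka topic (placeholder topic and broker address).
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user_clicks",                       # hypothetical topic name
    bootstrap_servers="localhost:9092",  # assumed externally exposed broker listener
    auto_offset_reset="earliest",
    value_deserializer=lambda m: m.decode("utf-8"),
)

for message in consumer:
    print(f"{message.topic} @ offset {message.offset}: {message.value}")
```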
The ScholAmigo architecture is designed to efficiently collect, process, and utilize diverse data sources to recommend scholarships to students.
Data is ingested from scholarship sources and simulated LinkedIn/user activity using web scraping and Python scripts, orchestrated by Airflow. These data streams are sent via Kafka for real-time processing and stored in S3 ingestion buckets (Landing Zone).
Batch processes using Apache Spark transform and validate data, moving it to trusted S3 buckets (Trusted Zone). Further processing prepares data for the Exploitation Zone, where various systems power the recommendation engine:
- A PostgreSQL database supports real-time queries.
- Redis enables real-time recommendations by caching recent user clickstream data.
- Neo4j and Pinecone provide graph-based and embedding-based recommendations for peer and alumni insights.
The architecture supports both real-time and batch processing, ensuring up-to-date and relevant scholarship matches for users.
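To make the Redis point above concrete, here is a minimal sketch of clickstream caching with redis-py. The key names, payload fields, and the 50-event window are illustrative assumptions, not the project's actual schema:

```python
# Cache each user's most recent clicks in a capped Redis list (illustrative schema).
import json
import time

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_click(user_id, scholarship_id):
    """Push a click event onto the user's recent-activity list and keep only the last 50."""
    event = json.dumps({"scholarship_id": scholarship_id, "ts": time.time()})
    key = f"clicks:{user_id}"
    r.lpush(key, event)
    r.ltrim(key, 0, 49)

def recent_clicks(user_id):
    """Return the cached recent clicks, newest first, for the recommender to consume."""
    return [json.loads(e) for e in r.lrange(f"clicks:{user_id}", 0, -1)]
```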
If you encounter any issues, feel free to open an issue on the GitHub repository.
