This project implements a Graph Variational Autoencoder (GVAE) with an attention-based encoder (GATv2) to learn meaningful latent representations of nodes in a Protein-Protein Interaction (PPI) network.
The model is designed to reconstruct the graph structure and, in doing so, learn powerful node embeddings that can be used for various downstream bioinformatics tasks, such as link prediction, community detection, and functional module discovery.
The core of this project is a Graph Attention VAE.
- Encoder: The encoder uses a series of
GATv2Conv(Graph Attention) layers to process the entire graph and produce a latent distribution (mean and log-variance) for each node. This allows the model to learn a rich, topology-aware representation for every protein. - Decoder: The decoder reconstructs the graph's adjacency matrix by taking the inner product of the latent node embeddings. This is a direct and efficient method for graph generation from a latent space.
- Loss Function: The model is trained by optimizing a combination of:
- Reconstruction Loss: A weighted Binary Cross-Entropy loss that ensures the decoded graph is similar to the original.
- KL Divergence Loss: A regularization term that ensures the learned latent space is smooth and well-structured.
This architecture has proven highly effective for link prediction on the provided dataset, achieving an ROC AUC of over 0.99.
This project has been updated to support cross-platform execution on:
- NVIDIA GPUs (CUDA) for Linux/Windows.
- Apple Silicon GPUs (MPS) for macOS.
- CPUs as a fallback.
The system automatically detects the best available hardware accelerator.
Prerequisites:
- Python 3.10 or higher (3.10, 3.11, 3.12 are supported and tested).
Create a virtual environment and install the required dependencies:
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txtThe model expects a processed graph. If you have a raw edgelist (e.g., data/raw/hippie_current.tsv), run the preprocessing script:
python3 -m src.data.preprocessThis will generate the necessary adj.npz, node2idx.json, and stats.json files in the data/processed directory.
Alternatively, you can generate a synthetic dataset for testing:
python3 -m src.data.synthetic_generator --output_dir data/processed --num_nodes 100 --num_layers 2To train the GVAE model, run the main experiment script:
python3 -m src.experiments.run_gvae_experimentThis will start the training process using the parameters defined in src/experiments/configs/gvae_config.yaml. The script will automatically select the best available device ("auto"). It will periodically print evaluation metrics and, upon completion, save the final node embeddings to experiments/gvae_run/final_embeddings.pth.
The primary output of the GVAE is the learned node embeddings (final_embeddings.pth). These embeddings can be used for further biological analysis:
- Cluster Analysis:
python3 -m src.experiments.analyze_clusters --embedding_path experiments/gvae_run/final_embeddings.pth
- Link Prediction:
python3 -m src.experiments.predict_fgfr_interactions --skip_api
This project includes a GitHub Actions workflow (.github/workflows/main.yml) that automatically:
- Sets up a Python environment (3.10/3.11/3.12).
- Installs dependencies.
- Generates synthetic data.
- Trains the model.
- Runs analysis and figure generation scripts.
- Uploads generated figures as artifacts.
This ensures robustness across Ubuntu (CUDA-ready environment) and macOS (MPS-ready environment).
MD. Arshad (Sulkysubject37) Jamia Millia Islamia
This project is licensed under the MIT License - see the LICENSE file for details. Copyright (c) 2026 MD. Arshad (Sulkysubject37)