ANL Summer 2025 Research Project
This project analyzes protein sequence divergence using protein language models and multiple sequence alignment techniques. The pipeline processes cytochrome P450 protein sequences from plants, generating embeddings and alignments for downstream analysis.
- Downloads protein sequences from UniProt
- Validates FASTA and TSV files
- Performs data quality checks
- Generates multiple sequence alignments (MSA) using MAFFT
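The FASTA validation step above could look roughly like the following. This is a minimal, dependency-free sketch and not the project's actual checks: the function names `parse_fasta` and `validate_records`, the accepted residue alphabet, and the specific checks (duplicate headers, empty sequences, unexpected characters) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: the 20 standard amino acids plus common ambiguity/rare codes.
# The project's own validation may accept a different alphabet.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBXZUO")


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA-formatted text into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def validate_records(records: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each problematic header to its list of issues (empty dict = clean)."""
    problems: Dict[str, List[str]] = {}
    seen = set()
    for header, seq in records:
        issues = []
        if header in seen:
            issues.append("duplicate header")
        seen.add(header)
        if not seq:
            issues.append("empty sequence")
        invalid = sorted(set(seq) - VALID_RESIDUES)
        if invalid:
            issues.append("invalid residues: " + "".join(invalid))
        if issues:
            problems.setdefault(header, []).extend(issues)
    return problems
```

Calling `validate_records(parse_fasta(open(path).read()))` on a downloaded file would return an empty dictionary when none of these three checks fail.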
Generates protein sequence embeddings using multiple models:
- ESM2 (residue-level)
- ESM3 (residue-level)
- ESM-MSA-1b (residue-level)
- Prot-T5 (residue-level)
- Self-attention encoding
- Relative position encoding
- Cross-attention fusion models
- Model perturbation analysis
- Model evaluation and metrics
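The models listed above produce residue-level embeddings, that is, one vector per amino acid, so proteins of different lengths yield matrices of different shapes. One common way to get a fixed-length vector per protein is mean pooling over the residue axis. The sketch below illustrates that idea only; the function name `mean_pool` is hypothetical and the project's pooling scripts may work differently (for example on NumPy arrays, where the same operation is `array.mean(axis=0)`).

```python
from typing import List, Sequence


def mean_pool(residue_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors into one sequence-level vector.

    residue_embeddings has shape (sequence_length, embedding_dim).
    """
    if not residue_embeddings:
        raise ValueError("cannot pool an empty sequence")
    dim = len(residue_embeddings[0])
    totals = [0.0] * dim
    for row in residue_embeddings:
        if len(row) != dim:
            raise ValueError("all residue vectors must have the same dimension")
        for i, value in enumerate(row):
            totals[i] += value
    length = len(residue_embeddings)
    return [total / length for total in totals]
```

Mean pooling discards positional information, which is presumably why the encoding and fusion stages operate on residue-level representations instead.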
.
├── data/
│   ├── raw/  # Raw UniProt data
│   ├── models/  # Model files
│   ├── msa/  # Multiple sequence alignments
│   └── embeddings/
│       ├── msa/  # MSA-based embeddings
│       ├── sequence/  # Sequence-based embeddings
│       │   ├── esm2/  # ESM2 embeddings
│       │   └── prot_t5/  # Prot-T5 embeddings
│       └── structure/  # Structure-based embeddings
│           └── esm3/  # ESM3 embeddings
├── envs/  # Conda environment configurations
├── external_repos/  # External repositories
├── logs/  # Pipeline execution logs
├── miscellaneous/
│   ├── images/  # Project documentation images
│   └── trash/  # Temporary files
├── models/  # Trained models
├── pipelines/  # Pipeline scripts
│   ├── 00_data_preparation/  # Data preparation pipeline
│   ├── 01_embedding/  # Embedding generation pipeline
│   ├── 02_encoding/  # Encoding pipeline
│   ├── 03_fusion_training/  # Fusion training pipeline
│   ├── 04_perturbation/  # Perturbation analysis pipeline
│   └── 05_evaluation/  # Evaluation pipeline
├── results/  # Analysis results
│   ├── impact_scores/  # Impact analysis scores
│   ├── metrics/  # Performance metrics
│   └── visualizations/  # Visualization outputs
│       ├── dendograms/  # Hierarchical clustering dendrograms
│       │   ├── sequence/  # Sequence-based dendrograms
│       │   │   ├── esm2/  # ESM2 dendrograms
│       │   │   ├── esm3/  # ESM3 dendrograms
│       │   │   └── prot-t5/  # Prot-T5 dendrograms
│       │   └── structure/  # Structure-based dendrograms
│       ├── knn_graph/  # K-nearest neighbor graphs
│       ├── msa/  # MSA visualizations
│       └── umap/  # UMAP clustering visualizations
│           ├── sequence/  # Sequence-based UMAP
│           └── structure/  # Structure-based UMAP
└── scripts/  # Utility scripts
    ├── data/  # Data processing scripts
    │   ├── download_uniprot_data.py
    │   ├── data_checks.ipynb
    │   └── refine_outcome_variable.py
    ├── embeddings/  # Embedding extraction scripts
    │   ├── extract_esm2_residue_embeddings.py
    │   ├── extract_esm3_residue_embeddings.py
    │   ├── extract_esm_msa1b_embeddings.py
    │   ├── extract_esm-msa-1b_residue_embeddings.py
    │   ├── extract_mean_pool_esm2_residue_embeddings.py
    │   └── extract_prot-t5_residue_embeddings.py
    ├── encoding/  # Encoding scripts
    │   ├── self_attention_encoder.py
    │   └── relative_position_encoder.py
    ├── fusion/  # Fusion model scripts
    │   └── cross_attention_fusion.py
    ├── modeling/  # Model initialization scripts
    │   └── initialize_base_models.py
    ├── msa/  # MSA generation scripts
    │   └── msa_mafft.py
    ├── pooling/  # Embedding pooling scripts
    │   └── pool_embeddings.py
    ├── setup/  # Environment setup scripts
    │   └── initialize_envs.sh
    └── visualization/  # Visualization scripts
        ├── mutual_knn_graph.py
        ├── UMAP_clustering.py
        ├── run_umap_clustering.sh
        ├── visualize_hierarchical_dendogram.py
        ├── visualize_msa.py
        └── view_visualizations.ipynb
- Clone the repository:
  git clone https://github.com/VibhavSetlur/ML_Enzyme_Function.git
  cd ML_Enzyme_Function
- Initialize conda environments:
  cd scripts/setup
  chmod +x ./initialize_envs.sh
  ./initialize_envs.sh
- Verify installation:
  conda env list
# Run complete pipeline
python pipelines/00_data_preparation/data_preparation.py
# Force re-download data
python pipelines/00_data_preparation/data_preparation.py --force-download
# Skip validation
python pipelines/00_data_preparation/data_preparation.py --skip-validation
# Skip MSA generation
python pipelines/00_data_preparation/data_preparation.py --skip-msa

# Generate all embeddings
python pipelines/01_embedding/embed_sequences.py
# Run specific models
python pipelines/01_embedding/embed_sequences.py --models esm2 esm3
# Force regeneration
python pipelines/01_embedding/embed_sequences.py --force

The project uses separate conda environments for different components:
- data_env: Data preparation and validation
- esm2_env: ESM2 model embeddings
- esm3_env: ESM3 model embeddings
- esm_msa_1b_env: ESM-MSA-1b model embeddings
- msa_mafft_env: MSA generation with MAFFT
- pooling_env: Embedding pooling and processing
- prot_t5_env: Prot-T5 model embeddings
- visualization_env: Visualization and analysis
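Because each model lives in its own environment, a driver script has to launch every extraction step under the matching environment. The sketch below shows one way to do that with `conda run -n <env>`. It is an illustration under stated assumptions, not the pipeline's actual dispatch logic: the helper `build_conda_command` and the model keys `esm_msa_1b` and `prot_t5` are hypothetical, while the environment names come from the list above.

```python
from typing import List, Sequence

# Model key -> conda environment. Environment names are taken from the list
# above; the keys "esm_msa_1b" and "prot_t5" are assumed for illustration.
MODEL_ENVS = {
    "esm2": "esm2_env",
    "esm3": "esm3_env",
    "esm_msa_1b": "esm_msa_1b_env",
    "prot_t5": "prot_t5_env",
}


def build_conda_command(
    model: str, script: str, extra_args: Sequence[str] = ()
) -> List[str]:
    """Build an argument list that runs `script` inside the model's environment."""
    try:
        env = MODEL_ENVS[model]
    except KeyError:
        raise ValueError(
            f"unknown model {model!r}; expected one of {sorted(MODEL_ENVS)}"
        ) from None
    # `conda run -n ENV CMD...` executes CMD inside the named environment.
    return ["conda", "run", "-n", env, "python", script, *extra_args]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Keeping command construction separate from execution makes the mapping easy to test without conda installed.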
All pipeline executions are logged in the logs/ directory:
- data_preparation.log: Data preparation pipeline logs
- embedding_pipeline.log: Embedding generation logs
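A per-pipeline log file like the ones above can be produced with Python's standard `logging` module. The following is a minimal sketch, assuming each pipeline writes to `logs/<name>.log`; the helper name `get_pipeline_logger` and the message format are hypothetical and may not match what the pipelines actually do.

```python
import logging
from pathlib import Path


def get_pipeline_logger(name: str, log_dir: str = "logs") -> logging.Logger:
    """Return a logger that appends to <log_dir>/<name>.log."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Attach the file handler only once, so repeated calls do not
    # duplicate every log line.
    if not logger.handlers:
        handler = logging.FileHandler(Path(log_dir) / f"{name}.log")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

For example, `get_pipeline_logger("data_preparation").info("download finished")` would append a timestamped line to `logs/data_preparation.log`.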
- Create a new branch for your feature
- Make your changes
- Submit a pull request
[Add license information]
[Add contact information]
