ANL ML Enzyme Function

ANL Summer 2025 Research Project

Project Overview

This project analyzes protein sequence divergence using protein language models and multiple sequence alignment techniques. The pipeline processes cytochrome P450 protein sequences from plants, generating embeddings and alignments for downstream analysis.

Project Workflow

[Figure: project workflow diagram]

Pipeline Components

1. Data Preparation (pipelines/00_data_preparation/)

  • Downloads protein sequences from UniProt
  • Validates FASTA and TSV files
  • Performs data quality checks
  • Generates multiple sequence alignments (MSA) using MAFFT
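The FASTA validation step could look roughly like the sketch below. It is not the repository's code: the function names `parse_fasta` and `find_problems` and the accepted residue alphabet are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed alphabet: the 20 standard amino acids plus the ambiguity and
# rare codes (B, J, O, U, X, Z) that UniProt sequences may contain.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBJOUXZ")


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records: Dict[str, List[str]] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = []
        elif header is None:
            raise ValueError("sequence data before first header")
        else:
            records[header].append(line.upper())
    return {h: "".join(parts) for h, parts in records.items()}


def find_problems(records: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (header, problem) pairs for empty or malformed sequences."""
    problems = []
    for header, seq in records.items():
        if not seq:
            problems.append((header, "empty sequence"))
            continue
        bad = sorted(set(seq) - VALID_RESIDUES)
        if bad:
            problems.append((header, "invalid characters: " + "".join(bad)))
    return problems
```

An empty list from `find_problems` would mean every record passed these two checks; the actual pipeline may apply further or different checks.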

2. Sequence Embedding (pipelines/01_embedding/)

Generates protein sequence embeddings using multiple models:

  • ESM2 (residue-level)
  • ESM3 (residue-level)
  • ESM-MSA-1b (residue-level)
  • Prot-T5 (residue-level)
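Residue-level embeddings hold one vector per amino acid, so a protein of length L yields an L × D matrix. The scripts directory lists pooling scripts (for example `pool_embeddings.py`); a minimal sketch of mean pooling, assuming embeddings are plain nested lists and using the hypothetical name `mean_pool`, is:

```python
from typing import List, Sequence


def mean_pool(residue_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average an L x D residue-level matrix into one D-dimensional vector."""
    if not residue_embeddings:
        raise ValueError("cannot pool an empty sequence")
    dim = len(residue_embeddings[0])
    if any(len(row) != dim for row in residue_embeddings):
        raise ValueError("all residue vectors must share one dimension")
    n = len(residue_embeddings)
    # Column-wise mean over the L residues.
    return [sum(row[j] for row in residue_embeddings) / n for j in range(dim)]
```

The repository's scripts presumably operate on tensors rather than lists; the arithmetic is the same.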

3. Encoding (pipelines/02_encoding/)

  • Self-attention encoding
  • Relative position encoding
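The README does not say which relative position scheme is used. One common variant, shown here purely as an assumption, maps each pair of residues (i, j) to a clipped offset j - i that can index a learned embedding table. The function name and the `max_distance` parameter are hypothetical.

```python
from typing import List


def relative_position_index(length: int, max_distance: int) -> List[List[int]]:
    """Return a length x length matrix of bucket indices.

    Entry [i][j] is clip(j - i, -max_distance, max_distance) + max_distance,
    so every value falls in the range 0 .. 2 * max_distance.
    """
    return [
        [
            max(-max_distance, min(max_distance, j - i)) + max_distance
            for j in range(length)
        ]
        for i in range(length)
    ]
```

Clipping keeps the number of distinct buckets fixed at 2 * max_distance + 1 regardless of sequence length, which matters for proteins of varying size.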

4. Fusion Training (pipelines/03_fusion_training/)

  • Cross-attention fusion models
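As a rough illustration of cross-attention fusion, the sketch below lets each residue vector from one model (the queries) attend over residue vectors from another model (the keys and values) using scaled dot-product attention. It omits the learned projection matrices a trained model would have, and it is an assumption about the general technique, not a description of `cross_attention_fusion.py`.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> List[List[float]]:
    """Scaled dot-product attention of queries over (keys, values).

    queries: Lq x d, keys: Lk x d, values: Lk x dv. Returns Lq x dv.
    """
    scale = math.sqrt(len(keys[0]))
    out_dim = len(values[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        fused.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        )
    return fused
```

With queries taken from one embedding model and keys and values from another, each output row is a weighted mix of the second model's vectors, conditioned on the first.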

5. Perturbation (pipelines/04_perturbation/)

  • Model perturbation analysis

6. Evaluation (pipelines/05_evaluation/)

  • Model evaluation and metrics

Directory Structure

.
├── data/
│   ├── raw/                    # Raw UniProt data
│   ├── models/                 # Model files
│   ├── msa/                    # Multiple sequence alignments
│   └── embeddings/
│       ├── msa/                # MSA-based embeddings
│       ├── sequence/           # Sequence-based embeddings
│       │   ├── esm2/          # ESM2 embeddings
│       │   └── prot_t5/       # Prot-T5 embeddings
│       └── structure/          # Structure-based embeddings
│           └── esm3/          # ESM3 embeddings
├── envs/                       # Conda environment configurations
├── external_repos/             # External repositories
├── logs/                       # Pipeline execution logs
├── miscellaneous/
│   ├── images/                 # Project documentation images
│   └── trash/                  # Temporary files
├── models/                     # Trained models
├── pipelines/                  # Pipeline scripts
│   ├── 00_data_preparation/    # Data preparation pipeline
│   ├── 01_embedding/           # Embedding generation pipeline
│   ├── 02_encoding/            # Encoding pipeline
│   ├── 03_fusion_training/     # Fusion training pipeline
│   ├── 04_perturbation/        # Perturbation analysis pipeline
│   └── 05_evaluation/          # Evaluation pipeline
├── results/                    # Analysis results
│   ├── impact_scores/          # Impact analysis scores
│   ├── metrics/                # Performance metrics
│   └── visualizations/         # Visualization outputs
│       ├── dendograms/         # Hierarchical clustering dendrograms
│       │   ├── sequence/       # Sequence-based dendrograms
│       │   │   ├── esm2/       # ESM2 dendrograms
│       │   │   ├── esm3/       # ESM3 dendrograms
│       │   │   └── prot-t5/    # Prot-T5 dendrograms
│       │   └── structure/      # Structure-based dendrograms
│       ├── knn_graph/          # K-nearest neighbor graphs
│       ├── msa/                # MSA visualizations
│       └── umap/               # UMAP clustering visualizations
│           ├── sequence/       # Sequence-based UMAP
│           └── structure/      # Structure-based UMAP
└── scripts/                    # Utility scripts
    ├── data/                   # Data processing scripts
    │   ├── download_uniprot_data.py
    │   ├── data_checks.ipynb
    │   └── refine_outcome_variable.py
    ├── embeddings/             # Embedding extraction scripts
    │   ├── extract_esm2_residue_embeddings.py
    │   ├── extract_esm3_residue_embeddings.py
    │   ├── extract_esm_msa1b_embeddings.py
    │   ├── extract_esm-msa-1b_residue_embeddings.py
    │   ├── extract_mean_pool_esm2_residue_embeddings.py
    │   └── extract_prot-t5_residue_embeddings.py
    ├── encoding/               # Encoding scripts
    │   ├── self_attention_encoder.py
    │   └── relative_position_encoder.py
    ├── fusion/                 # Fusion model scripts
    │   └── cross_attention_fusion.py
    ├── modeling/               # Model initialization scripts
    │   └── initialize_base_models.py
    ├── msa/                    # MSA generation scripts
    │   └── msa_mafft.py
    ├── pooling/                # Embedding pooling scripts
    │   └── pool_embeddings.py
    ├── setup/                  # Environment setup scripts
    │   └── initialize_envs.sh
    └── visualization/          # Visualization scripts
        ├── mutual_knn_graph.py
        ├── UMAP_clustering.py
        ├── run_umap_clustering.sh
        ├── visualize_hierarchical_dendogram.py
        ├── visualize_msa.py
        └── view_visualizations.ipynb

Setup and Installation

  1. Clone the repository:

    git clone https://github.com/VibhavSetlur/ML_Enzyme_Function.git
    cd ML_Enzyme_Function

  2. Initialize conda environments:

    cd scripts/setup
    chmod +x ./initialize_envs.sh
    ./initialize_envs.sh

  3. Verify installation:

    conda env list

Usage

Data Preparation Pipeline

# Run complete pipeline
python pipelines/00_data_preparation/data_preparation.py

# Force re-download data
python pipelines/00_data_preparation/data_preparation.py --force-download

# Skip validation
python pipelines/00_data_preparation/data_preparation.py --skip-validation

# Skip MSA generation
python pipelines/00_data_preparation/data_preparation.py --skip-msa
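
The flags above could be wired up with `argparse` along the following lines. The flag names come from the commands shown; the help strings, the `planned_steps` helper, and the assumption that the download is skipped when raw data already exists are illustrative guesses, not the script's confirmed behavior.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing the three flags documented in the README."""
    parser = argparse.ArgumentParser(description="Data preparation (sketch)")
    parser.add_argument("--force-download", action="store_true",
                        help="download again even if raw data exists")
    parser.add_argument("--skip-validation", action="store_true",
                        help="skip FASTA/TSV validation")
    parser.add_argument("--skip-msa", action="store_true",
                        help="skip MAFFT alignment")
    return parser


def planned_steps(args: argparse.Namespace, raw_data_exists: bool) -> List[str]:
    """Decide which stages would run (assumed semantics)."""
    steps = []
    if args.force_download or not raw_data_exists:
        steps.append("download")
    if not args.skip_validation:
        steps.append("validate")
    if not args.skip_msa:
        steps.append("msa")
    return steps
```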

Embedding Pipeline

# Generate all embeddings
python pipelines/01_embedding/embed_sequences.py

# Run specific models
python pipelines/01_embedding/embed_sequences.py --models esm2 esm3

# Force regeneration
python pipelines/01_embedding/embed_sequences.py --force
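
A minimal sketch of how `--models` and `--force` might interact is given below. Only the keys `esm2` and `esm3` appear in the commands above; the other two keys, and the idea that existing outputs are skipped unless `--force` is given, are assumptions.

```python
import argparse
from typing import Iterable, List, Set

# "esm2" and "esm3" are taken from the README; the last two keys are guesses.
ASSUMED_MODEL_KEYS = ["esm2", "esm3", "esm_msa_1b", "prot_t5"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Embedding pipeline (sketch)")
    # nargs="+" accepts one or more space-separated model names.
    parser.add_argument("--models", nargs="+", choices=ASSUMED_MODEL_KEYS,
                        default=ASSUMED_MODEL_KEYS)
    parser.add_argument("--force", action="store_true",
                        help="regenerate embeddings that already exist")
    return parser


def models_to_run(requested: Iterable[str], already_embedded: Set[str],
                  force: bool) -> List[str]:
    """Drop models whose outputs exist unless regeneration is forced."""
    if force:
        return list(requested)
    return [m for m in requested if m not in already_embedded]
```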

Environment Management

The project uses separate conda environments for different components:

  • data_env: Data preparation and validation
  • esm2_env: ESM2 model embeddings
  • esm3_env: ESM3 model embeddings
  • esm_msa_1b_env: ESM-MSA-1b model embeddings
  • msa_mafft_env: MSA generation with MAFFT
  • pooling_env: Embedding pooling and processing
  • prot_t5_env: Prot-T5 model embeddings
  • visualization_env: Visualization and analysis
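
Because each component has its own environment, a driver script needs some way to launch a step inside the right one. One option, sketched here as an assumption about how this could be done, is to build a `conda run -n <env>` command. The environment names are from the list above; the dictionary keys and the function name are hypothetical.

```python
from typing import List

# Environment names come from the README; the component keys are illustrative.
ENV_FOR_COMPONENT = {
    "data": "data_env",
    "esm2": "esm2_env",
    "esm3": "esm3_env",
    "esm_msa_1b": "esm_msa_1b_env",
    "msa": "msa_mafft_env",
    "pooling": "pooling_env",
    "prot_t5": "prot_t5_env",
    "visualization": "visualization_env",
}


def conda_run_command(component: str, script: str, *script_args: str) -> List[str]:
    """Build the argument list for running a script in a component's env."""
    env = ENV_FOR_COMPONENT[component]
    return ["conda", "run", "-n", env, "python", script, *script_args]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building it separately keeps the mapping easy to inspect without launching anything.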

Logging

All pipeline executions are logged in the logs/ directory:

  • data_preparation.log: Data preparation pipeline logs
  • embedding_pipeline.log: Embedding generation logs
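
A per-pipeline log file of this kind can be set up with the standard `logging` module. The helper below is a sketch under assumptions: the function name, the log format, and the INFO level are not taken from the repository.

```python
import logging
from pathlib import Path


def get_pipeline_logger(name: str, log_dir: str = "logs") -> logging.Logger:
    """Return a logger that appends to <log_dir>/<name>.log."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Attach the file handler only once, even if called repeatedly.
    if not logger.handlers:
        handler = logging.FileHandler(Path(log_dir) / f"{name}.log")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling `get_pipeline_logger("data_preparation").info("download complete")` would append a timestamped line to `logs/data_preparation.log`.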

Contributing

  1. Create a new branch for your feature
  2. Make your changes
  3. Submit a pull request

License

[Add license information]

Contact

[Add contact information]
