FGMolProp: Functional Group-based Molecular Property Prediction

A deep learning framework for molecular property prediction using functional group representations 🧬⚑


Supplementary Information

The Supplementary Information for the paper can be found at this link.

πŸ“Œ Overview

FGMolProp is a deep learning framework for predicting molecular properties using functional group representations. Unlike traditional molecular representations (SMILES, graphs), this approach decomposes molecules into their constituent functional groups, enabling more interpretable and efficient property prediction.
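
To make the bag-of-functional-groups idea concrete, here is a toy, stdlib-only sketch that counts a few hand-picked group patterns in a SMILES string and builds a fixed-length count vector. This is only an illustration of the representation: the group vocabulary and the substring matching here are stand-ins, not the project's actual decomposition (which uses proper substructure matching, e.g. via RDKit).

```python
# Toy bag-of-functional-groups featurizer. Substring counting on SMILES is
# NOT chemically correct matching; it only illustrates the idea of turning
# a molecule into a fixed-length vector of functional-group counts.
GROUPS = {
    "carboxylic_acid": "C(=O)O",
    "carbonyl": "C=O",
    "amine": "N",
}

def fg_vector(smiles):
    """Return a count vector over the toy group vocabulary."""
    return [smiles.count(pattern) for pattern in GROUPS.values()]

# Aspirin: two carboxylic-acid-like patterns in its SMILES string
vec = fg_vector("CC(=O)Oc1ccccc1C(=O)O")
print(dict(zip(GROUPS, vec)))  # -> {'carboxylic_acid': 2, 'carbonyl': 0, 'amine': 0}
```
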

Key Features:

  • 🧬 Functional Group Decomposition: Novel representation learning using molecular functional groups
  • ⚑ Lightning Fast: Built on PyTorch Lightning for scalable training
  • 🎯 Multi-Dataset Support: 35+ molecular property datasets including BBBP, BACE, HIV, QM7/8/9
  • πŸ”§ Flexible Configuration: Hydra-based configuration system for easy experimentation
  • πŸ“Š Comprehensive Metrics: Built-in evaluation metrics for classification and regression tasks
  • πŸš€ Pre-training Support: Self-supervised pre-training for improved performance

Supported Datasets:

  • ADMET: BBBP, BACE, ClinTox, ESOL, FreeSolv, Lipophilicity, SIDER, Tox21, ToxCast
  • Quantum Mechanics: QM7, QM8, QM9
  • Biochemical: HIV, MUV, PCBA, ChEMBL
  • Custom: Easy integration of new datasets

πŸš€ Installation

Prerequisites

  • Python 3.8+ (3.10 recommended)
  • CUDA-capable GPU (optional but recommended)

Option 1: Using pip (Recommended)

# Clone the repository
git clone https://github.com/roshanmsb/fgmolprop.git
cd fgmolprop

# Create and activate a virtual environment
python -m venv fgmolprop_env
source fgmolprop_env/bin/activate  # On Windows: fgmolprop_env\Scripts\activate

# Install PyTorch (choose appropriate version for your system)
# For CUDA 11.8
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

# For CPU only
# pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu

# Install project dependencies
pip install -r requirements.txt

# Install the project in development mode
pip install -e .

Option 2: Using conda

# Clone the repository
git clone https://github.com/roshanmsb/fgmolprop.git
cd fgmolprop

# Create conda environment
conda env create -f environment.yaml -n fgmolprop
conda activate fgmolprop

# Install the project
pip install -e .

Verify Installation

# Test the installation
python -c "import src; print('Installation successful!')"

# Check available datasets and experiments
python src/train.py --help

# Verify basic module imports
python -c "
import torch
import lightning as L
import hydra
print(f'PyTorch: {torch.__version__}')
print(f'Lightning: {L.__version__}')
print('All core dependencies loaded successfully!')
"

🎯 Training

Quick Start

Train on BBBP dataset with default configuration:

# Basic training on CPU
python src/train.py trainer=cpu

# Training on GPU (default)
python src/train.py

# Training with specific dataset
python src/train.py data.dataset=BACE

# Training with custom batch size
python src/train.py data.batch_size=32

Available Experiments

The framework includes pre-configured experiments for all supported datasets:

# Train on BBBP (Blood-Brain Barrier Penetration)
python src/train.py experiment=bbbp

# Train on BACE (Beta-secretase inhibition)
python src/train.py experiment=bace

# Train on HIV activity prediction
python src/train.py experiment=hiv

# Train on QM9 quantum mechanical properties
python src/train.py experiment=qm9

# Train on Tox21 toxicity prediction
python src/train.py experiment=tox21

Cross-Validation Training

# 5-fold cross-validation (default)
python src/train.py

# Custom number of folds
python src/train.py n_folds=10

# Train specific fold
python src/train.py data.fold_idx=2

Training with Different Molecular Representations

# Functional Groups (default)
python src/train.py data.method=FG

# With molecular descriptors
python src/train.py data.descriptors=true

# Different tokenization datasets
python src/train.py data.tokenize_dataset=chembl

Advanced Training Options

# Multi-GPU training
python src/train.py trainer=ddp trainer.devices=4

# Mixed precision training
python src/train.py trainer=gpu +trainer.precision=16

# Custom learning rate and optimizer
python src/train.py model.optimizer.lr=0.0001

# Enable model compilation (PyTorch 2.0+)
python src/train.py model.compile=true

# Training with weights & biases logging
python src/train.py logger=wandb

# Debug mode (fast training for testing)
python src/train.py debug=default

πŸ”¬ Pre-training

FGMolProp supports self-supervised pre-training for improved performance:

# Pre-train on a large unlabeled dataset
python src/pretrain.py data=pretrain

# Pre-train with custom configuration
python src/pretrain.py data=pretrain model=pretrain trainer.max_epochs=100

# Fine-tune from pre-trained checkpoint
python src/train.py experiment=bbbp ckpt_path="path/to/pretrained/model.ckpt"
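
A common self-supervised objective for autoencoder-style pre-training is to corrupt the input functional-group vector and train the model to reconstruct the original. The stdlib sketch below illustrates that loop (corruption plus binary cross-entropy reconstruction loss); the project's actual pre-training objective and corruption scheme may differ.

```python
import math
import random

def corrupt(x, p=0.3, rng=random.Random(0)):
    """Randomly zero out entries of a binary FG vector (input corruption)."""
    return [0 if rng.random() < p else v for v in x]

def bce(target, pred, eps=1e-7):
    """Mean binary cross-entropy between target bits and predicted probabilities."""
    return -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for t, p in zip(target, pred)
    ) / len(target)

x = [1, 0, 1, 1, 0, 1]                        # original FG indicator vector
x_noisy = corrupt(x)                          # the model sees the corrupted input ...
pred = [0.9 if v else 0.1 for v in x_noisy]   # stand-in for decoder output probabilities
loss = bce(x, pred)                           # ... but must reconstruct the original
```
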

βš™οΈ Configuration

Dataset Configuration

Modify configs/data/default.yaml or create new dataset configs:

_target_: src.data.datamodules.FGRDataModule
data_dir: "${paths.data_dir}"
dataset: "BBBP"           # Dataset name
method: "FG"              # Representation method
descriptors: true         # Include molecular descriptors
tokenize_dataset: "pubchem"  # Tokenization reference
frequency: 500            # Minimum functional group frequency
split_type: "scaffold"    # Data splitting method
batch_size: 16           # Training batch size
num_workers: 4           # Data loading workers
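
The `_target_` key tells Hydra which class to build from the config, with the remaining keys passed as constructor arguments. A minimal stdlib sketch of that mechanism (demonstrated on a standard-library class rather than the project's `FGRDataModule`):

```python
from importlib import import_module

def instantiate(cfg):
    """Minimal sketch of Hydra-style `_target_` instantiation: import the
    dotted path and call it with the remaining keys as keyword arguments."""
    module_path, _, cls_name = cfg["_target_"].rpartition(".")
    target = getattr(import_module(module_path), cls_name)
    kwargs = {k: v for k, v in cfg.items() if k != "_target_"}
    return target(**kwargs)

# Demo with a stdlib target; the real configs point at project classes
# such as src.data.datamodules.FGRDataModule.
delta = instantiate({"_target_": "datetime.timedelta", "hours": 1, "minutes": 30})
```
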

Model Configuration

Modify configs/model/default.yaml:

_target_: src.models.fgr_module.FGRLitModule

# Optimizer settings
optimizer:
  _target_: src.models.components.losses.SAM  # Sharpness-Aware Minimization
  lr: 0.001
  weight_decay: 0.001
  rho: 0.05

# Model architecture
net:
  _target_: src.models.components.autoencoder.FGRModel
  hidden_dim1: 2048
  hidden_dim2: 1024
  bottleneck_dim: 256
  dropout: 0.1
  activation: "relu"

Supported Datasets

| Dataset       | Task Type      | Description                     | Molecules |
|---------------|----------------|---------------------------------|-----------|
| BBBP          | Classification | Blood-brain barrier penetration | 2,039     |
| BACE          | Classification | Beta-secretase inhibition       | 1,513     |
| ClinTox       | Classification | Clinical toxicity               | 1,478     |
| ESOL          | Regression     | Aqueous solubility              | 1,128     |
| FreeSolv      | Regression     | Hydration free energy           | 642       |
| HIV           | Classification | HIV replication inhibition      | 41,127    |
| Lipophilicity | Regression     | Octanol/water distribution      | 4,200     |
| SIDER         | Classification | Adverse drug reactions          | 1,427     |
| Tox21         | Classification | Toxicity on 12 targets          | 7,831     |
| QM7           | Regression     | Quantum mechanical properties   | 7,165     |
| QM8           | Regression     | Electronic spectra              | 21,786    |
| QM9           | Regression     | Quantum properties              | 133,885   |
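
These benchmarks are evaluated with the `split_type: "scaffold"` setting from the data config, which assigns molecules sharing a scaffold to the same partition so the test set contains unseen scaffolds. A stdlib sketch of the grouping logic, with a deliberately simplified stand-in for the scaffold key (in practice a Bemis-Murcko scaffold extractor, e.g. from RDKit, plays that role):

```python
from collections import defaultdict

def scaffold_split(smiles_list, scaffold_fn, frac_train=0.8):
    """Sketch of scaffold splitting: group molecules by scaffold key and
    assign whole groups (largest first) to train until the quota is met,
    so train and test never share a scaffold."""
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        groups[scaffold_fn(smi)].append(i)
    train, test = [], []
    budget = frac_train * len(smiles_list)
    for members in sorted(groups.values(), key=len, reverse=True):
        (train if len(train) + len(members) <= budget else test).extend(members)
    return train, test

# Hypothetical scaffold key for the demo only: the 2-character SMILES prefix.
toy_scaffold = lambda smi: smi[:2]
train_idx, test_idx = scaffold_split(
    ["c1ccccc1O", "c1ccccc1N", "CCO", "CCN", "NCC"], toy_scaffold
)
```
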

Available Experiments

The framework includes pre-configured experiments for various molecular property datasets. Each experiment can be run with different molecular representation methods:

Representation Methods:

  • FG: Functional Groups only
  • FGR: Functional Groups + Molecular Descriptors
  • MFG: Modified Functional Groups

Pre-configured Datasets:

  • ADMET: bbbp, bace, clintox, esol, freesolv, lipop, sider, tox21
  • Quantum: qm7, qm8, qm9
  • Biochemical: hiv, muv, pcba
  • Cell Lines: 1625_aa, a2780, ccrf, du145, hct15, kb, lovo, pc3, skov3
  • Chemical Libraries: chembl, pdbcore, pdbfull, pdbrefined
  • Others: cyp, malaria, mpro, schilling

πŸ“Š Evaluation

# Evaluate trained model
python src/eval.py ckpt_path="path/to/checkpoint.ckpt"

# Evaluate on specific dataset
python src/eval.py ckpt_path="path/to/checkpoint.ckpt" data.dataset=BACE

# Generate predictions
python src/eval.py ckpt_path="path/to/checkpoint.ckpt" data.dataset=test_data

πŸ“ˆ Monitoring and Logging

Weights & Biases Integration

# Configure W&B in configs/logger/wandb.yaml
wandb:
  project: "fgmolprop"
  entity: "your_username"

# Train with W&B logging
python src/train.py logger=wandb

TensorBoard Logging

# Train with TensorBoard
python src/train.py logger=tensorboard

# View logs
tensorboard --logdir logs/

CSV Logging

# Simple CSV logging
python src/train.py logger=csv

πŸ” Hyperparameter Optimization

# Optuna-based hyperparameter search
python src/train.py -m hparams_search=optuna experiment=bbbp

# Manual parameter sweeps
python src/train.py -m model.optimizer.lr=0.001,0.0001,0.00001 data.batch_size=16,32,64

# Grid search over multiple parameters
python src/train.py -m model.net.hidden_dim1=1024,2048 model.net.dropout=0.1,0.2,0.3
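
Hydra's multirun flag (`-m`) expands comma-separated overrides into a cartesian product of runs. The same grid can be generated explicitly, which is occasionally useful for dispatching runs to a custom scheduler; a short stdlib sketch:

```python
from itertools import product

# Each key maps to the candidate values of one Hydra override.
grid = {
    "model.optimizer.lr": ["0.001", "0.0001", "0.00001"],
    "data.batch_size": ["16", "32", "64"],
}
# One command per point in the 3 x 3 grid, mirroring what -m would launch.
commands = [
    "python src/train.py " + " ".join(f"{k}={v}" for k, v in zip(grid, combo))
    for combo in product(*grid.values())
]
```
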

πŸ“ Project Structure

fgmolprop/
β”œβ”€β”€ configs/                    # Hydra configuration files
β”‚   β”œβ”€β”€ callbacks/             # Training callbacks
β”‚   β”œβ”€β”€ data/                  # Dataset configurations
β”‚   β”œβ”€β”€ experiment/            # Pre-defined experiments
β”‚   β”œβ”€β”€ logger/                # Logging configurations
β”‚   β”œβ”€β”€ model/                 # Model configurations
β”‚   β”œβ”€β”€ trainer/               # Training configurations
β”‚   └── train.yaml             # Main training config
β”œβ”€β”€ data/                      # Data directory
β”œβ”€β”€ logs/                      # Training logs and checkpoints
β”œβ”€β”€ notebooks/                 # Jupyter notebooks for analysis
β”œβ”€β”€ src/                       # Source code
β”‚   β”œβ”€β”€ data/                  # Data loading and processing
β”‚   β”œβ”€β”€ models/                # Model architectures and components
β”‚   β”œβ”€β”€ utils/                 # Utility functions
β”‚   β”œβ”€β”€ train.py               # Training script
β”‚   β”œβ”€β”€ eval.py                # Evaluation script
β”‚   └── pretrain.py            # Pre-training script
β”œβ”€β”€ tests/                     # Test files
β”œβ”€β”€ requirements.txt           # Python dependencies
β”œβ”€β”€ environment.yaml           # Conda environment
└── README.md                  # This file

πŸ› οΈ Troubleshooting

Common Issues

Missing dependencies:

# Install missing packages
pip install wandb rootutils rich

# Or install all requirements at once
pip install -r requirements.txt

CUDA out of memory:

# Reduce batch size
python src/train.py data.batch_size=8

# Enable gradient checkpointing
python src/train.py model.gradient_checkpointing=true

Dataset not found:

# Check data directory
python src/train.py paths.data_dir="/path/to/your/data"

# Download datasets automatically (if supported)
python src/train.py data.download=true

Import errors:

# Reinstall in development mode
pip install -e .

# Check environment
python -c "import src; print('OK')"

Configuration errors:

# Use + to add new parameters
python src/train.py +trainer.fast_dev_run=1

# Check available configurations
python src/train.py --help

Slow training:

# Increase number of workers
python src/train.py data.num_workers=8

# Enable model compilation
python src/train.py model.compile=true

# Use mixed precision
python src/train.py trainer=gpu +trainer.precision=16

Performance Tips

  1. Use appropriate batch size: Start with 16-32, adjust based on GPU memory
  2. Enable mixed precision: Add +trainer.precision=16 for faster training
  3. Use multiple workers: Set data.num_workers=4 or higher
  4. Model compilation: Enable model.compile=true with PyTorch 2.0+
  5. GPU selection: Use trainer.devices=[0] to specify GPU

πŸ“š Citation

If you use FGMolProp in your research, please cite:

@misc{balaji2025functionalgroupsneedchemically,
      title={Functional Groups are All you Need for Chemically Interpretable Molecular Property Prediction}, 
      author={Roshan Balaji and Joe Bobby and Nirav Pravinbhai Bhatt},
      year={2025},
      eprint={2509.09619},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.09619}, 
}

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

  • Built with PyTorch Lightning and Hydra
  • Molecular data processing with RDKit
  • Inspired by advances in molecular representation learning and functional group analysis


Happy molecular property prediction! 🧬⚑
