A deep learning framework for molecular property prediction using functional group representations 🧬⚡
The Supplementary Information for the paper can be found at this link.
FGMolProp is a deep learning framework for predicting molecular properties using functional group representations. Unlike traditional molecular representations (SMILES, graphs), this approach decomposes molecules into their constituent functional groups, enabling more interpretable and efficient property prediction.
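As a toy illustration of the idea (this is not FGMolProp's own decomposition pipeline, which lives under `src/data/`), a molecule can be mapped to a vector of functional-group counts using RDKit's fragment descriptors:

```python
# Illustrative only: map a molecule to a few functional-group counts.
# FGMolProp's actual decomposition is implemented in src/data.
from rdkit import Chem
from rdkit.Chem import Fragments

def fg_counts(smiles: str) -> dict:
    """Count a handful of common functional groups in a molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return {
        "amide": Fragments.fr_amide(mol),
        "ether": Fragments.fr_ether(mol),
        "ketone": Fragments.fr_ketone(mol),
        "benzene": Fragments.fr_benzene(mol),
        "halogen": Fragments.fr_halogen(mol),
    }

print(fg_counts("CC(=O)Nc1ccc(O)cc1"))  # paracetamol
```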
Key Features:
- 🧬 Functional Group Decomposition: Novel representation learning using molecular functional groups
- ⚡ Lightning Fast: Built on PyTorch Lightning for scalable training
- 🎯 Multi-Dataset Support: 35+ molecular property datasets including BBBP, BACE, HIV, QM7/8/9
- 🔧 Flexible Configuration: Hydra-based configuration system for easy experimentation
- 📊 Comprehensive Metrics: Built-in evaluation metrics for classification and regression tasks
- 🚀 Pre-training Support: Self-supervised pre-training for improved performance
Supported Datasets:
- ADMET: BBBP, BACE, ClinTox, ESOL, FreeSolv, Lipophilicity, SIDER, Tox21, ToxCast
- Quantum Mechanics: QM7, QM8, QM9
- Biochemical: HIV, MUV, PCBA, ChEMBL
- Custom: Easy integration of new datasets (see the sketch below)
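As an example of the last point, adding a dataset is typically just a new Hydra config; a hypothetical sketch (the file name, `dataset` value, and comments are placeholders — field names mirror the default data config shown later):

```yaml
# configs/data/my_dataset.yaml -- hypothetical example
_target_: src.data.datamodules.FGRDataModule
data_dir: "${paths.data_dir}"
dataset: "MyDataset"     # name of your dataset
method: "FG"             # functional-group representation
split_type: "scaffold"
batch_size: 16
```

It could then be selected with `python src/train.py data=my_dataset`.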
Requirements:
- Python 3.8+ (3.10 recommended)
- CUDA-capable GPU (optional but recommended)
Installation with pip:

```bash
# Clone the repository
git clone https://github.com/roshanmsb/fgmolprop.git
cd fgmolprop
# Create and activate a virtual environment
python -m venv fgmolprop_env
source fgmolprop_env/bin/activate # On Windows: fgmolprop_env\Scripts\activate
# Install PyTorch (choose appropriate version for your system)
# For CUDA 11.8
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# For CPU only
# pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
# Install project dependencies
pip install -r requirements.txt
# Install the project in development mode
pip install -e .
```

Installation with conda:

```bash
# Clone the repository
git clone https://github.com/roshanmsb/fgmolprop.git
cd fgmolprop
# Create conda environment
conda env create -f environment.yaml -n fgmolprop
conda activate fgmolprop
# Install the project
pip install -e .
```

Verify the installation:

```bash
# Test the installation
python -c "import src; print('Installation successful!')"
# Check available datasets and experiments
python src/train.py --help
# Verify basic module imports
python -c "
import torch
import lightning as L
import hydra
print(f'PyTorch: {torch.__version__}')
print(f'Lightning: {L.__version__}')
print('All core dependencies loaded successfully!')
"Train on BBBP dataset with default configuration:
# Basic training on CPU
python src/train.py trainer=cpu
# Training on GPU (default)
python src/train.py
# Training with specific dataset
python src/train.py data.dataset=BACE
# Training with custom batch size
python src/train.py data.batch_size=32
```

The framework includes pre-configured experiments for all supported datasets:

```bash
# Train on BBBP (Blood-Brain Barrier Penetration)
python src/train.py experiment=bbbp
# Train on BACE (Beta-secretase inhibition)
python src/train.py experiment=bace
# Train on HIV activity prediction
python src/train.py experiment=hiv
# Train on QM9 quantum mechanical properties
python src/train.py experiment=qm9
# Train on Tox21 toxicity prediction
python src/train.py experiment=tox21
```

Cross-validation:

```bash
# 5-fold cross-validation (default)
python src/train.py
# Custom number of folds
python src/train.py n_folds=10
# Train specific fold
python src/train.py data.fold_idx=2
```

Representation options:

```bash
# Functional Groups (default)
python src/train.py data.method=FG
# With molecular descriptors
python src/train.py data.descriptors=true
# Different tokenization datasets
python src/train.py data.tokenize_dataset=chembl
```

Advanced training options:

```bash
# Multi-GPU training
python src/train.py trainer=ddp trainer.devices=4
# Mixed precision training
python src/train.py trainer=gpu +trainer.precision=16
# Custom learning rate and optimizer
python src/train.py model.optimizer.lr=0.0001
# Enable model compilation (PyTorch 2.0+)
python src/train.py model.compile=true
# Training with weights & biases logging
python src/train.py logger=wandb
# Debug mode (fast training for testing)
python src/train.py debug=default
```

FGMolProp supports self-supervised pre-training for improved performance:

```bash
# Pre-train on a large unlabeled dataset
python src/pretrain.py data=pretrain
# Pre-train with custom configuration
python src/pretrain.py data=pretrain model=pretrain trainer.max_epochs=100
# Fine-tune from pre-trained checkpoint
python src/train.py experiment=bbbp ckpt_path="path/to/pretrained/model.ckpt"
```
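Pre-training objectives of this kind usually reconstruct the functional-group vector through a bottleneck autoencoder. Purely as an illustration (the real architecture is `src.models.components.autoencoder.FGRModel`, configured in the next section, and the exact objective may differ), a minimal PyTorch sketch with the same layer widths as the default config:

```python
# Minimal sketch of a bottleneck autoencoder over functional-group vectors.
# Layer widths mirror configs/model/default.yaml (2048 -> 1024 -> 256); the
# project's real architecture is src.models.components.autoencoder.FGRModel.
import torch
import torch.nn as nn

class TinyFGAutoencoder(nn.Module):
    def __init__(self, n_groups, h1=2048, h2=1024, z=256, p_drop=0.1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_groups, h1), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(h1, h2), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(h2, z),
        )
        self.decoder = nn.Sequential(
            nn.Linear(z, h2), nn.ReLU(),
            nn.Linear(h2, h1), nn.ReLU(),
            nn.Linear(h1, n_groups),
        )

    def forward(self, x):
        z = self.encoder(x)  # bottleneck embedding reused for downstream tasks
        return self.decoder(z), z

# Reconstruction objective on binary functional-group indicator vectors.
model = TinyFGAutoencoder(n_groups=4096)
x = (torch.rand(8, 4096) > 0.9).float()
logits, z = model(x)
loss = nn.functional.binary_cross_entropy_with_logits(logits, x)
```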
Modify `configs/data/default.yaml` or create new dataset configs:

```yaml
_target_: src.data.datamodules.FGRDataModule
data_dir: "${paths.data_dir}"
dataset: "BBBP" # Dataset name
method: "FG" # Representation method
descriptors: true # Include molecular descriptors
tokenize_dataset: "pubchem" # Tokenization reference
frequency: 500 # Minimum functional group frequency
split_type: "scaffold" # Data splitting method
batch_size: 16 # Training batch size
num_workers: 4 # Data loading workers
```
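The `split_type: "scaffold"` setting partitions molecules by their Bemis–Murcko scaffold, so train and test sets do not share core ring systems. A rough illustration with RDKit (not the project's own splitter):

```python
# Rough illustration of scaffold splitting (not the project's own splitter):
# molecules that share a Bemis-Murcko scaffold never straddle the split.
from collections import defaultdict
from rdkit.Chem.Scaffolds.MurckoScaffold import MurckoScaffoldSmiles

def scaffold_split(smiles_list, test_frac=0.2):
    buckets = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        buckets[MurckoScaffoldSmiles(smiles=smi)].append(i)
    train, test = [], []
    n_train = int((1 - test_frac) * len(smiles_list))
    # Large scaffold families fill the train set; held-out scaffolds form the test set.
    for group in sorted(buckets.values(), key=len, reverse=True):
        (train if len(train) < n_train else test).extend(group)
    return train, test

train_idx, test_idx = scaffold_split(["CCO", "c1ccccc1O", "c1ccccc1N", "CCN"])
```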
Modify `configs/model/default.yaml`:

```yaml
_target_: src.models.fgr_module.FGRLitModule
# Optimizer settings
optimizer:
  _target_: src.models.components.losses.SAM  # Sharpness-Aware Minimization
  lr: 0.001
  weight_decay: 0.001
  rho: 0.05

# Model architecture
net:
  _target_: src.models.components.autoencoder.FGRModel
  hidden_dim1: 2048
  hidden_dim2: 1024
  bottleneck_dim: 256
  dropout: 0.1
activation: "relu"| Dataset | Task Type | Description | Molecules |
Supported benchmark datasets:

| Dataset | Task Type | Description | Molecules |
|---|---|---|---|
| BBBP | Classification | Blood-brain barrier penetration | 2,039 |
| BACE | Classification | Beta-secretase inhibition | 1,513 |
| ClinTox | Classification | Clinical toxicity | 1,478 |
| ESOL | Regression | Aqueous solubility | 1,128 |
| FreeSolv | Regression | Hydration free energy | 642 |
| HIV | Classification | HIV replication inhibition | 41,127 |
| Lipophilicity | Regression | Octanol/water distribution | 4,200 |
| SIDER | Classification | Adverse drug reactions | 1,427 |
| Tox21 | Classification | Toxicity on 12 targets | 7,831 |
| QM7 | Regression | Quantum mechanical properties | 7,165 |
| QM8 | Regression | Electronic spectra | 21,786 |
| QM9 | Regression | Quantum properties | 133,885 |
The framework includes pre-configured experiments for various molecular property datasets. Each experiment can be run with different molecular representation methods:
Representation Methods:
- `FG`: Functional Groups only
- `FGR`: Functional Groups + Molecular Descriptors
- `MFG`: Modified Functional Groups
Pre-configured Datasets:
- ADMET: `bbbp`, `bace`, `clintox`, `esol`, `freesolv`, `lipop`, `sider`, `tox21`
- Quantum: `qm7`, `qm8`, `qm9`
- Biochemical: `hiv`, `muv`, `pcba`
- Cell Lines: `1625_aa`, `a2780`, `ccrf`, `du145`, `hct15`, `kb`, `lovo`, `pc3`, `skov3`
- Chemical Libraries: `chembl`, `pdbcore`, `pdbfull`, `pdbrefined`
- Others: `cyp`, `malaria`, `mpro`, `schilling`
Evaluation:

```bash
# Evaluate trained model
python src/eval.py ckpt_path="path/to/checkpoint.ckpt"
# Evaluate on specific dataset
python src/eval.py ckpt_path="path/to/checkpoint.ckpt" data.dataset=BACE
# Generate predictions
python src/eval.py ckpt_path="path/to/checkpoint.ckpt" data.dataset=test_data# Configure W&B in configs/logger/wandb.yaml
wandb:
  project: "fgmolprop"
  entity: "your_username"
```

```bash
# Train with W&B logging
python src/train.py logger=wandb

# Train with TensorBoard
python src/train.py logger=tensorboard
# View logs
tensorboard --logdir logs/

# Simple CSV logging
python src/train.py logger=csv
```

Hyperparameter search:

```bash
# Optuna-based hyperparameter search
python src/train.py -m hparams_search=optuna experiment=bbbp
# Manual parameter sweeps
python src/train.py -m model.optimizer.lr=0.001,0.0001,0.00001 data.batch_size=16,32,64
# Grid search over multiple parameters
python src/train.py -m model.net.hidden_dim1=1024,2048 model.net.dropout=0.1,0.2,0.3
```
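The `-m hparams_search=optuna` run is driven by Hydra's Optuna sweeper plugin. As a hedged sketch, a search-space config under `configs/hparams_search/` might look roughly like this (the repository's actual file may differ):

```yaml
# @package _global_
# Hypothetical sketch of configs/hparams_search/optuna.yaml; the real file may differ.
defaults:
  - override /hydra/sweeper: optuna

hydra:
  sweeper:
    direction: maximize            # maximize the monitored validation metric
    n_trials: 25
    params:
      model.optimizer.lr: interval(0.00001, 0.01)
      model.net.dropout: interval(0.0, 0.3)
      data.batch_size: choice(16, 32, 64)
```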
Project structure:

```
fgmolprop/
├── configs/                 # Hydra configuration files
│   ├── callbacks/           # Training callbacks
│   ├── data/                # Dataset configurations
│   ├── experiment/          # Pre-defined experiments
│   ├── logger/              # Logging configurations
│   ├── model/               # Model configurations
│   ├── trainer/             # Training configurations
│   └── train.yaml           # Main training config
├── data/                    # Data directory
├── logs/                    # Training logs and checkpoints
├── notebooks/               # Jupyter notebooks for analysis
├── src/                     # Source code
│   ├── data/                # Data loading and processing
│   ├── models/              # Model architectures and components
│   ├── utils/               # Utility functions
│   ├── train.py             # Training script
│   ├── eval.py              # Evaluation script
│   └── pretrain.py          # Pre-training script
├── tests/                   # Test files
├── requirements.txt         # Python dependencies
├── environment.yaml         # Conda environment
└── README.md                # This file
```
Troubleshooting:

Missing dependencies:

```bash
# Install missing packages
pip install wandb rootutils rich
# Or install all requirements at once
pip install -r requirements.txt
```

CUDA out of memory:

```bash
# Reduce batch size
python src/train.py data.batch_size=8
# Enable gradient checkpointing
python src/train.py model.gradient_checkpointing=true
```

Dataset not found:

```bash
# Check data directory
python src/train.py paths.data_dir="/path/to/your/data"
# Download datasets automatically (if supported)
python src/train.py data.download=true
```

Import errors:

```bash
# Reinstall in development mode
pip install -e .
# Check environment
python -c "import src; print('OK')"Configuration errors:
# Use + to add new parameters
python src/train.py +trainer.fast_dev_run=1
# Check available configurations
python src/train.py --help
```

Slow training:

```bash
# Increase number of workers
python src/train.py data.num_workers=8
# Enable model compilation
python src/train.py model.compile=true
# Use mixed precision
python src/train.py trainer=gpu +trainer.precision=16
```

Performance tips:

- Use appropriate batch size: Start with 16-32, adjust based on GPU memory
- Enable mixed precision: Add `+trainer.precision=16` for faster training
- Use multiple workers: Set `data.num_workers=4` or higher
- Model compilation: Enable `model.compile=true` with PyTorch 2.0+
- GPU selection: Use `trainer.devices=[0]` to specify GPU
If you use FGMolProp in your research, please cite:
```bibtex
@misc{balaji2025functionalgroupsneedchemically,
title={Functional Groups are All you Need for Chemically Interpretable Molecular Property Prediction},
author={Roshan Balaji and Joe Bobby and Nirav Pravinbhai Bhatt},
year={2025},
eprint={2509.09619},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2509.09619},
}
```

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with PyTorch Lightning and Hydra
- Molecular data processing with RDKit
- Inspired by advances in molecular representation learning and functional group analysis
Happy molecular property prediction! 🧬⚡