Semi-supervised learning for ADC property prediction - PyTorch Implementation
This repository contains our replication of the ADCNet model designed by its original authors. We converted the entire architecture from TensorFlow to PyTorch, achieving better test accuracy than the original implementation.
- Framework Migration: Successfully converted the complete ADCNet architecture from TensorFlow to PyTorch
- Weight Conversion: Performed complex weight conversion from TensorFlow .h5 format to PyTorch .pth format for the FG-BERT encoder
- Performance Improvement: Achieved 89% test accuracy (+2 percentage points over the 87% reported by the authors)
- Hyperparameter Consistency: Trained using identical hyperparameters as the original authors to ensure fair comparison
FG-BERT (Functional Group-BERT) is a self-supervised deep learning framework designed to enhance molecular representation learning by focusing on functional groups within molecules. Developed by idrugLab, FG-BERT leverages the Transformer architecture to pretrain on approximately 1.45 million unlabeled drug-like molecules, learning meaningful representations by masking and predicting functional groups; the pretrained model was then fine-tuned and evaluated on 44 benchmark datasets.

One of the most significant technical challenges was converting the FG-BERT encoder weights from TensorFlow to PyTorch. The FG-BERT encoder is a critical component of ADCNet, specifically designed for molecular representation learning:
- Original Format: TensorFlow .h5 weights with specific layer naming conventions
- Target Format: PyTorch .pth state dictionaries with different tensor layouts
- Key Challenges (illustrated in the conversion sketch after this list):
- Tensor dimension reordering (TensorFlow uses different conventions than PyTorch)
- Layer naming scheme conversion
- Attention mechanism weight mapping
- Embedding layer parameter alignment
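As an illustration, here is a minimal sketch of such a conversion. The key-mapping rules, the target file name fgbert_encoder.pth, and the traversal via h5py's visititems are our assumptions: the real layer names inside bert_weightsMedium_20.h5 and the target PyTorch module names must be inspected and mapped individually.

```python
# Hypothetical conversion sketch, NOT the exact mapping used for FG-BERT:
# the real TF layer names and PyTorch module names must be inspected first.
import h5py
import torch

def convert_fgbert_weights(h5_path, pth_path):
    state_dict = {}
    with h5py.File(h5_path, "r") as f:
        def visit(name, obj):
            if isinstance(obj, h5py.Dataset):
                tensor = torch.from_numpy(obj[()])
                # TF Dense kernels are stored (in_features, out_features);
                # PyTorch nn.Linear expects (out_features, in_features).
                if name.endswith("kernel:0") and tensor.dim() == 2:
                    tensor = tensor.t().contiguous()
                # Illustrative key translation from TF to PyTorch naming.
                key = (name.replace("/", ".")
                           .replace("kernel:0", "weight")
                           .replace("bias:0", "bias"))
                state_dict[key] = tensor
        f.visititems(visit)
    torch.save(state_dict, pth_path)

convert_fgbert_weights("medium3_weights/bert_weightsMedium_20.h5",
                       "Weights/fgbert_encoder.pth")  # placeholder output name
```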
The FG-BERT encoder serves as the backbone for understanding molecular structures and relationships within the ADC (Antibody-Drug Conjugate) context. Our conversion preserves all pre-trained knowledge from the original model while gaining access to PyTorch's ecosystem and tooling.
We conducted an in-depth analysis of the ADCNet architecture to ensure accurate replication:
- Multi-Head Attention Mechanisms: Converted scaled dot-product attention with proper mask and adjacency matrix handling (sketched after this list)
- Encoder-Decoder Structure: Maintained the original transformer-based architecture
- Custom Activation Functions: Implemented GELU activation functions consistent with the original design
- Regularization Techniques: Preserved dropout patterns and layer normalization strategies
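To make the attention item concrete, here is a minimal PyTorch sketch of scaled dot-product attention with padding-mask and adjacency handling. The tensor shapes and the additive form of the adjacency bias are assumptions for illustration, not the authors' verbatim code:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, pad_mask=None, adj_bias=None):
    """q, k, v: (batch, heads, atoms, d_k). Returns (output, attn weights)."""
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    if adj_bias is not None:
        # Additive graph bias, e.g. large negative values for non-bonded
        # atom pairs, concentrating attention on chemical neighbors.
        scores = scores + adj_bias.unsqueeze(1)  # broadcast over heads
    if pad_mask is not None:
        # pad_mask: (batch, atoms), True at padded positions.
        scores = scores.masked_fill(pad_mask[:, None, None, :], float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v), attn

# Tiny smoke test: 2 molecules, 4 heads, 10 atoms, d_k = 16.
q = k = v = torch.randn(2, 4, 10, 16)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 10, 16])
```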
The primary limitation of this work is the constrained dataset size of only 435 samples. This small dataset size presents challenges for:
- Model generalization capability
- Statistical significance of improvements
- Comprehensive evaluation across diverse molecular structures
- Robust validation of the converted model
- Training larger model variants
Despite this limitation, our 2-percentage-point improvement demonstrates the effectiveness of the PyTorch implementation and suggests potential for further enhancement with larger datasets.
This work will be further enhanced in the following directions:
- Graph Neural Network Integration: Incorporate advanced GNN architectures to better capture molecular topology
- Attention Mechanism Refinement: Develop more sophisticated attention patterns for molecular interactions
- Multi-Scale Feature Learning: Implement hierarchical feature extraction for different molecular scales
- Ensemble Methods: Combine multiple model variants for improved prediction stability
- Advanced Regularization: Implement modern regularization techniques (e.g., dropout variants, batch normalization alternatives)
- Transfer Learning: Leverage larger pre-trained molecular models for enhanced representation learning
- Data Augmentation: Develop molecular-specific augmentation strategies
- Active Learning: Implement strategies to identify most informative samples for labeling
- Cross-Domain Transfer: Explore knowledge transfer from related molecular prediction tasks
py37.yaml contains the version specifications for the packages in the installed environment. The Embeddings folder contains the antibody heavy-chain, light-chain, and antigen macromolecule embeddings; the Weights folder contains the FG-BERT weights in PyTorch format; and the classification_weights folder contains the trained model weight files.
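Summarized as a directory tree (our sketch of the description above; remaining files omitted):

```
ADCNet/
├── py37.yaml                 # package version specifications
├── Embeddings/               # antibody heavy/light-chain and antigen embeddings
├── Weights/                  # FG-BERT weights converted to PyTorch (.pth)
├── classification_weights/   # trained classification model weights
└── ...                       # scripts such as class.py, inference.py, ESM-2.py
```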
conda create -n ADCNet python=3.10
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install rdkit
pip install numpy
pip install pandas
pip install matplotlib
pip install hyperopt
pip install scikit-learn
pip install openpyxl
pip install fair-esm
Refer to this link to install Open Babel: https://openbabel.org/docs/Installation/install.html#compile-language-bindings
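Optionally, a quick import check (our suggestion, not part of the original setup) confirms the environment resolved:

```python
# Optional sanity check: core packages import and CUDA is visible.
import torch, rdkit, esm

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("rdkit", rdkit.__version__)
```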
- Run ESM-2.py to obtain embeddings for antibody heavy chain, light chain, and antigen (see the sketch after this list)
- Ensure each data entry contains the DAR value
- Create a folder named "medium3_weights" and place "bert_weightsMedium_20.h5" into it
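For reference, here is a minimal sketch of how such embeddings can be produced with the fair-esm API. The model size, the mean pooling, and the output file name are assumptions, and the sequence is a placeholder; the repository's ESM-2.py may differ:

```python
# Hypothetical sketch of ESM-2 embedding extraction (not the repo's ESM-2.py).
import torch
import esm

# esm2_t33_650M_UR50D is one common choice; the repo's model size may differ.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("heavy_chain", "EVQLVESGGGLVQPGGSLRLSCAAS")]  # placeholder sequence
labels, strs, tokens = batch_converter(data)
with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=False)

# Mean-pool per-residue representations (positions 1..L, skipping BOS/EOS).
seq_len = len(strs[0])
embedding = out["representations"][33][0, 1:seq_len + 1].mean(dim=0)
torch.save(embedding, "Embeddings/heavy_chain.pt")  # placeholder file name
```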
conda activate ADCNet
python class.py
- Run ESM-2.py to obtain embeddings for antibody heavy chain, light chain, and antigen
- Ensure each data entry contains the DAR value
- Create a folder named "classification_weights" and place "ADC_9.h5" into it
For reproducing results, run class.py directly.
conda activate ADCNet
python inference.py
If you use this PyTorch implementation in your research, please cite both the original ADCNet paper and acknowledge this implementation work.