This repository is the official implementation of our paper:
Learning Invariant Molecular Representation in Latent Discrete Space
Xiang Zhuang, Qiang Zhang*, Keyan Ding, Yatao Bian, Xiao Wang, Jingsong Lv, Hongyang Chen, Huajun Chen* (* denotes correspondence)
Advances in Neural Information Processing Systems (NeurIPS) 2023
To run the code successfully, the following dependencies need to be installed:
Python 3.8
torch 1.10.1
torch_geometric 2.0.4
torch_scatter 2.0.9
torch_cluster 1.6.0
torch_sparse 0.6.13
torch_spline_conv 1.2.1
rdkit_pypi 2022.9.5
vector_quantize_pytorch 1.0.7
ogb 1.3.6
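To confirm that an environment matches the pinned versions above, a small check along the following lines may help. This is a minimal sketch, not part of the repository: the helper name `check_versions` is hypothetical, and it assumes each package is registered under the distribution name shown in the list.

```python
from importlib import metadata

# Pinned versions copied from the dependency list above.
PINNED = {
    "torch": "1.10.1",
    "torch_geometric": "2.0.4",
    "torch_scatter": "2.0.9",
    "torch_cluster": "1.6.0",
    "torch_sparse": "0.6.13",
    "torch_spline_conv": "1.2.1",
    "rdkit_pypi": "2022.9.5",
    "vector_quantize_pytorch": "1.0.7",
    "ogb": "1.3.6",
}


def check_versions(pinned=PINNED):
    """Return {package: (expected, found)} for each mismatch.

    `found` is None when the package is not installed.
    (Hypothetical helper, not shipped with this repository.)
    """
    problems = {}
    for name, expected in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Ignore local build tags such as "+cu113" when comparing.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_versions().items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every listed package is installed at the pinned version. GOOD and DrugOOD are not covered by this check.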
This repo also depends on GOOD and DrugOOD. Please follow the installation instructions provided for each package:
- GOOD (Version 1.1.1)
  - Repository: https://github.com/divelab/GOOD/
  - Installation: Please follow the instructions provided in the repository to install.
- DrugOOD (Version 0.0.1)
  - Repository: https://github.com/tencent-ailab/DrugOOD
  - Installation: Please follow the instructions provided in the repository to install.
The data used in the experiments can be downloaded from the following sources:
- GOOD
- DrugOOD
  - Download from link.
  - Extract the downloaded file and save the contents in the drugood-data-chembl30 directory.
An example of the folder hierarchy after adding the data files:
├── data
│ ├── GOODHIV
│ ├── GOODPCBA
│ ├── GOODZINC
├── drugood-data-chembl30
│ ├── lbap_core_ec50_assay.json
│ └── ...
├── models
│ ├── model.py
│ └── ...
├── run.py
└── README.md
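Before launching an experiment, the layout above can be checked with a short script. The following is a minimal sketch under the assumption that the script is run from the repository root; the helper `missing_entries` is hypothetical, and only the entries named explicitly in the example hierarchy are checked (the `...` placeholders are skipped).

```python
from pathlib import Path

# Entries taken from the example hierarchy above.
EXPECTED = [
    "data/GOODHIV",
    "data/GOODPCBA",
    "data/GOODZINC",
    "drugood-data-chembl30/lbap_core_ec50_assay.json",
    "models/model.py",
    "run.py",
]


def missing_entries(root="."):
    """Return the expected paths that do not exist under `root`.

    (Hypothetical helper, not shipped with this repository.)
    """
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_entries():
        print(f"missing: {rel}")
```

A single experiment probably needs only the dataset it uses, so a missing entry for another dataset may be harmless.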
python run.py --dataset GOODZINC --domain scaffold --shift concept --num_e 4000 --bs 256 --gamma 0.5 --inv_w 0.01 --reg_w 0.5 --gpu 0 --exp_name ZINC --exp_id scaffold-concept
Running parameters and descriptions are as follows:
| Parameter | Description | Choices |
|---|---|---|
| dataset | name of dataset | GOODHIV, GOODZINC, GOODPCBA, ic50_assay, ic50_scaffold, ic50_size, ec50_assay, ec50_scaffold, ec50_size. |
| domain | environment-splitting strategy | scaffold, size. Only needs to be specified for datasets in GOOD. |
| shift | type of distribution shift | covariate, concept. Only needs to be specified for datasets in GOOD. |
| num_e | codebook size | - |
| bs | batch size | - |
| gamma | threshold | - |
| inv_w | - | |
| reg_w | - | |
| gpu | which GPU to use | - |
| exp_name | experiment name | - |
| exp_id | experiment ID | - |
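To illustrate how these parameters fit together, the sketch below declares them with `argparse`. It is an illustration only and not the parser used in `run.py`: the argument types are assumptions inferred from the example command, and `build_parser` and `parse` are hypothetical names.

```python
import argparse

# Dataset choices copied from the table above.
DATASETS = [
    "GOODHIV", "GOODZINC", "GOODPCBA",
    "ic50_assay", "ic50_scaffold", "ic50_size",
    "ec50_assay", "ec50_scaffold", "ec50_size",
]


def build_parser():
    """Illustrative parser; types are assumed from the example command."""
    parser = argparse.ArgumentParser(description="Illustrative sketch, not run.py")
    parser.add_argument("--dataset", choices=DATASETS, required=True,
                        help="name of dataset")
    parser.add_argument("--domain", choices=["scaffold", "size"],
                        help="environment-splitting strategy (GOOD only)")
    parser.add_argument("--shift", choices=["covariate", "concept"],
                        help="type of distribution shift (GOOD only)")
    parser.add_argument("--num_e", type=int, help="codebook size")
    parser.add_argument("--bs", type=int, help="batch size")
    parser.add_argument("--gamma", type=float, help="threshold")
    parser.add_argument("--inv_w", type=float)
    parser.add_argument("--reg_w", type=float)
    parser.add_argument("--gpu", type=int, help="which GPU to use")
    parser.add_argument("--exp_name", type=str, help="experiment name")
    parser.add_argument("--exp_id", type=str, help="experiment ID")
    return parser


def parse(argv=None):
    """Parse arguments and enforce the GOOD-only requirement from the table."""
    parser = build_parser()
    args = parser.parse_args(argv)
    is_good = args.dataset.startswith("GOOD")
    if is_good and (args.domain is None or args.shift is None):
        parser.error("--domain and --shift must be specified for GOOD datasets")
    return args
```

With this sketch, `parse(["--dataset", "ic50_assay"])` is accepted without `--domain` or `--shift`, while a GOOD dataset without them is rejected.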
The hyperparameters used to train each dataset are listed in the Appendix, and the corresponding checkpoints are available on the release page.
python eval.py --dataset GOODZINC --domain scaffold --shift concept --load_path checkpoint/GOODZINC-scaffold-concept.pkl
The load_path parameter specifies the path of the checkpoint to load.
If you use or extend our work, please cite the paper as follows:
@InProceedings{zhuang2023learning,
title={Learning Invariant Molecular Representation in Latent Discrete Space},
author={Xiang Zhuang and Qiang Zhang and Keyan Ding and Yatao Bian and Xiao Wang and Jingsong Lv and Hongyang Chen and Huajun Chen},
booktitle={Advances in Neural Information Processing Systems},
year={2023}
}