CatDRX: Reaction-Conditioned Generative Model for Catalyst Design and Optimization

CatDRX is a catalyst discovery framework built on a reaction-conditioned variational autoencoder that generates catalysts and predicts their catalytic performance.

Usage 💻

Install environment

This code was tested with Python 3.8, PyTorch, and RDKit.

Using Conda: conda env create -f catdrx.yml
Then activate the environment: conda activate catdrx
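
After activating the environment, a quick sanity check can confirm that the two packages named above are importable. This is a minimal sketch and not part of the repository; the function name check_environment is hypothetical, and it only tests importability, not versions.

```python
import importlib.util
import sys


def check_environment(required=("torch", "rdkit")):
    """Map each required package name to True if it can be imported, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in required}


if __name__ == "__main__":
    # Print the interpreter version, then the status of each package.
    print("Python", ".".join(map(str, sys.version_info[:3])))
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```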

Dataset

Prepare the dataset in the dataset/ folder. The dataset should be in .csv format.
Declare the dataset metadata in dataset/_dataset.py
The dataset metadata should include:
- file: name of the dataset
- smiles: column names for reactant, reagent, product, and catalyst. All columns are required, except for reagent (can be None)
- task: column name for the task
- ids: column name for the unique id
- splitting: column name for the splitting (train, valid, test). If random splitting is used, the column name can be None
- predictiontask: task name for the prediction task (yield, others)
- time: column name for the reaction time
- condition_dict: dictionary for the condition columns. In the current version, only catalyst molecular weight is supported. Please refer to the example in the file.
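
The fields above could be collected in a dictionary along the following lines. This is only a sketch: the key names come from the list above, but the nesting, the column names, and the helper find_metadata_problems are assumptions made for illustration. The authoritative layout is the example in dataset/_dataset.py.

```python
# Hypothetical metadata entry. Key names follow the list above; the exact
# structure expected by dataset/_dataset.py is an assumption.
example_metadata = {
    "file": "my_reactions",              # dataset name (hypothetical)
    "smiles": {                          # column names holding SMILES strings
        "reactant": "reactant_smiles",
        "reagent": None,                 # reagent is optional and can be None
        "product": "product_smiles",
        "catalyst": "catalyst_smiles",
    },
    "task": "yield",                     # column name for the task
    "ids": "reaction_id",                # column name for the unique id
    "splitting": None,                   # None when random splitting is used
    "predictiontask": "yield",           # "yield" or "others"
    "time": "time_h",                    # column name for the reaction time
    "condition_dict": {"catalyst_mw": "catalyst_mw"},  # hypothetical layout
}

REQUIRED_KEYS = ("file", "smiles", "task", "ids", "splitting",
                 "predictiontask", "time", "condition_dict")
REQUIRED_SMILES = ("reactant", "product", "catalyst")


def find_metadata_problems(meta):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in meta]
    smiles = meta.get("smiles") or {}
    for role in REQUIRED_SMILES:
        if not smiles.get(role):
            problems.append(f"missing SMILES column for: {role}")
    return problems


if __name__ == "__main__":
    print(find_metadata_problems(example_metadata) or "metadata looks complete")
```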

Pre-training

Pre-train your own model

Prepare the dataset

Put the pre-training dataset in dataset/ folder
Insert dataset metadata in dataset/_dataset.py
Example dataset: dataset/ord.csv

Pre-train the model

Run the following command

python3 main_prediction.py \
--file [dataset] \
--epochs [epochs] \
--batch_size [batch_size] \
--class_weight disabled \
--augmentation 5 \
--teacher_forcing

[dataset] = name of dataset without .csv extension
[epochs] = number of epochs
[batch_size] = batch size
For other parameters, please refer to the catcvae/setup.py file

Use pre-trained model

Download the pre-trained model from here

Fine-tuning

Prepare the dataset

Put the fine-tuning dataset in dataset/ folder
Insert dataset metadata in dataset/_dataset.py
Example dataset: dataset/sm.csv

Fine-tuning the model

For yield prediction task

Run the following command

python3 main_finetune.py \
--file [dataset] \
--alpha [alpha] \
--beta [beta] \
--batch_size [batch_size] \
--epochs [epochs] \
--lr [lr] \
--class_weight [class_weight] \
--teacher_forcing \
--pretrained_file [pretrained_dataset] \
--pretrained_time [pretrained_dataset_folder]

For other catalytic activity prediction tasks

Run the following command

python3 main_finetune_task.py \
--file [dataset] \
--alpha [alpha] \
--beta [beta] \
--batch_size [batch_size] \
--epochs [epochs] \
--lr [lr] \
--class_weight [class_weight] \
--teacher_forcing \
--pretrained_file [pretrained_dataset] \
--pretrained_time [pretrained_dataset_folder]

[dataset] = name of dataset without .csv extension
[alpha] = alpha value for the reconstruction loss function
[beta] = beta value for the KL loss function
[batch_size] = batch size
[epochs] = number of epochs
[lr] = learning rate
[class_weight] = class weight for the loss function (disabled or enabled)
[pretrained_dataset] = name of the pre-trained dataset without .csv extension
[pretrained_dataset_folder] = name of the pre-trained dataset sub-folder (without output_[seed])
For other parameters, please refer to the catcvae/setup.py file (Note: the core architecture parameters must be the same as those of the pre-trained model)
The fine-tuned model will be saved in the dataset/[dataset]/output_[seed]_[dateandtime] folder
The performance results will be recorded in dataset/[dataset]/hyper_test.txt
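
Later commands take the sub-folder name without the output_[seed] prefix, so it can help to list the saved runs and split their names. The sketch below assumes the folder naming shown above and that [seed] contains no underscore; list_finetuned_runs is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def list_finetuned_runs(dataset, root="dataset"):
    """Return (seed, dateandtime, path) for each output_[seed]_[dateandtime] folder."""
    runs = []
    for path in sorted(Path(root, dataset).glob("output_*")):
        if not path.is_dir():
            continue
        # Split into ["output", seed, dateandtime]; assumes the seed has no "_".
        parts = path.name.split("_", 2)
        if len(parts) == 3:
            runs.append((parts[1], parts[2], path))
    return runs


if __name__ == "__main__":
    # "sm" is the example fine-tuning dataset mentioned above.
    for seed, stamp, path in list_finetuned_runs("sm"):
        print(f"seed={seed}  sub-folder name={stamp}  path={path}")
```

The second element of each tuple is the part of the folder name that remains once output_[seed] is removed.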

Embedding space

Visualize embedding space

Run the following command

python3 embeddingspace.py \
--file [dataset] \
--pretrained_file [finetuned_dataset] \
--pretrained_time [finetuned_dataset_folder]

[dataset] = name of the fine-tuned dataset without .csv extension
[finetuned_dataset] = name of the fine-tuned dataset without .csv extension (usually the same as above)
[finetuned_dataset_folder] = name of the fine-tuned dataset sub-folder (without output_[seed])
The result will be saved in the dataset/[finetuned_dataset]/output_[seed]_[finetuned_dataset_folder] folder

Generation and Optimization

Generate new catalysts

Run the following command

python3 generation.py \
--file [dataset] \
--pretrained_file [finetuned_dataset] \
--pretrained_time [finetuned_dataset_folder] \
--correction [correction] \
--from_around_mol [from_around_mol] \
--from_around_mol_cond [from_around_mol_cond] \
--from_training_space [from_training_space]

[dataset] = name of dataset without .csv extension
[finetuned_dataset] = name of the fine-tuned dataset without .csv extension
[finetuned_dataset_folder] = name of the fine-tuned dataset sub-folder (without output_[seed])
[correction] = correction in post-processing step (disabled or enabled)
[from_around_mol] = generate using a molecule sampled from the training set (disabled or enabled)
[from_around_mol_cond] = generate using the sampled molecule's condition (disabled or enabled)
[from_training_space] = restrict generation to the training space (disabled or enabled)
For other settings related to the number of molecules, task-specific validity, and generation parameters, please edit the generation.py file directly
The result will be saved in the dataset/[finetuned_dataset]/output_[seed]_[finetuned_dataset_folder] folder

Generate with optimization

Run the following command

python3 optimization.py \
--file [dataset] \
--pretrained_file [finetuned_dataset] \
--pretrained_time [finetuned_dataset_folder] \
--opt_strategy [opt_strategy]

[dataset] = name of dataset without .csv extension
[finetuned_dataset] = name of the fine-tuned dataset without .csv extension
[finetuned_dataset_folder] = name of the fine-tuned dataset sub-folder (without output_[seed])
[opt_strategy] = optimization strategy (at_random or around_target)
For other settings related to the number of molecules, objective function, and optimization parameters, please edit the optimization.py file directly
The result will be saved in the dataset/[finetuned_dataset]/output_[seed]_[finetuned_dataset_folder] folder
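
To compare both optimization strategies, the command above can be assembled programmatically. The sketch below only builds and prints the argument lists; build_optimization_command is a hypothetical helper, and the dataset name and sub-folder name passed to it are placeholder values.

```python
import shlex

# The two strategies listed above.
OPT_STRATEGIES = ("at_random", "around_target")


def build_optimization_command(dataset, finetuned_dataset, finetuned_folder, opt_strategy):
    """Build the argument list for optimization.py, rejecting unknown strategies."""
    if opt_strategy not in OPT_STRATEGIES:
        raise ValueError(f"opt_strategy must be one of {OPT_STRATEGIES}, got {opt_strategy!r}")
    return [
        "python3", "optimization.py",
        "--file", dataset,
        "--pretrained_file", finetuned_dataset,
        "--pretrained_time", finetuned_folder,
        "--opt_strategy", opt_strategy,
    ]


if __name__ == "__main__":
    for strategy in OPT_STRATEGIES:
        # "sm" and the sub-folder name are placeholder values.
        cmd = build_optimization_command("sm", "sm", "20250101_120000", strategy)
        print(shlex.join(cmd))
```

To run a command instead of printing it, pass the list to subprocess.run(cmd, check=True) from the repository root.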

Citation 📃

Kengkanna A., Kikuchi Y., Niwa T., Ohue M. Reaction-conditioned generative model for catalyst design and optimization with CatDRX. Communications Chemistry, 8: 314, 2025. doi: 10.1038/s42004-025-01732-7
