CatDRX is a catalyst discovery framework powered by a reaction-conditioned variational autoencoder generative model that generates catalysts and predicts their catalytic performance.
This code was tested with Python 3.8, PyTorch, and RDKit.
- Using Conda:
conda env create -f catdrx.yaml
- Then, activate the environment
conda activate catdrx
- Prepare the dataset in the dataset/ folder. The dataset should be in .csv format
- Indicate the dataset metadata in dataset/_dataset.py
- The dataset metadata should include:
  - file: name of the dataset
  - smiles: column names for reactant, reagent, product, and catalyst. All columns are required, except for reagent (can be None)
  - task: column name for the task
  - ids: column name for the unique id
  - splitting: column name for the splitting (train, valid, test). If random splitting is used, the column name can be None
  - predictiontask: task name for the prediction task (yield, others)
  - time: column name for the reaction time
  - condition_dict: dictionary for the condition columns. For the current version, only catalyst molecular weight is supported. Please refer to the example in the file.
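Before training, it can help to confirm that the CSV header contains every column named in the metadata. The sketch below is only an illustration: the dictionary layout, the dataset name, and all column names are hypothetical, and the actual structure expected by dataset/_dataset.py may differ (condition_dict is left out here; see the example in that file).

```python
import csv
import io

# Hypothetical metadata entry: the keys follow the list above, but the exact
# layout expected by dataset/_dataset.py is an assumption.
example_metadata = {
    "file": "my_reactions",             # hypothetical dataset name
    "smiles": {
        "reactant": "reactant_smiles",  # hypothetical column names
        "reagent": None,                # reagent is optional
        "product": "product_smiles",
        "catalyst": "catalyst_smiles",
    },
    "task": "yield",
    "ids": "reaction_id",
    "splitting": None,                  # None means random splitting
    "predictiontask": "yield",
    "time": "time_h",
}


def expected_columns(meta):
    """Collect every column name the metadata says the CSV must contain."""
    columns = [name for name in meta["smiles"].values() if name is not None]
    for key in ("task", "ids", "splitting", "time"):
        if meta.get(key) is not None:
            columns.append(meta[key])
    return columns


def missing_columns(csv_text, meta):
    """Return the metadata columns that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in expected_columns(meta) if name not in header]


# Header-only sample standing in for a real dataset file.
sample = "reaction_id,reactant_smiles,product_smiles,catalyst_smiles,yield,time_h\n"
print(missing_columns(sample, example_metadata))
```

An empty list means every column referenced by the metadata is present in the header.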
- Prepare the dataset
- Put the pre-training dataset in the dataset/ folder
- Insert the dataset metadata in dataset/_dataset.py
- Example dataset: dataset/ord.csv
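The metadata allows the splitting column to be None, in which case random splitting is used. If a fixed, reproducible split is preferred instead, one possible approach is to add a split column to the CSV beforehand. This is a minimal sketch, not part of CatDRX: the column name "split", the 80/10/10 fractions, and the seed are assumptions; only the labels train, valid, and test come from the metadata description.

```python
import csv
import io
import random


def add_split_column(csv_text, column="split", fractions=(0.8, 0.1, 0.1), seed=0):
    """Append a train/valid/test label to every row of a CSV string.

    `fractions` gives the train and valid shares; all remaining rows
    are labelled test.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fieldnames = list(reader.fieldnames) + [column]

    # Shuffle row indices with a fixed seed so the split is reproducible.
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)

    n_train = int(fractions[0] * len(rows))
    n_valid = int(fractions[1] * len(rows))
    for rank, index in enumerate(order):
        if rank < n_train:
            rows[index][column] = "train"
        elif rank < n_train + n_valid:
            rows[index][column] = "valid"
        else:
            rows[index][column] = "test"

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Tiny in-memory example with ten placeholder rows.
sample_csv = "id,catalyst\n" + "".join(f"{i},C\n" for i in range(10))
print(add_split_column(sample_csv))
```

The chosen column name would then be given as the splitting entry in the dataset metadata.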
- Pre-train the model
- Run the following command
python3 main_prediction.py \
--file [dataset] \
--epochs [epochs] \
--class_weight disabled \
--augmentation 5 \
--teacher_forcing
- [dataset] = name of the dataset without the .csv extension
- [epochs] = number of epochs
- [batch_size] = batch size
- For other parameters, please refer to the catcvae/setup.py file
- Download the pre-trained model from here
- Prepare the dataset
- Put the fine-tuning dataset in the dataset/ folder
- Insert the dataset metadata in dataset/_dataset.py
- Example dataset: dataset/sm.csv
- Fine-tune the model
- Run the following command
python3 main_finetune.py \
--file [dataset] \
--alpha [alpha] \
--beta [beta] \
--batch_size [batch_size] \
--epochs [epochs] \
--lr [lr] \
--class_weight [class_weight] \
--teacher_forcing \
--pretrained_file [pretrained_dataset] \
--pretrained_time [pretrained_dataset_folder]
- Run the following command
python3 main_finetune_task.py \
--file [dataset] \
--alpha [alpha] \
--beta [beta] \
--batch_size [batch_size] \
--epochs [epochs] \
--lr [lr] \
--class_weight [class_weight] \
--teacher_forcing \
--pretrained_file [pretrained_dataset] \
--pretrained_time [pretrained_dataset_folder]
- [dataset] = name of the dataset without the .csv extension
- [alpha] = alpha value for the reconstruction loss function
- [beta] = beta value for the KL loss function
- [batch_size] = batch size
- [epochs] = number of epochs
- [lr] = learning rate
- [class_weight] = class weight for the loss function (disabled or enabled)
- [pretrained_dataset] = name of the pre-trained dataset without the .csv extension
- [pretrained_dataset_folder] = name of the pre-trained dataset sub-folder (without output_[seed])
- For other parameters, please refer to the catcvae/setup.py file (Note: the core architecture parameters must be the same as those of the pre-trained model)
- The fine-tuned model will be saved in the dataset/[dataset]/output_[seed]_[dateandtime] folder
- The performance results will be recorded in dataset/[dataset]/hyper_test.txt
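Because --pretrained_time expects the sub-folder name without the output_[seed] part, a small helper can extract that value from an output folder name. This is a sketch under assumptions: it presumes the seed is an integer and that an underscore separates it from the rest of the name, and it treats the date-and-time portion as opaque text. The folder names in the example are made up.

```python
import re
from pathlib import Path

# Assumed pattern: output_<integer seed>_<date and time>.
# The date-and-time part is not parsed, only captured as text.
_OUTPUT_DIR = re.compile(r"^output_(\d+)_(.+)$")


def split_output_folder(folder_name):
    """Split 'output_<seed>_<suffix>' into (seed, suffix)."""
    match = _OUTPUT_DIR.match(folder_name)
    if match is None:
        raise ValueError(f"not an output folder name: {folder_name!r}")
    return int(match.group(1)), match.group(2)


def list_runs(dataset_dir):
    """Return (seed, suffix) for each matching sub-folder of dataset_dir."""
    runs = []
    for path in sorted(Path(dataset_dir).iterdir()):
        if path.is_dir() and _OUTPUT_DIR.match(path.name):
            runs.append(split_output_folder(path.name))
    return runs


# Hypothetical folder name; the real timestamp format may differ.
seed, suffix = split_output_folder("output_0_20250101_120000")
print(seed, suffix)
```

Under these assumptions, the returned suffix is the value to pass as --pretrained_time, and list_runs("dataset/sm") would enumerate the runs saved for a dataset named sm.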
- Visualize embedding space
- Run the following command
python3 embeddingspace.py \
--file [dataset] \
--pretrained_file [finetuned_dataset] \
--pretrained_time [finetuned_dataset_folder]
- [dataset] = name of the fine-tuned dataset without the .csv extension
- [finetuned_dataset] = name of the fine-tuned dataset without the .csv extension (usually the same as above)
- [finetuned_dataset_folder] = name of the fine-tuned dataset sub-folder (without output_[seed])
- The result will be saved in the dataset/[finetuned_dataset]/output_[seed]_[finetuned_dataset_folder] folder
- Generate new catalysts
- Run the following command
python3 generation.py \
--file [dataset] \
--pretrained_file [finetuned_dataset] \
--pretrained_time [finetuned_dataset_folder] \
--correction [correction] \
--from_around_mol [from_around_mol] \
--from_around_mol_cond [from_around_mol_cond] \
--from_training_space [from_training_space]
- [dataset] = name of the dataset without the .csv extension
- [finetuned_dataset] = name of the fine-tuned dataset without the .csv extension
- [finetuned_dataset_folder] = name of the fine-tuned dataset sub-folder (without output_[seed])
- [correction] = correction in the post-processing step (disabled or enabled)
- [from_around_mol] = generate from a molecule sampled from the training set (disabled or enabled)
- [from_around_mol_cond] = generate using the sampled molecule's condition (disabled or enabled)
- [from_training_space] = restrict generation to the training space (disabled or enabled)
- For other settings related to the number of molecules, task-specific validity, and generation parameters, please edit the generation.py file directly
- The result will be saved in the dataset/[finetuned_dataset]/output_[seed]_[finetuned_dataset_folder] folder
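When several generation settings are compared, assembling the command in a script avoids typos in the enabled/disabled values. The following is a minimal sketch that only builds and prints the command; the flag names are taken from the command above, while the helper names, the default values, and the folder name are assumptions for illustration and are not the script's own defaults.

```python
import shlex


def on_off(flag):
    """Translate a boolean into the 'enabled'/'disabled' strings used above."""
    return "enabled" if flag else "disabled"


def generation_command(dataset, finetuned_dataset, finetuned_folder,
                       correction=True, from_around_mol=False,
                       from_around_mol_cond=False, from_training_space=False):
    """Build the argument list for generation.py (defaults are illustrative)."""
    return [
        "python3", "generation.py",
        "--file", dataset,
        "--pretrained_file", finetuned_dataset,
        "--pretrained_time", finetuned_folder,
        "--correction", on_off(correction),
        "--from_around_mol", on_off(from_around_mol),
        "--from_around_mol_cond", on_off(from_around_mol_cond),
        "--from_training_space", on_off(from_training_space),
    ]


# "sm" matches the example fine-tuning dataset; the folder name is hypothetical.
cmd = generation_command("sm", "sm", "20250101_120000", from_around_mol=True)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root to launch the run.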
- Generate with optimization
- Run the following command
python3 optimization.py \
--file [dataset] \
--pretrained_file [finetuned_dataset] \
--pretrained_time [finetuned_dataset_folder] \
--opt_strategy [opt_strategy]
- [dataset] = name of the dataset without the .csv extension
- [finetuned_dataset] = name of the fine-tuned dataset without the .csv extension
- [finetuned_dataset_folder] = name of the fine-tuned dataset sub-folder (without output_[seed])
- [opt_strategy] = optimization strategy ('at_random' or 'around_target')
- For other settings related to the number of molecules, objective function, and optimization parameters, please edit the optimization.py file directly
- The result will be saved in the dataset/[finetuned_dataset]/output_[seed]_[finetuned_dataset_folder] folder
- Kengkanna A., Kikuchi Y., Niwa T., Ohue M. Reaction-conditioned generative model for catalyst design and optimization with CatDRX. Communications Chemistry, 8: 314, 2025. doi: 10.1038/s42004-025-01732-7