This repository contains the code for the paper "Neural Mulliken Analysis: Molecular Graphs from Density Matrices for QSPR on Raw Quantum-Chemical Data" (also available as a preprint with fewer tests).
Here, raw quantum-chemical data in molecular-graph form are used as input for graph neural networks. The molecular graphs are derived from the one-electron density matrix and the basis-set overlap matrix.
🚧 🛠️ 🚧
The three main data types the model works with are:
- Electronic-structure information from quantum-chemical calculations (training data preparation and inference)
- Molecular graphs derived from the electronic structure (for training only)
- Tabular data, e.g. containing target values, scalings, etc. (for training data preparation only)
Quantum chemical data are extracted from the results of quantum chemical calculations performed using the ORCA quantum chemistry program package. The computational workflow is described in detail in the corresponding paper and preprint.
To automatically generate ORCA inputs and perform quantum-chemical calculations
in large batches, the repository includes the do_eldens.py script, which can
be invoked as follows:
```shell
python3 do_eldens.py --data_csv data/train_data.csv --conf_dir data/train_geoms \
    --rot_aug 50 --work_dir data/temp --orca_path $orcadir --pal 10 \
    --dft_inp_path qchem/orca_inputs/pbe_def2svp_orca29 \
    --dmt_inp_path qchem/orca_inputs/noiter_moread \
    --basis_path qchem/basis_sets/ANOR0_Ar
```

`--data_csv data/train_data.csv` The path to a .csv file specifying the list of compounds to be calculated, in the following format:
```
"isomer_id","nconf"
"153-61-7",10
"153-61-7_reflect",11
...
```
In this example, calculations on 10 conformers of Cephalothin (CASRN 153-61-7) and on 11 conformers of its mirror-image isomer are requested (the corresponding conformer geometries should already be available).
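The mapping between this .csv file and the geometry files it implies can be sketched with the standard library alone. This is an illustration, not repository code; in particular, the 1-based conformer numbering is an assumption inferred from the `153-61-7_1.xyz` example:

```python
import csv
import io

# Toy stand-in for the --data_csv file described above.
data_csv = '''"isomer_id","nconf"
"153-61-7",10
"153-61-7_reflect",11
'''

expected_files = []
for row in csv.DictReader(io.StringIO(data_csv)):
    # Conformer IDs are assumed to run 1..nconf; the repository's actual
    # numbering convention may differ.
    for conf_id in range(1, int(row["nconf"]) + 1):
        expected_files.append(f'{row["isomer_id"]}_{conf_id}.xyz')

print(len(expected_files))   # 21 geometry files expected under --conf_dir
print(expected_files[0])     # 153-61-7_1.xyz
```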
`--conf_dir data/train_geoms` The path to a folder containing geometries in .xyz format. The filenames are expected to be `<isomer_id>_<conf_id>.xyz`, e.g. `153-61-7_1.xyz`.

`--rot_aug 50` The number of randomly rotated electron-density distributions to be calculated for each conformer.

`--work_dir data/temp` The path to a folder that will contain the results of the quantum-chemical calculations (zipped ORCA logs).

`--orca_path $orcadir` The path to the ORCA executables.

`--pal 10` The number of ORCA calculations performed simultaneously.

`--dft_inp_path qchem/orca_inputs/pbe_def2svp_orca29` The path to a regular ORCA calculation input template.

`--dmt_inp_path qchem/orca_inputs/noiter_moread` The path to an input template performing MO projection using ORCA.

`--basis_path qchem/basis_sets/ANOR0_Ar` The path to a template with the basis set in which electron densities are to be expanded for model fitting.
Notes:
- `do_eldens.py` doesn't perform conformer search or geometry optimization; these should be done in advance.
- The ANO-R0 basis set, used here for electron-density expansion, is probably not the best choice in terms of either accuracy or computational efficiency.
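Rotational augmentation (`--rot_aug`) relies on randomly rotated copies of each conformer. A generic NumPy sketch of sampling uniformly random rotations via QR decomposition of a Gaussian matrix (the actual sampling scheme used by `do_eldens.py` may differ):

```python
import numpy as np

def random_rotation(rng):
    """Return a uniformly distributed 3x3 proper rotation matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))      # fix column signs -> Haar-uniform orthogonal
    if np.linalg.det(q) < 0:      # enforce a proper rotation (det = +1)
        q[:, 0] *= -1
    return q

rng = np.random.default_rng(0)
coords = np.array([[0.0, 0.0, 0.0],   # toy 2-atom geometry (Angstrom)
                   [1.1, 0.0, 0.0]])
rot = random_rotation(rng)
rotated = coords @ rot.T              # rotate all atoms about the origin

# A rotation preserves interatomic distances:
print(np.allclose(np.linalg.norm(rotated[1] - rotated[0]), 1.1))  # True
```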
Density and overlap matrices are extracted from the quantum-chemical calculation logs and split into graph node and edge embeddings. For example, diagonal density-matrix blocks are used as atomic node embeddings, while off-diagonal blocks provide embeddings for “link” nodes corresponding to atomic pairs. The overlap matrix is used in edge embeddings. Overlaps are used:
- to encode structural information
- as weights for pooling operations
- to provide access to the actual electronic charges (due to the element-wise multiplication performed while pooling)
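The last point can be illustrated with classical Mulliken population analysis: element-wise multiplication of the density and overlap matrices, followed by summation over an atom's basis functions, recovers that atom's electron population. A minimal NumPy sketch with toy matrices (not the model's actual pooling code):

```python
import numpy as np

# Toy system: 3 basis functions, atom A owns the first two, atom B the last.
P = np.array([[1.2, 0.3, 0.1],   # one-electron density matrix
              [0.3, 0.8, 0.2],
              [0.1, 0.2, 1.0]])
S = np.array([[1.0, 0.4, 0.1],   # basis-set overlap matrix
              [0.4, 1.0, 0.3],
              [0.1, 0.3, 1.0]])

atom_slices = {"A": slice(0, 2), "B": slice(2, 3)}

# Diagonal blocks P[A, A] would embed atomic nodes; off-diagonal blocks
# P[A, B] would embed "link" nodes for atomic pairs.
PS = P * S                        # element-wise product, as in the pooling
populations = {atom: PS[sl, :].sum() for atom, sl in atom_slices.items()}

# Summing the pooled values over all atoms recovers the total electron
# count tr(P @ S), which is why overlap-weighted pooling gives access to
# the actual electronic charges.
print(np.isclose(sum(populations.values()), np.trace(P @ S)))  # True
```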
For further details see the corresponding paper and preprint.
For inference, only node embeddings, edge embeddings, adjacency, and similar matrices in TF tensor format are needed; see the Inference section.
For training, these should be further converted into TF-GNN GraphTensor format.
To automatically convert ORCA outputs into GraphTensors and save them
in .tfrecord format in large batches, the repository includes the
do_tfrecords.py script, which can be invoked as follows:
```shell
python3 do_tfrecords.py --save_path data/tfrecords --record_name train \
    --schema_path schema_template --data_csv data/train_data.csv \
    --orca_outs data/temp --rot_aug 50 --overlap_thresh 0.035 \
    --multi_target --scalings_csv scalings.csv --gepol_path <gepol_path>
```

`--save_path data/tfrecords` The general path where the tfrecords will be stored.
`--record_name train` The dataset name; the actual path to the saved data is `<save_path>/<record_name>`.

`--schema_path schema_template` The path to a graph schema template.

`--data_csv data/train_data.csv` The path to a .csv file specifying the list of compounds to be included in the dataset, in the following format:
```
"cas","isomer_id","target","nconf"
"153-61-7","153-61-7",2.83,10
"153-61-7","153-61-7_reflect",2.83,11
"90-15-3","90-15-3",4.02,2
```
In this example, two records will be saved, containing graphs corresponding to all (rotated) conformers and isomers of Cephalothin (CASRN 153-61-7) and to all (rotated) conformers of 1-Naphthol (CASRN 90-15-3), as 153-61-7.tfrecord and 90-15-3.tfrecord respectively (records are grouped by the `cas` column). The `target` column contains the true values to be included in the saved records.
`--orca_outs data/temp` The path to the results of the quantum-chemical calculations. The filename format is expected to be `<isomer_id>_<conf_id>_<rot_id>.out` or `<isomer_id>_<conf_id>_<rot_id>.zip` (containing `<isomer_id>_<conf_id>_<rot_id>.out`).

`--rot_aug 50` The number of available randomly rotated electron-density distributions per conformer.

`--overlap_thresh 0.035` The graph-edge pruning threshold.
`--multi_target` Whether to include multiple targets in the records, if a model is intended to predict several properties. If set, the list of targets is read from the schema template, and the `--data_csv` file should have the following format:
```
"cas","isomer_id","target","melt","logk","nconf"
"153-61-7","153-61-7",2.83,1.600,-0.410,10
"153-61-7","153-61-7_reflect",2.83,1.600,-0.410,11
"90-15-3","90-15-3",4.02,0.957,2.850,2
```
Here, the `melt` and `logk` columns contain the true values for the additional targets.
`--scalings_csv scalings.csv` The path to a .csv file specifying scaling factors to be applied to the target values, in the following format:
```
"target","divisor"
"melt",100.0
"logk",1.0
"dipole",10.0
"vol",100.0
"surf",100.0
```
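The intended effect of these factors is a simple per-target division; a minimal stdlib sketch of that behavior (illustrative only, see `do_tfrecords.py` for the actual implementation):

```python
import csv
import io

# Toy stand-in for the --scalings_csv file.
scalings_csv = '''"target","divisor"
"melt",100.0
"logk",1.0
'''
divisors = {row["target"]: float(row["divisor"])
            for row in csv.DictReader(io.StringIO(scalings_csv))}

raw = {"melt": 1.600, "logk": -0.410}          # values from the data .csv
scaled = {k: v / divisors[k] for k, v in raw.items()}
# "melt" is scaled to ~0.016 while "logk" is left unchanged.
print(scaled)
```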
`--gepol_path <gepol_path>` The path to the GEPOL93 executables for molecular volume and surface calculation. Molecular volume and surface may be used as additional targets for regularization purposes.
Notes:
- `do_tfrecords.py` doesn't perform quantum-chemical calculations or rotational augmentation; these should be done in advance, e.g. using the `do_eldens.py` script.
- Aside from molecular volume and surface, the use of the electric dipole moment amplitude as an additional target is implemented. The necessary values are automatically extracted from the calculation results if the schema template contains the `dipole` key.
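Edge pruning by `--overlap_thresh` can be pictured as thresholding absolute overlap values: only atomic pairs whose overlap exceeds the threshold become graph edges. A toy NumPy sketch (the repository's actual pruning logic may differ):

```python
import numpy as np

# Toy 3x3 basis-set overlap matrix.
S = np.array([[1.00, 0.40, 0.02],
              [0.40, 1.00, 0.30],
              [0.02, 0.30, 1.00]])
thresh = 0.035   # as in --overlap_thresh 0.035

# Keep only off-diagonal pairs whose absolute overlap exceeds the threshold.
i, j = np.triu_indices_from(S, k=1)
keep = np.abs(S[i, j]) > thresh
edges = [(int(a), int(b)) for a, b in zip(i[keep], j[keep])]
print(edges)  # [(0, 1), (1, 2)] -- the (0, 2) pair is pruned
```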
Tabular data are used only to prepare the necessary quantum-chemical and graph data. Stored as .csv, they have the following format:
```
"cas","isomer_id","target","melt","logk","nconf"
"153-61-7","153-61-7",2.83,1.600,-0.410,10
"153-61-7","153-61-7_reflect",2.83,1.600,-0.410,11
"90-15-3","90-15-3",4.02,0.957,2.850,2
```
- `cas`: available compound IDs
- `isomer_id`: available isomer IDs
- `nconf`: number of available conformers per isomer
- `target`: true primary target values
- `melt`: true additional target values
- `logk`: true additional target values
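Since records are grouped by the `cas` column, each unique `cas` value maps to one .tfrecord file collecting all of that compound's isomers. A small stdlib sketch of that grouping (illustrative only):

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for the tabular .csv described above.
table = '''"cas","isomer_id","target","melt","logk","nconf"
"153-61-7","153-61-7",2.83,1.600,-0.410,10
"153-61-7","153-61-7_reflect",2.83,1.600,-0.410,11
"90-15-3","90-15-3",4.02,0.957,2.850,2
'''

records = defaultdict(list)
for row in csv.DictReader(io.StringIO(table)):
    records[row["cas"]].append(row["isomer_id"])

# One output file per compound, e.g. <save_path>/<record_name>/153-61-7.tfrecord
print(sorted(records))        # ['153-61-7', '90-15-3']
print(records["153-61-7"])    # ['153-61-7', '153-61-7_reflect']
```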
The `fit_utils` module provides several functions to train an RhNet2 model:
- `model_fit` fits a model (optionally with validation) and returns the trained model and loss history
- `ensemble_fit` sequentially fits several models and saves them along with the loss histories
- `kfold_val` performs K-fold cross-validation and saves only the loss histories
- `loo_val` performs Leave-One-Out cross-validation and saves the trained models along with the loss histories
The repository contains the example scripts `fit_example.py` and `loo_example.py`, showing the usage of the `model_fit` and `loo_val` functions.
For model details see the corresponding paper and preprint.
For inference, graphs and the TF-GNN module are not needed. The results of quantum-chemical calculations can be parsed directly into TF tensors suitable for inference. Inference examples along with trained models are provided in separate repositories:
- The Second Solubility Challenge (SC-2, link, link) solution on GitHub (a lighter model trained on a slightly bigger dataset, better suited to play with).
- The First Solubility Challenge (SC-1, link, link) solution on GitHub and on Hugging Face (an ensemble of heavier models trained on a tiny dataset, probably too heavy to play with).


