This repository contains the code for the paper "Neural Mulliken Analysis: Molecular Graphs from Density Matrices for QSPR on Raw Quantum-Chemical Data" (also available as a preprint with fewer tests).
Here, raw quantum-chemical data in molecular-graph form are used as input for graph neural networks. The molecular graphs are derived from the one-electron density matrix and the basis-set overlap matrix.
🚧 🛠️ 🚧
The three main data types the model works with are:
- Electronic-structure information from quantum-chemical calculations (training data preparation and inference)
- Molecular graphs derived from the electronic structure (for training only)
- Tabular data, e.g. containing target values, scalings, etc. (for training data preparation only)
Quantum chemical data are extracted from the results of quantum chemical calculations performed using the ORCA quantum chemistry program package. The computational workflow is described in detail in the corresponding paper and preprint.
To automatically generate ORCA inputs and perform quantum-chemical calculations
in large batches, the repository includes the do_eldens.py script, which can
be invoked as follows:
```shell
python3 do_eldens.py --data_csv data/train_data.csv --conf_dir data/train_geoms \
    --rot_aug 50 --work_dir data/temp --orca_path $orcadir --pal 10 \
    --dft_inp_path qchem/orca_inputs/pbe_def2svp_orca29 \
    --dmt_inp_path qchem/orca_inputs/noiter_moread \
    --basis_path qchem/basis_sets/ANOR0_Ar
```

`--data_csv data/train_data.csv` The path to a .csv file specifying the list of compounds to be calculated, in the following format:
```
"isomer_id","nconf"
"153-61-7",10
"153-61-7_reflect",11
...
```
In this example, calculations on 10 conformers of Cephalothin (CASRN 153-61-7) and on 11 conformers of its mirror-image isomer are requested (the corresponding conformer geometries should already be available).
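The mapping between this .csv file and the geometry files it implies can be sketched with the standard library alone. This is an illustration, not repository code; in particular, the 1-based conformer numbering is an assumption inferred from the `153-61-7_1.xyz` example:

```python
import csv
import io

# Toy stand-in for the --data_csv file described above.
data_csv = '''"isomer_id","nconf"
"153-61-7",10
"153-61-7_reflect",11
'''

expected_files = []
for row in csv.DictReader(io.StringIO(data_csv)):
    # Conformer IDs are assumed to run 1..nconf; the repository's actual
    # numbering convention may differ.
    for conf_id in range(1, int(row["nconf"]) + 1):
        expected_files.append(f'{row["isomer_id"]}_{conf_id}.xyz')

print(len(expected_files))   # 21 geometry files expected under --conf_dir
print(expected_files[0])     # 153-61-7_1.xyz
```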
`--conf_dir data/train_geoms` The path to a folder containing geometries in .xyz format. The filenames are expected to be `<isomer_id>_<conf_id>.xyz`, e.g. `153-61-7_1.xyz`.

`--rot_aug 50` The number of randomly rotated electron-density distributions to be calculated for each conformer.

`--work_dir data/temp` The path to a folder that will contain the results of the quantum-chemical calculations (zipped ORCA logs).

`--orca_path $orcadir` The path to the ORCA executables.

`--pal 10` The number of ORCA calculations performed simultaneously.

`--dft_inp_path qchem/orca_inputs/pbe_def2svp_orca29` The path to a regular ORCA calculation input template.

`--dmt_inp_path qchem/orca_inputs/noiter_moread` The path to an input template performing MO projection using ORCA.

`--basis_path qchem/basis_sets/ANOR0_Ar` The path to a template with the basis set in which electron densities are to be expanded for model fitting.
Notes:
- `do_eldens.py` doesn't perform conformer search or geometry optimization; these should be done in advance.
- The ANO-R0 basis set, used here for electron-density expansion, is probably not the best choice in terms of either accuracy or computational efficiency.
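Rotational augmentation (`--rot_aug`) relies on randomly rotated copies of each conformer. A generic NumPy sketch of sampling uniformly random rotations via QR decomposition of a Gaussian matrix (the actual sampling scheme used by `do_eldens.py` may differ):

```python
import numpy as np

def random_rotation(rng):
    """Return a uniformly distributed 3x3 proper rotation matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))      # fix column signs -> Haar-uniform orthogonal
    if np.linalg.det(q) < 0:      # enforce a proper rotation (det = +1)
        q[:, 0] *= -1
    return q

rng = np.random.default_rng(0)
coords = np.array([[0.0, 0.0, 0.0],   # toy 2-atom geometry (Angstrom)
                   [1.1, 0.0, 0.0]])
rot = random_rotation(rng)
rotated = coords @ rot.T              # rotate all atoms about the origin

# A rotation preserves interatomic distances:
print(np.allclose(np.linalg.norm(rotated[1] - rotated[0]), 1.1))  # True
```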
Density and overlap matrices are extracted from the quantum-chemical calculation logs and split into graph node and edge embeddings. For example, diagonal density-matrix blocks are used as atomic node embeddings, while off-diagonal blocks provide embeddings for “link” nodes corresponding to atomic pairs. The overlap matrix is used in edge embeddings. Overlaps are used:
- to encode structural information
- as weights for pooling operations
- to provide access to the actual electronic charges (due to the element-wise multiplication performed while pooling)
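The last point can be illustrated with classical Mulliken population analysis: element-wise multiplication of the density and overlap matrices, followed by summation over an atom's basis functions, recovers that atom's electron population. A minimal NumPy sketch with toy matrices (not the model's actual pooling code):

```python
import numpy as np

# Toy system: 3 basis functions, atom A owns the first two, atom B the last.
P = np.array([[1.2, 0.3, 0.1],   # one-electron density matrix
              [0.3, 0.8, 0.2],
              [0.1, 0.2, 1.0]])
S = np.array([[1.0, 0.4, 0.1],   # basis-set overlap matrix
              [0.4, 1.0, 0.3],
              [0.1, 0.3, 1.0]])

atom_slices = {"A": slice(0, 2), "B": slice(2, 3)}

# Diagonal blocks P[A, A] would embed atomic nodes; off-diagonal blocks
# P[A, B] would embed "link" nodes for atomic pairs.
PS = P * S                        # element-wise product, as in the pooling
populations = {atom: PS[sl, :].sum() for atom, sl in atom_slices.items()}

# Summing the pooled values over all atoms recovers the total electron
# count tr(P @ S), which is why overlap-weighted pooling gives access to
# the actual electronic charges.
print(np.isclose(sum(populations.values()), np.trace(P @ S)))  # True
```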
For further details see the corresponding paper and preprint.
For inference, only node embeddings, edge embeddings, adjacency, and similar matrices in TF tensor format are needed; see the Inference section.
For training, these should be further converted into TF-GNN GraphTensor format.
To automatically convert ORCA outputs into GraphTensors and save them
in .tfrecord format in large batches, the repository includes the
do_tfrecords.py script, which can be invoked as follows:
```shell
python3 do_tfrecords.py --save_path data/tfrecords --record_name train \
    --schema_path schema_template --data_csv data/train_data.csv \
    --orca_outs data/temp --rot_aug 50 --overlap_thresh 0.035 \
    --multi_target --scalings_csv scalings.csv --gepol_path <gepol_path>
```

`--save_path data/tfrecords` The general path where the tfrecords will be stored.
`--record_name train` The dataset name; the actual path to the saved data is `<save_path>/<record_name>`.

`--schema_path schema_template` The path to a graph schema template.

`--data_csv data/train_data.csv` The path to a .csv file specifying the list of compounds to be included in the dataset, in the following format:
```
"cas","isomer_id","target","nconf"
"153-61-7","153-61-7",2.83,10
"153-61-7","153-61-7_reflect",2.83,11
"90-15-3","90-15-3",4.02,2
```
In this example, two records will be saved, containing graphs corresponding to all (rotated) conformers and isomers of Cephalothin (CASRN 153-61-7) and to all (rotated) conformers of 1-Naphthol (CASRN 90-15-3), as 153-61-7.tfrecord and 90-15-3.tfrecord respectively (records are grouped by the `cas` column). The `target` column contains the true values to be included in the saved records.
`--orca_outs data/temp` The path to the results of the quantum-chemical calculations. The filename format is expected to be `<isomer_id>_<conf_id>_<rot_id>.out` or `<isomer_id>_<conf_id>_<rot_id>.zip` (containing `<isomer_id>_<conf_id>_<rot_id>.out`).

`--rot_aug 50` The number of available randomly rotated electron-density distributions per conformer.

`--overlap_thresh 0.035` The graph-edge pruning threshold.
`--multi_target` Whether to include multiple targets in the records, if a model is intended to predict several properties. If set, the list of targets is read from the schema template, and the `--data_csv` file should have the following format:
```
"cas","isomer_id","target","melt","logk","nconf"
"153-61-7","153-61-7",2.83,1.600,-0.410,10
"153-61-7","153-61-7_reflect",2.83,1.600,-0.410,11
"90-15-3","90-15-3",4.02,0.957,2.850,2
```
Here, the `melt` and `logk` columns contain the true values for the additional targets.
`--scalings_csv scalings.csv` The path to a .csv file specifying scaling factors to be applied to the target values, in the following format:
```
"target","divisor"
"melt",100.0
"logk",1.0
"dipole",10.0
"vol",100.0
"surf",100.0
```
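The intended effect of these factors is a simple per-target division; a minimal stdlib sketch of that behavior (illustrative only, see `do_tfrecords.py` for the actual implementation):

```python
import csv
import io

# Toy stand-in for the --scalings_csv file.
scalings_csv = '''"target","divisor"
"melt",100.0
"logk",1.0
'''
divisors = {row["target"]: float(row["divisor"])
            for row in csv.DictReader(io.StringIO(scalings_csv))}

raw = {"melt": 1.600, "logk": -0.410}          # values from the data .csv
scaled = {k: v / divisors[k] for k, v in raw.items()}
# "melt" is scaled to ~0.016 while "logk" is left unchanged.
print(scaled)
```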
`--gepol_path <gepol_path>` The path to the GEPOL93 executables for molecular volume and surface calculation. Molecular volume and surface may be used as additional targets for regularization purposes.
Notes:
- `do_tfrecords.py` doesn't perform quantum-chemical calculations or rotational augmentation; these should be done in advance, e.g. using the `do_eldens.py` script.
- Aside from molecular volume and surface, the use of the electric dipole moment amplitude as an additional target is implemented. The necessary values are automatically extracted from the calculation results if the schema template contains the `dipole` key.
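Edge pruning by `--overlap_thresh` can be pictured as thresholding absolute overlap values: only atomic pairs whose overlap exceeds the threshold become graph edges. A toy NumPy sketch (the repository's actual pruning logic may differ):

```python
import numpy as np

# Toy 3x3 basis-set overlap matrix.
S = np.array([[1.00, 0.40, 0.02],
              [0.40, 1.00, 0.30],
              [0.02, 0.30, 1.00]])
thresh = 0.035   # as in --overlap_thresh 0.035

# Keep only off-diagonal pairs whose absolute overlap exceeds the threshold.
i, j = np.triu_indices_from(S, k=1)
keep = np.abs(S[i, j]) > thresh
edges = [(int(a), int(b)) for a, b in zip(i[keep], j[keep])]
print(edges)  # [(0, 1), (1, 2)] -- the (0, 2) pair is pruned
```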
Tabular data are used only to prepare the necessary quantum-chemical and graph data. Stored as .csv, they have the following format:
```
"cas","isomer_id","target","melt","logk","nconf"
"153-61-7","153-61-7",2.83,1.600,-0.410,10
"153-61-7","153-61-7_reflect",2.83,1.600,-0.410,11
"90-15-3","90-15-3",4.02,0.957,2.850,2
```
- `cas`: available compound IDs
- `isomer_id`: available isomer IDs
- `nconf`: number of available conformers per isomer
- `target`: true primary target values
- `melt`: true additional target values
- `logk`: true additional target values
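Since records are grouped by the `cas` column, each unique `cas` value maps to one .tfrecord file collecting all of that compound's isomers. A small stdlib sketch of that grouping (illustrative only):

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for the tabular .csv described above.
table = '''"cas","isomer_id","target","melt","logk","nconf"
"153-61-7","153-61-7",2.83,1.600,-0.410,10
"153-61-7","153-61-7_reflect",2.83,1.600,-0.410,11
"90-15-3","90-15-3",4.02,0.957,2.850,2
'''

records = defaultdict(list)
for row in csv.DictReader(io.StringIO(table)):
    records[row["cas"]].append(row["isomer_id"])

# One output file per compound, e.g. <save_path>/<record_name>/153-61-7.tfrecord
print(sorted(records))        # ['153-61-7', '90-15-3']
print(records["153-61-7"])    # ['153-61-7', '153-61-7_reflect']
```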
The `fit_utils` module provides several functions to train an RhNet2 model:
- `model_fit` fits a model (optionally with validation) and returns the trained model and loss history
- `ensemble_fit` sequentially fits several models and saves them along with the loss histories
- `kfold_val` performs K-fold cross-validation and saves only the loss histories
- `loo_val` performs Leave-One-Out cross-validation and saves the trained models along with the loss histories
The repository contains the example scripts `fit_example.py` and `loo_example.py`, showing the usage of the `model_fit` and `loo_val` functions.
For model details see the corresponding paper and preprint.
For inference, graphs and the TF-GNN module are not needed. The results of quantum-chemical calculations can be parsed directly into TF tensors suitable for inference. Inference examples along with trained models are provided in separate repositories:
- The Second Solubility Challenge (SC-2, link, link) solution on GitHub (a lighter model trained on a slightly bigger dataset, better suited to play with).
- The First Solubility Challenge (SC-1, link, link) solution on GitHub and on Hugging Face (an ensemble of heavier models trained on a tiny dataset, probably too heavy to play with).


