Description
Describe the bug
I am currently unable to reproduce the MPN model's top-k SMILES metric results from the MolPAL paper using the code in this GitHub repo.
Tables S1-S3 list the following percentages: 59.6 +/- 2.3 (10k library) and 68.8 +/- 1.0 (50k library). My runs repeatedly yield worse metrics: ~44 +/- 3.7 (10k library) and 64.8 +/- 1.7 (50k library). I haven't been able to run the HTS library yet due to excessive compute time. Regardless, the metrics I do have differ noticeably from the manuscript.
I have confirmed that the molpal library and my conda environment reproduce the reported RF and NN results, which makes me suspect that something in the current molpal MPN code is driving the issue.
Example(s)
I have created shell scripts that run molpal with the MPN model over five runs, as performed in the paper. The script for the 10k library is shown below; an analogous script launches the 50k-library runs.
rm -rf ./molpal_10k/ ./our_runs/
mkdir our_runs
for i in 1 2 3 4 5; do
    python run.py --config examples/config/Enamine10k_retrain.ini --metric greedy --init-size 0.01 --batch-sizes 0.01 --model mpn
    mv molpal_10k/ our_runs/run${i}/
done
I am parsing the results with the fraction-of-top-k-SMILES-identified metric. Below is the script I use (shown for the 10k library, where k = 100).
import pandas as pd
import numpy as np

def compare(df1, df2):
    # take the top 100 SMILES from the ground-truth scores (assumed sorted best-first)
    # and count how many of them appear among the explored SMILES
    s1, s2 = list(df1.smiles), list(df2.smiles)[:100]
    dups = list(set(s1) & set(s2))
    print('number of top scoring compounds found:', len(dups))
    print('total number of ligands explored:', len(s1), len(set(s1)))
    print('number of unique ligands recorded in top:', len(set(s2)))
    print()
    return len(dups)

# ground-truth scores for the full 10k library
dfm = pd.read_csv('./data/Enamine10k_scores.csv.gz', index_col=None, compression='gzip')
files = [f'./our_runs/run{i}/data/top_636_explored_iter_5.csv' for i in range(1, 6)]

num = []
for ff in files:
    df1 = pd.read_csv(ff, index_col=None)
    num += [compare(df1, dfm) / 100.]  # fraction of the top 100 recovered
num = np.array(num)
print("statistics:", np.mean(num), np.std(num))
Here are some example outputs I have obtained:
number of top scoring compounds found: 39
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100
number of top scoring compounds found: 45
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100
number of top scoring compounds found: 43
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100
number of top scoring compounds found: 50
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100
number of top scoring compounds found: 45
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100
statistics: 0.44400000000000006 0.035552777669262355
I see a similar drop in performance for the 50k library, although those results are closer to the stats reported in the paper.
number of top scoring compounds found: 334
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
number of top scoring compounds found: 323
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
number of top scoring compounds found: 325
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
number of top scoring compounds found: 309
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
number of top scoring compounds found: 328
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
statistics: 0.6476 0.016560193235587575
As mentioned, the code currently reproduces the RF and NN results; for example, here is the output from the 50k library when using the NN model.
number of top scoring compounds found: 347
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
number of top scoring compounds found: 341
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
number of top scoring compounds found: 346
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
number of top scoring compounds found: 361
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
number of top scoring compounds found: 352
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
statistics: 0.6988 0.01354104870384859
This matches the 70% reported in the paper almost exactly, whereas the MPN model is off by a statistically significant amount.
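To put a number on that, here is a minimal significance check using Welch's t-test on the summary statistics, assuming the paper's values and mine are both means +/- standard deviations over n = 5 runs (the stds below are rounded from the outputs above):

from scipy import stats

# paper (Tables S1-S3) vs. my runs: top-100 SMILES fraction (%), 10k library, MPN
t, p = stats.ttest_ind_from_stats(59.6, 2.3, 5, 44.4, 3.6, 5, equal_var=False)
print(f"10k MPN: t = {t:.2f}, p = {p:.4f}")

# same comparison for the 50k library (top-500 SMILES fraction)
t, p = stats.ttest_ind_from_stats(68.8, 1.0, 5, 64.8, 1.7, 5, equal_var=False)
print(f"50k MPN: t = {t:.2f}, p = {p:.4f}")

The 10k gap is roughly 15 points against a standard error on the difference of about 2 points, so it is far outside run-to-run noise; the 50k gap is smaller but still significant at this sample size.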
Expected behavior
My understanding of the molpal library is that the MPN runs should reproduce the statistics reported in Tables S1-S3 of the paper.
Environment
- python version: 3.8.13
- package versions:
Package Version
---------------------------- ---------
absl-py 1.1.0
aiohttp 3.8.1
aiohttp-cors 0.7.0
aiosignal 1.2.0
astunparse 1.6.3
async-timeout 4.0.2
attrs 21.4.0
blessed 1.19.1
cachetools 5.2.0
certifi 2022.6.15
charset-normalizer 2.0.12
click 8.0.4
colorama 0.4.4
colorful 0.5.4
ConfigArgParse 1.5.3
cycler 0.11.0
distlib 0.3.4
filelock 3.7.1
flatbuffers 1.12
fonttools 4.33.3
frozenlist 1.3.0
fsspec 2022.5.0
gast 0.4.0
google-api-core 2.8.1
google-auth 2.7.0
google-auth-oauthlib 0.4.6
google-pasta 0.2.0
googleapis-common-protos 1.56.2
gpustat 1.0.0b1
greenlet 1.1.2
grpcio 1.43.0
h5py 3.7.0
idna 3.3
importlib-metadata 4.11.4
importlib-resources 5.7.1
joblib 1.1.0
jsonschema 4.6.0
keras 2.9.0
Keras-Preprocessing 1.1.2
kiwisolver 1.4.2
libclang 14.0.1
Markdown 3.3.7
matplotlib 3.5.2
msgpack 1.0.4
multidict 6.0.2
munkres 1.1.4
numpy 1.22.4
nvidia-ml-py3 7.352.0
oauthlib 3.2.0
opencensus 0.9.0
opencensus-context 0.1.2
OpenMM 7.7.0
opt-einsum 3.3.0
packaging 21.3
pandas 1.4.2
pdbfixer 1.8.1
Pillow 9.1.1
pip 22.1.2
platformdirs 2.5.2
prometheus-client 0.13.1
protobuf 3.19.4
psutil 5.9.1
py-spy 0.3.12
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycairo 1.21.0
pyDeprecate 0.3.2
pyparsing 3.0.9
pyrsistent 0.18.1
pyscreener 1.2.2
python-dateutil 2.8.2
pytorch-lightning 1.6.4
pytz 2022.1
PyYAML 6.0
ray 1.13.0
reportlab 3.5.68
requests 2.28.0
requests-oauthlib 1.3.1
rsa 4.8
scikit-learn 1.1.1
scipy 1.8.1
setuptools 62.3.3
six 1.16.0
smart-open 6.0.0
SQLAlchemy 1.4.37
tabulate 0.8.9
tensorboard 2.9.1
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorboardX 2.5.1
tensorflow 2.9.1
tensorflow-addons 0.17.1
tensorflow-estimator 2.9.0
tensorflow-io-gcs-filesystem 0.26.0
termcolor 1.1.0
threadpoolctl 3.1.0
torch 1.11.0
torchmetrics 0.9.1
tqdm 4.64.0
typeguard 2.13.3
typing_extensions 4.2.0
unicodedata2 14.0.0
urllib3 1.26.9
virtualenv 20.14.1
wcwidth 0.2.5
Werkzeug 2.1.2
wheel 0.37.1
wrapt 1.14.1
yarl 1.7.2
zipp 3.8.0
- OS: Linux
Checklist
- all dependencies are satisfied: conda list shows the packages listed in the README
  I believe so.
- the unit tests are working: pytest -v reports no errors
  I do see some errors here, but they don't appear relevant. I can import rdkit without an issue.
ERROR collecting tests/test_acquirer.py
ImportError while importing test module '/EBS/MOLPAL_TESTS/molpal_ffn_50k/tests/test_acquirer.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/home/ubuntu/anaconda3/lib/python3.7/importlib/__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
tests/test_acquirer.py:8: in <module>
from molpal.acquirer import Acquirer
molpal/__init__.py:1: in <module>
from .explorer import Explorer
molpal/explorer.py:13: in <module>
from molpal import acquirer, featurizer, models, objectives, pools
molpal/featurizer.py:10: in <module>
import rdkit.Chem.rdMolDescriptors as rdmd
E ModuleNotFoundError: No module named 'rdkit'
ERROR collecting tests/test_featurizer.py
ImportError while importing test module '/EBS/MOLPAL_TESTS/molpal_ffn_50k/tests/test_featurizer.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/home/ubuntu/anaconda3/lib/python3.7/importlib/__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
tests/test_featurizer.py:6: in <module>
from rdkit import Chem
E ModuleNotFoundError: No module named 'rdkit'
ERROR collecting tests/test_lookup.py
ImportError while importing test module '/EBS/MOLPAL_TESTS/molpal_ffn_50k/tests/test_lookup.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/home/ubuntu/anaconda3/lib/python3.7/importlib/__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
tests/test_lookup.py:10: in <module>
from molpal.objectives.lookup import LookupObjective
molpal/__init__.py:1: in <module>
from .explorer import Explorer
molpal/explorer.py:13: in <module>
from molpal import acquirer, featurizer, models, objectives, pools
molpal/featurizer.py:10: in <module>
import rdkit.Chem.rdMolDescriptors as rdmd
E ModuleNotFoundError: No module named 'rdkit'
ERROR collecting tests/test_pool.py
ImportError while importing test module '/EBS/MOLPAL_TESTS/molpal_ffn_50k/tests/test_pool.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/home/ubuntu/anaconda3/lib/python3.7/importlib/__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
tests/test_pool.py:4: in <module>
from molpal.pools.base import MoleculePool
molpal/__init__.py:1: in <module>
from .explorer import Explorer
molpal/explorer.py:13: in <module>
from molpal import acquirer, featurizer, models, objectives, pools
molpal/featurizer.py:10: in <module>
import rdkit.Chem.rdMolDescriptors as rdmd
E ModuleNotFoundError: No module named 'rdkit'