[BUG]:  #39

@coparks2012

Description

Describe the bug
I am currently unable to reproduce the MPN model's top-k SMILES metric results from the MolPAL paper using the code in this GitHub repo.

Tables S1-S3 report the following percentages: 59.6 +/- 2.3 (10k library) and 68.8 +/- 1.0 (50k library). My runs repeatedly yield worse results: ~44.4 +/- 3.6 (10k library) and 64.8 +/- 1.7 (50k library). I haven't been able to run the HTS library yet due to the compute time it requires, but the 10k and 50k metrics already differ noticeably from the manuscript.

I have confirmed that the molpal library and my conda environment reproduce the reported RF and NN results, which makes me suspect that something in the current molpal MPN code is driving the issue.

Example(s)
I have created shell scripts that run molpal with the MPN model five times, as in the paper. The script for the 10k library is shown below; an analogous script launches the runs for the 50k library.

rm -r ./molpal_10k/
rm -r ./our_runs/
mkdir our_runs

for i in 1 2 3 4 5; do
    python run.py --config examples/config/Enamine10k_retrain.ini --metric greedy --init-size 0.01 --batch-sizes 0.01 --model mpn
    mv molpal_10k/ our_runs/run${i}/
done

I am computing the fraction of top-k SMILES identified. Below is the script I use to parse the 10k results (top 100 = 1% of the 10k library).

import pandas as pd
import numpy as np

def compare(df_run, df_lib):
    """Count how many of the library's top-100 SMILES were found by a run."""
    explored = set(df_run.smiles)
    top_k = set(df_lib.smiles[:100])  # the scores file is sorted best-first
    found = explored & top_k
    print('number of top scoring compounds found:', len(found))
    print('total number of ligands explored:', len(df_run.smiles), len(explored))
    print('number of unique ligands recorded in top:', len(top_k))
    print(' ')
    return len(found)

# full 10k library with its true docking scores
dfm = pd.read_csv('./data/Enamine10k_scores.csv.gz', index_col=None, compression='gzip')

# final explored sets from the five runs launched above
files = [f'./our_runs/run{i}/data/top_636_explored_iter_5.csv' for i in range(1, 6)]

num = []
for ff in files:
    df1 = pd.read_csv(ff, index_col=None)
    num += [compare(df1, dfm) / 100.]  # 100 = size of the top 1% of the 10k library

num = np.array(num)
print("statistics:", np.mean(num), np.std(num))

Here are the outputs I obtained for the five 10k runs:

number of top scoring compounds found: 39
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100
 
number of top scoring compounds found: 45
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100
 
number of top scoring compounds found: 43
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100
 
number of top scoring compounds found: 50
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100
 
number of top scoring compounds found: 45
total number of ligands explored: 636 636
number of unique ligands recorded in top: 100
 
statistics: 0.44400000000000006 0.035552777669262355
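
For the 50k library I run the same comparison with the top 500 SMILES (1% of 50k) instead of the top 100. A minimal parameterized sketch of that comparison is below; the 50k file names in the comment are placeholders inferred from the 10k naming, and it assumes, as above, that the scores file is sorted best-first.

import pandas as pd

def topk_fraction(run_csv, library_csv, k):
    """Fraction of the library's true top-k SMILES found in the explored set."""
    explored = set(pd.read_csv(run_csv).smiles)
    top_k = set(pd.read_csv(library_csv).smiles[:k])  # assumes best-first ordering
    return len(explored & top_k) / k

# hypothetical call for one 50k run (file names are placeholders):
# topk_fraction('./our_runs/run1/data/top_3018_explored_iter_5.csv',
#               './data/Enamine50k_scores.csv.gz', k=500)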

I see a similar drop in performance for the 50k library, although the results are closer to the stats reported in the paper.

number of top scoring compounds found: 334
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
number of top scoring compounds found: 323
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
number of top scoring compounds found: 325
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
number of top scoring compounds found: 309
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
number of top scoring compounds found: 328
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
statistics: 0.6476 0.016560193235587575

As mentioned, the code does reproduce the RF and NN results. For example, here is the output for the 50k library when using the NN model.

number of top scoring compounds found: 347
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
number of top scoring compounds found: 341
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
number of top scoring compounds found: 346
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
number of top scoring compounds found: 361
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
number of top scoring compounds found: 352
total number of ligands explored: 3018 3018
number of unique ligands recorded in top: 500
 
statistics: 0.6988 0.01354104870384859

This matches the 70% reported in the paper almost exactly, whereas the MPN model is off by a statistically significant amount.
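
To put a rough number on "statistically significant": if the paper's 59.6 +/- 2.3 and my 44.4 +/- 3.6 are both treated as mean +/- standard deviation over five runs (an assumption about what the reported error bars denote), a Welch two-sample t-test on the 10k numbers gives a p-value well below 0.05.

from scipy import stats

# Reported (Table S1) vs. observed top-k SMILES percentage for the 10k library,
# assuming both +/- values are standard deviations over n = 5 runs.
t, p = stats.ttest_ind_from_stats(mean1=59.6, std1=2.3, nobs1=5,
                                  mean2=44.4, std2=3.6, nobs2=5,
                                  equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")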

Expected behavior
My understanding of the MolPAL library is that the MPN runs should reproduce the statistics reported in Tables S1-S3 of the paper.

Environment

  • python version - python 3.8.13
  • package versions:
Package                      Version
---------------------------- ---------
absl-py                      1.1.0
aiohttp                      3.8.1
aiohttp-cors                 0.7.0
aiosignal                    1.2.0
astunparse                   1.6.3
async-timeout                4.0.2
attrs                        21.4.0
blessed                      1.19.1
cachetools                   5.2.0
certifi                      2022.6.15
charset-normalizer           2.0.12
click                        8.0.4
colorama                     0.4.4
colorful                     0.5.4
ConfigArgParse               1.5.3
cycler                       0.11.0
distlib                      0.3.4
filelock                     3.7.1
flatbuffers                  1.12
fonttools                    4.33.3
frozenlist                   1.3.0
fsspec                       2022.5.0
gast                         0.4.0
google-api-core              2.8.1
google-auth                  2.7.0
google-auth-oauthlib         0.4.6
google-pasta                 0.2.0
googleapis-common-protos     1.56.2
gpustat                      1.0.0b1
greenlet                     1.1.2
grpcio                       1.43.0
h5py                         3.7.0
idna                         3.3
importlib-metadata           4.11.4
importlib-resources          5.7.1
joblib                       1.1.0
jsonschema                   4.6.0
keras                        2.9.0
Keras-Preprocessing          1.1.2
kiwisolver                   1.4.2
libclang                     14.0.1
Markdown                     3.3.7
matplotlib                   3.5.2
msgpack                      1.0.4
multidict                    6.0.2
munkres                      1.1.4
numpy                        1.22.4
nvidia-ml-py3                7.352.0
oauthlib                     3.2.0
opencensus                   0.9.0
opencensus-context           0.1.2
OpenMM                       7.7.0
opt-einsum                   3.3.0
packaging                    21.3
pandas                       1.4.2
pdbfixer                     1.8.1
Pillow                       9.1.1
pip                          22.1.2
platformdirs                 2.5.2
prometheus-client            0.13.1
protobuf                     3.19.4
psutil                       5.9.1
py-spy                       0.3.12
pyasn1                       0.4.8
pyasn1-modules               0.2.8
pycairo                      1.21.0
pyDeprecate                  0.3.2
pyparsing                    3.0.9
pyrsistent                   0.18.1
pyscreener                   1.2.2
python-dateutil              2.8.2
pytorch-lightning            1.6.4
pytz                         2022.1
PyYAML                       6.0
ray                          1.13.0
reportlab                    3.5.68
requests                     2.28.0
requests-oauthlib            1.3.1
rsa                          4.8
scikit-learn                 1.1.1
scipy                        1.8.1
setuptools                   62.3.3
six                          1.16.0
smart-open                   6.0.0
SQLAlchemy                   1.4.37
tabulate                     0.8.9
tensorboard                  2.9.1
tensorboard-data-server      0.6.1
tensorboard-plugin-wit       1.8.1
tensorboardX                 2.5.1
tensorflow                   2.9.1
tensorflow-addons            0.17.1
tensorflow-estimator         2.9.0
tensorflow-io-gcs-filesystem 0.26.0
termcolor                    1.1.0
threadpoolctl                3.1.0
torch                        1.11.0
torchmetrics                 0.9.1
tqdm                         4.64.0
typeguard                    2.13.3
typing_extensions            4.2.0
unicodedata2                 14.0.0
urllib3                      1.26.9
virtualenv                   20.14.1
wcwidth                      0.2.5
Werkzeug                     2.1.2
wheel                        0.37.1
wrapt                        1.14.1
yarl                         1.7.2
zipp                         3.8.0
  • OS - linux

Checklist

  • all dependencies are satisfied: conda list shows the packages listed in the README
    I believe so.
  • the unit tests are working: pytest -v reports no errors
    I do see some collection errors here, but they don't appear relevant: the tracebacks below resolve to the base Anaconda install (Python 3.7) rather than my conda environment (Python 3.8.13), and I can import rdkit from the environment without an issue.
_____________________________________________________________________________________________________________ ERROR collecting tests/test_acquirer.py ______________________________________________________________________________________________________________
ImportError while importing test module '/EBS/MOLPAL_TESTS/molpal_ffn_50k/tests/test_acquirer.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/home/ubuntu/anaconda3/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_acquirer.py:8: in <module>
    from molpal.acquirer import Acquirer
molpal/__init__.py:1: in <module>
    from .explorer import Explorer
molpal/explorer.py:13: in <module>
    from molpal import acquirer, featurizer, models, objectives, pools
molpal/featurizer.py:10: in <module>
    import rdkit.Chem.rdMolDescriptors as rdmd
E   ModuleNotFoundError: No module named 'rdkit'
____________________________________________________________________________________________________________ ERROR collecting tests/test_featurizer.py _____________________________________________________________________________________________________________
ImportError while importing test module '/EBS/MOLPAL_TESTS/molpal_ffn_50k/tests/test_featurizer.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/home/ubuntu/anaconda3/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_featurizer.py:6: in <module>
    from rdkit import Chem
E   ModuleNotFoundError: No module named 'rdkit'
______________________________________________________________________________________________________________ ERROR collecting tests/test_lookup.py _______________________________________________________________________________________________________________
ImportError while importing test module '/EBS/MOLPAL_TESTS/molpal_ffn_50k/tests/test_lookup.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/home/ubuntu/anaconda3/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_lookup.py:10: in <module>
    from molpal.objectives.lookup import LookupObjective
molpal/__init__.py:1: in <module>
    from .explorer import Explorer
molpal/explorer.py:13: in <module>
    from molpal import acquirer, featurizer, models, objectives, pools
molpal/featurizer.py:10: in <module>
    import rdkit.Chem.rdMolDescriptors as rdmd
E   ModuleNotFoundError: No module named 'rdkit'
_______________________________________________________________________________________________________________ ERROR collecting tests/test_pool.py ________________________________________________________________________________________________________________
ImportError while importing test module '/EBS/MOLPAL_TESTS/molpal_ffn_50k/tests/test_pool.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/home/ubuntu/anaconda3/lib/python3.7/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_pool.py:4: in <module>
    from molpal.pools.base import MoleculePool
molpal/__init__.py:1: in <module>
    from .explorer import Explorer
molpal/explorer.py:13: in <module>
    from molpal import acquirer, featurizer, models, objectives, pools
molpal/featurizer.py:10: in <module>
    import rdkit.Chem.rdMolDescriptors as rdmd
E   ModuleNotFoundError: No module named 'rdkit'
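
Since the tracebacks point at /home/ubuntu/anaconda3/lib/python3.7/, the missing-rdkit errors are probably an artifact of which pytest gets picked up rather than a molpal problem. A couple of generic shell checks (not molpal-specific) to confirm:

which pytest                                   # likely resolves to the base Anaconda install
python -c "import sys; print(sys.executable)"  # interpreter of the active environment
python -m pytest -v                            # run pytest under that interpreter instead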
