Fast look-up of molecules.
This work is provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the work and assume any risks associated with your exercise of permissions under the License.
- Based on the ideas of: Medina & White (2023), Bloom filters for molecules, Journal of Cheminformatics 15:1.
- Data from SureChEMBL, CC Attribution-ShareAlike 3.0 Unported: Papadatos_et al._ (2016). SureChEMBL: a large-scale, chemically annotated patent document database, Nucleic Acids Research Database Issue, 44.
A Bloom filter is included in this repository, created from the quartery SureChEMBL dump (filter created in early Dec 2024).
With the Rust cargo build tool installed you can run the progam and paste in SMILES values:
cargo build --release ./target/release/molbloom -f model/surechembl_smiles_2024-12-05.bin query
CCC(=O)C1(CC1)C(=O)OC
true
O=C(C)Oc1ccccc1C(=O)O
false
Or you can give the program a file, one string per line.
To build the filter, asusming you have the data (see this blog post for an example), you can construct a new filter targetting a 1% false positive rate with:
/target/release/molbloom -f model/surechembl_smiles_2024-12-05.bin \
build --fpr 0.01 --num-items 23465171 < smiles.txt
Based on the example in Bloom & White, the FPR was calculated by building a filter on one half of the SureChEMBL SMILES data and testing on the other half. Each half roughly 11.7m molecules, using the SipHash-1-3 hasher:
