Protonation state prediction and microstate distribution for molecules at a given pH.
Given a SMILES string and pH value(s), this tool:
- Enumerates all possible protonation states
- Optionally enumerates tautomers
- Predicts free energy for each microstate using a neural network
- Calculates Boltzmann distribution at the given pH
- Returns the most probable protonated SMILES
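The Boltzmann step above can be illustrated with a small sketch. This is not the Protonify or Uni-pKa implementation. It assumes each microstate has a reference free energy expressed in pK units (free energy divided by RT ln 10) and a proton count, so that the pH-dependent energy is `g0 + n_protons * pH`. The function name `microstate_distribution` and the two-state acetic acid example (pKa 4.76) are illustrative only.

```python
def microstate_distribution(states, ph):
    """Return {label: probability} for microstates at a given pH.

    states: list of (label, g0, n_protons), where g0 is a reference free
    energy in pK units, i.e. G / (RT ln 10). This unit convention is an
    assumption of the sketch, not Protonify's internal representation.
    """
    # Each bound proton costs `ph` pK units, so protonated states are
    # penalized as the pH rises.
    energies = [g0 + n_protons * ph for _, g0, n_protons in states]
    # Subtract the minimum before exponentiating for numerical stability.
    e_min = min(energies)
    weights = [10.0 ** -(e - e_min) for e in energies]
    total = sum(weights)
    return {label: w / total for (label, _, _), w in zip(states, weights)}


# Illustrative two-state acid with pKa 4.76 (acetic acid).
PKA = 4.76
acetic = [
    ("CC(=O)O", -PKA, 1),     # protonated form, one titratable proton
    ("CC(=O)[O-]", 0.0, 0),   # deprotonated form, used as the reference
]

for ph in (2.0, 4.76, 7.4):
    dist = microstate_distribution(acetic, ph)
    best = max(dist, key=dist.get)
    print(f"pH {ph}: {best} (p={dist[best]:.3f})")
```

With this convention the two forms are equally populated when the pH equals the pKa, which is a quick sanity check for any implementation of the same idea.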
# Clone the repository
git clone https://github.com/pauling-ai/Protonify.git
cd Protonify
# Build the wheel
pip install build
python -m build --wheel
# Install the package
pip install dist/protonify-0.1.0-py3-none-any.whl
Note: Editable installs (pip install -e .) require setuptools with PEP 660 support (setuptools >= 64).
The model weights (~500MB) must be downloaded separately:
# Using the provided script
chmod +x download_model.sh
./download_model.sh
# Or manually download from GitHub Releases and place in protonify/models/
If you prefer to use Docker, follow these steps:
# 1. Clone the repository
git clone https://github.com/pauling-ai/Protonify.git
cd Protonify
# 2. Download the model weights
chmod +x download_model.sh
./download_model.sh
# 3. Build the Docker image
chmod +x build_docker.sh
./build_docker.sh
# Or directly: docker build -t protonify:latest .
# 4. Run predictions
docker run --rm protonify --smiles "CCO" --ph 7.4 --template smart
# Multiple pH values
docker run --rm protonify --smiles "NCC(=O)O" --ph "2.0,7.4,10.0" --template smart
# Quiet mode (only SMILES output)
docker run --rm protonify --smiles "CC(=O)O" --ph 7.4 --template smart -q
# Verify installation
protonify --help
# Basic prediction
protonify --smiles "CCO" --ph 7.4 --template smart
# Basic usage (model auto-downloads on first run)
protonify --smiles "CCO" --ph 7.4 --template smart
# Force CPU (GPU is used by default when available)
protonify --smiles "CCO" --ph 7.4 --template smart --cpu
# Multiple pH values
protonify --smiles "CCO" --ph "7.0,7.4,8.0" --template smart
# Enable tautomer enumeration (slower but more thorough)
protonify --smiles "CCO" --ph 7.4 --template smart --enumerate-tautomers
# Use custom model
protonify --smiles "CCO" --ph 7.4 --template smart --model-path /path/to/model.pt
# Quiet mode (only output SMILES, useful for pipelines)
protonify --smiles "CCO" --ph 7.4 --template smart -q
| Argument | Required | Description |
|---|---|---|
| --smiles | Yes | SMILES string of the molecule to analyze |
| --ph | Yes | pH value(s), single or comma-separated (e.g., "7.4" or "7.0,7.4,8.0") |
| --template | Yes | Template type: simple or smart |
| --model-path | No | Path to custom model (auto-downloads default model if not specified) |
| --enumerate-tautomers | No | Enable tautomer enumeration (disabled by default for speed) |
| --cpu | No | Force CPU inference (GPU is used by default when available) |
| --quiet, -q | No | Suppress verbose output, only print final SMILES (useful for pipelines) |
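When calling the CLI from a script, the flags in the table can be assembled programmatically. The sketch below only builds the argument list; `build_protonify_command` is a hypothetical helper, not part of Protonify, and it uses only the flags documented above. The exact layout of the quiet-mode output for several pH values is not specified here, so the sketch does not try to parse it.

```python
import shlex


def build_protonify_command(smiles, ph_values, template="smart",
                            enumerate_tautomers=False, cpu=False, quiet=True):
    """Build an argv list for the protonify CLI from the documented flags."""
    if template not in ("simple", "smart"):
        raise ValueError("template must be 'simple' or 'smart'")
    # --ph accepts a single value or a comma-separated list.
    ph_arg = ",".join(str(ph) for ph in ph_values)
    cmd = ["protonify", "--smiles", smiles, "--ph", ph_arg,
           "--template", template]
    if enumerate_tautomers:
        cmd.append("--enumerate-tautomers")
    if cpu:
        cmd.append("--cpu")
    if quiet:
        cmd.append("-q")
    return cmd


cmd = build_protonify_command("NCC(=O)O", [2.0, 7.4, 10.0])
print(shlex.join(cmd))  # shell-quoted form, handy for logging

# To actually run it (requires protonify on PATH):
#   import subprocess
#   out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Passing the command as a list to `subprocess.run` avoids shell-quoting problems with SMILES characters such as parentheses and brackets.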
For each pH value, the tool outputs:
- Most probable microstate: The SMILES of the most likely protonation state
- Charge: The formal charge of that microstate
- Probability: The fraction/probability of that microstate
- Full distribution: All microstates with their probabilities
from protonify import predict_protonation
# One-line prediction (model auto-downloads on first use)
result = predict_protonation("CCO", ph=7.4)
print(result["smiles"]) # Most probable protonated SMILES
print(result["charge"]) # Formal charge
print(result["probability"]) # Probability
# Multiple pH values
results = predict_protonation("CCO", ph=[7.0, 7.4, 8.0])
for r in results:
    print(f"pH {r['ph']}: {r['smiles']} (charge={r['charge']})")
# Use simple template (faster) or smart template (default, more accurate)
result = predict_protonation("CCO", ph=7.4, template="simple")
result = predict_protonation("CCO", ph=7.4, template="smart")
# With tautomer enumeration (slower, more thorough)
result = predict_protonation("CCO", ph=7.4, skip_tautomers=False)
# Use custom model
result = predict_protonation("CCO", ph=7.4, model_path="/path/to/model.pt")
# Force CPU (GPU is used by default when available)
result = predict_protonation("CCO", ph=7.4, use_gpu=False)
# Verbose mode (show logging, disabled by default)
result = predict_protonation("CCO", ph=7.4, quiet=False)
To verify the Python API works correctly, run the test script:
python test_api.py
This runs several examples, including basic predictions, multiple pH values, template comparison, and tautomer enumeration.
The default model (~500MB) is loaded using the following priority:
- Explicit path: the --model-path argument or model_path= parameter
- Environment variable: PROTONIFY_MODEL_PATH=/path/to/model.pt
- Bundled model: if installed with pip install . and the model is in protonify/models/
- Cached model: previously downloaded to ~/.cache/protonify/
- Auto-download: downloads from GitHub Releases on first use
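The lookup order above can be sketched as a simple first-match search. This is an illustration of the documented priority, not Protonify's actual loader. `resolve_model_path` and its parameters are hypothetical, and the sketch assumes the bundled and cached files share the filename that appears in the troubleshooting notes. It also assumes that a missing explicit path falls through to the next candidate, whereas a real loader might raise an error instead.

```python
import os
from pathlib import Path

# Assumed filename, taken from the troubleshooting notes in this README.
MODEL_FILENAME = "t_dwar_v_novartis_a_b.pt"


def resolve_model_path(explicit_path=None, bundled_dir=None,
                       cache_dir=None, env=None):
    """Return the first existing model file in priority order, or None.

    None means the caller would fall through to the auto-download step.
    """
    env = os.environ if env is None else env
    cache_dir = (Path(cache_dir) if cache_dir
                 else Path.home() / ".cache" / "protonify")
    candidates = [
        explicit_path,                                   # 1. explicit path
        env.get("PROTONIFY_MODEL_PATH"),                 # 2. environment variable
        Path(bundled_dir) / MODEL_FILENAME if bundled_dir else None,  # 3. bundled
        cache_dir / MODEL_FILENAME,                      # 4. cached download
    ]
    for candidate in candidates:
        if candidate and Path(candidate).is_file():
            return Path(candidate)
    return None


# Prints the resolved path, or None if a download would be needed.
print(resolve_model_path(explicit_path="/path/to/model.pt"))
```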
| Method | Description |
|---|---|
| Bundled | Model included in package installation |
| Auto-download | Model downloads automatically on first use |
| Custom path | Use --model-path or model_path= argument |
| Environment variable | Set PROTONIFY_MODEL_PATH=/path/to/model.pt |
from protonify import download_model
download_model()  # Downloads to ~/.cache/protonify/
Alternatively, download the model weights manually from GitHub Releases and place them in ~/.cache/protonify/, or set PROTONIFY_MODEL_PATH.
By default, tautomer enumeration is disabled for speed. This significantly reduces computation time while still enumerating all protonation states.
For more thorough analysis (e.g., when accuracy is critical), enable tautomer enumeration:
# CLI
protonify --smiles "CCO" --ph 7.4 --template smart --enumerate-tautomers
# Python
result = predict_protonation("CCO", ph=7.4, skip_tautomers=False)
Dependencies:
- torch >= 2.0.0
- numpy >= 1.21
- pandas >= 1.3
- scipy >= 1.7
- rdkit >= 2023.0.0
If you get an error like Failed to download model: HTTP Error 404: Not Found, the automatic download failed. Solutions:
- Manual download: Download the model from GitHub Releases and place it in ~/.cache/protonify/
- Set environment variable: export PROTONIFY_MODEL_PATH="/path/to/t_dwar_v_novartis_a_b.pt"
If the CLI works but the Python API fails with a model error, you likely have multiple Python installations with protonify installed in different locations.
Symptoms:
- protonify --smiles "CCO" --ph 7.4 --template smart works
- python -c "from protonify import predict_protonation; predict_protonation('CCO', ph=7.4)" fails with a model-not-found error
Cause: The CLI uses one Python installation (with the bundled model) while your script uses another (without the model).
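To confirm this diagnosis, it can help to check which interpreter a script is running under and where that interpreter finds the package. The sketch below uses only the standard library; `describe_install` is a hypothetical helper, not part of Protonify. Run it with the same `python` that fails, then compare the reported interpreter with the one the CLI uses.

```python
import importlib.util
import sys


def describe_install(package_name="protonify"):
    """Report the running interpreter and where it finds a package."""
    spec = importlib.util.find_spec(package_name)
    return {
        "interpreter": sys.executable,
        "found": spec is not None,
        # origin is the path of the package's __init__.py when available
        "location": spec.origin if spec is not None else None,
    }


info = describe_install()
print(f"Interpreter: {info['interpreter']}")
print(f"protonify found: {info['found']}")
print(f"Location: {info['location']}")
```

If the reported location sits under a different environment than the one that provides the `protonify` command, the two installations are separate and only one of them has access to the bundled model.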
Solution: Set the environment variable pointing to the existing model:
# Find where the model is installed
find ~/.local /usr -name "t_dwar_v_novartis_a_b.pt" 2>/dev/null
# Set the environment variable (add to ~/.bashrc for persistence)
export PROTONIFY_MODEL_PATH="/path/found/above/t_dwar_v_novartis_a_b.pt"
Alternative: Install protonify in your active Python environment:
# Make sure you're in the correct environment
which python # Verify this is your intended Python
# Reinstall protonify
pip uninstall protonify
pip install protonify
Run this to check if everything is working:
from protonify import predict_protonation
result = predict_protonation("CCO", ph=7.4)
print(f"Success! Result: {result['smiles']}")
Protonify is a wrapper/interface built on top of the original Uni-pKa project by DP Technology Corp.
Key contribution: Uni-pKa plots the distribution of microstates as a function of pH, but does not offer a direct interface to obtain the most probable SMILES at a specific pH. Protonify adds this functionality: an API and CLI that directly return the most probable protonation state, for integration into automated pipelines.
The core prediction model and methodology are from Uni-pKa. This project adds:
- pH-to-SMILES interface: Input pH, get the most probable protonated SMILES
- Simplified CLI interface
- Python API for easy integration
- Automatic model downloading
Developed by: Pablo Villanueva Cuñado (@PabloPauling) at Pauling AI
This project is based entirely on Uni-pKa, developed by DP Technology Corp. All credit for the scientific methodology and neural network architecture goes to the original authors.
- Uni-pKa: Neural network-based pKa prediction using Uni-Mol architecture
- DP Technology Corp: For developing and open-sourcing the foundational model and methodology
- Uni-Mol: The underlying molecular representation framework
We are grateful to DP Technology for making their work open-source.
If you use this software in your research, please cite both Protonify and the original Uni-pKa project:
- Uni-pKa repository: https://github.com/dptech-corp/Uni-pKa
@software{protonify2025,
author = {Villanueva Cuñado, Pablo},
title = {Protonify: Protonation State Prediction for Molecules},
year = {2025},
organization = {Pauling AI},
url = {https://github.com/pauling-ai/Protonify}
}
@article{unipka2024,
title={Bridging Machine Learning and Thermodynamics for Accurate pKa Prediction},
author={Zhou, Gengmo and others},
journal={JACS Au},
year={2024},
publisher={American Chemical Society},
url={https://github.com/dptech-corp/Uni-pKa}
}
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
This is an independent open-source project. It is not affiliated with, associated with, sponsored by, or endorsed by Protonify Corporation or any other entity with a similar name. Any similarity in naming is purely coincidental and does not imply any commercial relationship.