pyProCT is a clustering framework designed to analyze large ensembles of protein conformations, with a strong focus on protein-protein docking and structural similarity clustering.
This repository is a Python 3 compatible fork of the original pyProCT project, preserving its original philosophy while updating the codebase to work with modern Python, NumPy, SciPy, and Cython.
pyProCT is a modular framework that:
- Computes distance matrices between structures (typically L-RMSD)
- Applies multiple clustering algorithms
- Evaluates cluster quality using different metrics
- Selects the best clustering automatically
- Generates postprocessing outputs (clusters, representatives, statistics)
Originally designed for docking decoy analysis, it is still very well suited for that purpose.
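The stages above can be sketched with plain NumPy/SciPy. This is an illustration of the workflow only, not pyProCT's actual classes: the toy `conformations` array and the hierarchical step stand in for real structures and pyProCT's algorithm roster.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Toy "ensemble": 6 conformations described by 3 coordinates each.
# In real use the distance would be L-RMSD between full structures.
conformations = np.array([
    [0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.2, 0.1, 0.0],   # tight group
    [5.0, 5.0, 5.0], [5.1, 5.0, 5.0],                    # second group
    [9.0, 0.0, 9.0],                                     # outlier
])

# 1. Compute a distance matrix (condensed form, as SciPy uses)
condensed = pdist(conformations)

# 2. Apply a clustering algorithm (here: average-linkage hierarchical)
labels = fcluster(linkage(condensed, method="average"), t=2.0, criterion="distance")

# 3. Evaluate / summarize the clustering (here: just cluster sizes)
ids, sizes = np.unique(labels, return_counts=True)
print(dict(zip(ids.tolist(), sizes.tolist())))
```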
Key differences:

- Python 3.9+ compatible
- pyRMSD is no longer a mandatory dependency
- Scheduler, analysis pipeline and postprocessing loader were fixed
- Cython extensions were updated and recompiled:
  - DBSCAN
  - Spectral clustering
- NumPy deprecations fixed: `np.float` → `float` / `np.float64`, `np.int` → `int` / `np.int64`
- SciPy API updated: `eigvals` → `subset_by_index`
- JSON schemas slightly clarified (parameter names matter)
The goal of this fork is functionality and reproducibility, not feature expansion.
- Python โฅ 3.9
- NumPy
- SciPy
- Cython
- matplotlib (optional, for plots)
```
conda create -n pyproct python=3.10
conda activate pyproct
pip install psutil numpy scipy cython matplotlib
```

Clone the repository:

```
git clone https://github.com/<your-user>/pyproct-python3.git
cd pyproct-python3
```

Build the Cython extensions in place:

```
python pyproct/clustering/algorithms/dbscan/cython/setup.py build_ext --inplace
python pyproct/clustering/algorithms/spectral/cython/setup.py build_ext --inplace
```

Verify:

```
python -c "import pyproct.clustering.algorithms.dbscan.cython.cythonDbscanTools"
python -c "import pyproct.clustering.algorithms.spectral.cython.spectralTools"
```

Install the package in editable mode:

```
pip install -e .
```

Run pyProCT with a configuration file:

```
python -m pyproct.main config.json
```

Where `config.json` defines:
- input structures
- clustering algorithms
- evaluation criteria
- postprocessing actions
pyProCT typically works with condensed distance matrices (as in SciPy).
In docking applications, distances usually represent L-RMSD (Å).
Typical observed ranges:

- min ≈ 0.7 Å
- median ≈ 50 Å
- p95 ≈ 80 Å
- max ≈ 85 Å
This scale is important when choosing clustering parameters.
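A minimal illustration of the condensed format, using an invented 4×4 RMSD matrix:

```python
import numpy as np
from scipy.spatial.distance import squareform

# A 4x4 symmetric RMSD matrix (Å) between four docking poses.
square = np.array([
    [0.0,  1.2, 48.0, 80.0],
    [1.2,  0.0, 47.5, 79.0],
    [48.0, 47.5, 0.0, 85.0],
    [80.0, 79.0, 85.0, 0.0],
])

# The condensed form keeps only the upper triangle, row by row:
# pairs (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
condensed = squareform(square)
print(condensed)        # N*(N-1)/2 = 6 entries for N = 4
```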
| Algorithm | Status | Notes |
|---|---|---|
| gromos | Stable | Recommended for docking |
| dbscan | Stable | Parameter sensitive |
| kmedoids | Stable | Requires K |
| hierarchical | Stable | Cutoff critical |
| spectral | Stable | Computationally expensive |
| random | Baseline | For comparison only |
"gromos": {
"parameters": [
{ "cutoff": 4.0 },
{ "cutoff": 6.0 },
{ "cutoff": 8.0 }
]
}-
cutoff= maximum RMSD (ร ) to consider two structures neighbors -
Typical values:
- 2โ4 ร : very strict
- 6โ8 ร : flexible docking
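For intuition, here is a compact sketch of the GROMOS (Daura-style) scheme on a toy matrix. pyProCT's own implementation differs in details; this only illustrates how `cutoff` drives cluster formation.

```python
import numpy as np

def gromos_cluster(dist, cutoff):
    """Daura/GROMOS-style clustering on a full NxN distance matrix (Å).

    Repeatedly picks the structure with the most neighbors within
    `cutoff`, forms a cluster from it and its neighbors, removes them,
    and repeats until no structures remain.
    """
    remaining = set(range(dist.shape[0]))
    clusters = []
    while remaining:
        idx = sorted(remaining)
        # Neighbor counts restricted to the remaining structures
        sub = dist[np.ix_(idx, idx)]
        counts = (sub <= cutoff).sum(axis=1)   # includes self
        center = idx[int(np.argmax(counts))]
        members = [j for j in idx if dist[center, j] <= cutoff]
        clusters.append((center, members))
        remaining -= set(members)
    return clusters

# Toy 5x5 RMSD matrix: poses 0-2 are close, 3-4 are close, groups far apart.
d = np.array([
    [0,  2,  3,  50, 52],
    [2,  0,  2,  51, 53],
    [3,  2,  0,  49, 50],
    [50, 51, 49, 0,  4],
    [52, 53, 50, 4,  0],
], dtype=float)

print(gromos_cluster(d, cutoff=4.0))  # two clusters: {0,1,2} and {3,4}
```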
"dbscan": {
"parameters": [
{ "eps": 10.0, "minpts": 2 },
{ "eps": 15.0, "minpts": 2 },
{ "eps": 20.0, "minpts": 2 }
]
}Interpretation (important):
eps= maximum RMSD distance (ร )minpts= minimum number of neighbors to form a cluster
If eps is too small โ 0 clusters
If eps is large โ fewer, larger clusters
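The `eps` effect can be demonstrated with a minimal DBSCAN re-implementation on a precomputed distance matrix. This is for illustration only; pyProCT ships its own Cython version.

```python
import numpy as np

def dbscan_labels(dist, eps, minpts):
    """Minimal DBSCAN on a precomputed NxN distance matrix.

    Returns one label per structure; -1 marks noise. `minpts` counts
    the point itself, since dist[i, i] == 0.
    """
    n = dist.shape[0]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        neighbors = np.flatnonzero(dist[i] <= eps)
        if len(neighbors) < minpts:
            continue  # noise (may still be absorbed by a later cluster)
        labels[i] = cluster
        queue = list(neighbors)
        while queue:             # expand the cluster from core points
            j = queue.pop()
            if labels[j] != -1:
                continue
            labels[j] = cluster
            j_neigh = np.flatnonzero(dist[j] <= eps)
            if len(j_neigh) >= minpts:
                queue.extend(j_neigh)
        cluster += 1
    return labels

# Toy RMSD matrix: poses 0-2 within ~9 Å of each other, pose 3 far away.
d = np.array([
    [0,  8,  9,  60],
    [8,  0,  7,  61],
    [9,  7,  0,  59],
    [60, 61, 59, 0],
], dtype=float)

print(dbscan_labels(d, eps=5.0, minpts=2))   # eps too small: all noise
print(dbscan_labels(d, eps=10.0, minpts=2))  # one cluster {0,1,2}, pose 3 noise
```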
"kmedoids": {
"parameters": [
{ "k": 5 },
{ "k": 10 },
{ "k": 20 }
]
}- Requires knowing approximately how many clusters you expect
- Very stable algorithm
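The key idea behind k-medoids, picking an actual structure as each cluster's center, can be sketched as follows. `medoid` here is a hypothetical helper for illustration, not pyProCT API; the same idea underlies representative selection in postprocessing.

```python
import numpy as np

def medoid(dist, members):
    """Return the cluster member with the smallest total distance to
    all other members (the medoid), given a full distance matrix."""
    members = np.asarray(members)
    sub = dist[np.ix_(members, members)]
    return int(members[np.argmin(sub.sum(axis=1))])

# Toy matrix: structure 1 sits "between" 0 and 2.
d = np.array([
    [0, 2, 9],
    [2, 0, 8],
    [9, 8, 0],
], dtype=float)

print(medoid(d, [0, 1, 2]))  # structure 1 (distance sums: 11, 10, 17)
```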
"hierarchical": {
"parameters": [
{ "method": "average", "cutoff": 6.0 },
{ "method": "average", "cutoff": 8.0 },
{ "method": "average", "cutoff": 10.0 }
]
}Notes:
averageis usually better thancompletefor RMSD- Very sensitive to
cutoff - Can generate many singletons if cutoff is small
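The cutoff sensitivity is easy to preview with SciPy's hierarchical tools on a toy condensed matrix: a small cutoff splits off singletons, a large one merges groups.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# RMSD matrix (Å) for four poses: two near-duplicates plus two
# structures that are far from everything (candidate singletons).
square = np.array([
    [0.0,  1.5, 30.0, 60.0],
    [1.5,  0.0, 31.0, 61.0],
    [30.0, 31.0, 0.0, 62.0],
    [60.0, 61.0, 62.0, 0.0],
])
condensed = squareform(square)

Z = linkage(condensed, method="average")
for cutoff in (6.0, 40.0):
    labels = fcluster(Z, t=cutoff, criterion="distance")
    print(cutoff, labels)  # small cutoff -> more clusters/singletons
```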
"spectral": {
"parameters": [
{ "max_clusters": 10 },
{ "max_clusters": 20 }
],
"force_sparse": false
}- More expensive than other methods
- Useful for non-convex cluster shapes
- Requires well-scaled distance matrices
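Why scaling matters: spectral methods operate on similarities, not distances, so distances are first converted, for example with a Gaussian kernel. The kernel width must match the RMSD scale (tens of Å for docking), otherwise the affinity matrix degenerates. The `sigma` parameter below is an illustrative assumption, not a pyProCT setting.

```python
import numpy as np

# Distances (Å): poses 0 and 1 are similar, pose 2 is far away.
d = np.array([
    [0.0,  5.0, 70.0],
    [5.0,  0.0, 72.0],
    [70.0, 72.0, 0.0],
])

# Gaussian affinity: W_ij = exp(-d_ij^2 / (2 sigma^2)).
# sigma = 1 Å is far below the data scale -> all off-diagonals ~ 0;
# sigma = 20 Å preserves the contrast between near and far poses.
for sigma in (1.0, 20.0):
    W = np.exp(-d**2 / (2 * sigma**2))
    print(sigma, np.round(W, 3))
```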
"random": {
"parameters": [
{ "num_of_clusters": 2 },
{ "num_of_clusters": 5 }
]
}- Not a real clustering algorithm
- Useful as a baseline for evaluation metrics
For docking applications, Silhouette and Cohesion are the most informative.
Example:
"evaluation": {
"evaluation_criteria": {
"criteria_0": {
"Silhouette": {
"action": ">",
"weight": 1
}
}
},
"maximum_noise": 30,
"minimum_cluster_size": 1
}Notes:
- Silhouette can be
NaNfor 1-cluster solutions (this is expected) - Some algorithms may generate valid clusterings that are later rejected by evaluation filters
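For intuition, Silhouette can be computed directly from a distance matrix. This is an illustrative re-implementation, not pyProCT's evaluation code; it assumes at least two clusters (with one cluster the score is undefined, matching the `NaN` note above).

```python
import numpy as np

def silhouette(dist, labels):
    """Mean silhouette score from a full NxN distance matrix.

    For each point: a = mean distance to its own cluster,
    b = mean distance to the nearest other cluster,
    score = (b - a) / max(a, b). Singletons score 0 by convention.
    """
    labels = np.asarray(labels)
    scores = []
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():            # singleton cluster
            scores.append(0.0)
            continue
        a = dist[i, same].mean()
        b = min(dist[i, labels == other].mean()
                for other in set(labels.tolist()) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well-separated clusters -> score close to 1.
d = np.array([
    [0,  2,  40, 41],
    [2,  0,  41, 42],
    [40, 41, 0,  3],
    [41, 42, 3,  0],
], dtype=float)

print(silhouette(d, [0, 0, 1, 1]))
```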
Valid postprocessing actions:
| KEYWORD | Description |
|---|---|
| representatives | representative structures |
| clusters | PDB files per cluster |
| cluster_stats | per-cluster statistics |
| rmsf | RMSF per cluster |
| centers_and_trace | cluster centers and trajectories |
| compression | redundancy elimination |
`pdb_clusters` was replaced by `clusters`.
- DBSCAN may legitimately return zero clusters for some parameters
- Hierarchical clustering can generate many singletons
- Spectral clustering is sensitive to matrix scaling
- Random clustering is not meaningful scientifically
- Not all "Improductive clustering search" messages indicate a bug
- Start with GROMOS
- Add DBSCAN with increasing `eps`
- Use Silhouette as the main selection criterion
- Inspect cluster representatives visually
- Use hierarchical only for exploratory analysis
Original pyProCT paper:
If you plan to use pyProCT or any of its parts, including its documentation, to write a scientific article, please consider adding the following citation:

J. Chem. Theory Comput., 2014, 10 (8), pp 3236–3243
This fork provides Python 3 compatibility and maintenance fixes, but does not change the scientific methodology.