
🔬 pyProCT (Python 3 fork) brings the original pyProCT clustering toolkit back to life under modern Python. ⚙️ This version fixes Python 2 legacy issues, updates Cython extensions, adapts to recent NumPy/SciPy APIs, and ensures all clustering algorithms and postprocessing actions work end-to-end.


pyProCT (Python 3 fork)

pyProCT is a clustering framework designed to analyze large ensembles of protein conformations, with a strong focus on proteinโ€“protein docking and structural similarity clustering.

This repository is a Python 3 compatible fork of the original pyProCT project, preserving its original philosophy while updating the codebase to work with modern Python, NumPy, SciPy, and Cython.


1. What is pyProCT?

pyProCT is a modular framework that:

  • Computes distance matrices between structures (typically L-RMSD)
  • Applies multiple clustering algorithms
  • Evaluates cluster quality using different metrics
  • Selects the best clustering automatically
  • Generates postprocessing outputs (clusters, representatives, statistics)

Originally designed for docking decoy analysis, it is still very well suited for that purpose.


2. Important differences from the original pyProCT

โš ๏ธ This fork is not a drop-in replacement of the original repository.

Key differences:

  • ✅ Python 3.9+ compatible
  • ❌ pyRMSD is no longer a mandatory dependency
  • 🔧 Scheduler, analysis pipeline and postprocessing loader were fixed
  • 🧠 Cython extensions were updated and recompiled:
    • DBSCAN
    • Spectral clustering
  • 🧮 NumPy deprecations fixed:
    • np.float → float / np.float64
    • np.int → int / np.int64
  • 📐 SciPy API updated:
    • eigvals → subset_by_index
  • 📄 JSON schemas slightly clarified (parameter names matter)

The goal of this fork is functionality and reproducibility, not feature expansion.


3. Installation (Python 3)

Requirements

  • Python ≥ 3.9
  • NumPy
  • SciPy
  • Cython
  • psutil
  • matplotlib (optional, for plots)

Create a conda environment and install the dependencies:

```bash
conda create -n pyproct python=3.10
conda activate pyproct
pip install psutil numpy scipy cython matplotlib
```

Clone the repository:

```bash
git clone https://github.com/<your-user>/pyproct-python3.git
cd pyproct-python3
```

Compile Cython extensions

DBSCAN:

```bash
python pyproct/clustering/algorithms/dbscan/cython/setup.py build_ext --inplace
```

Spectral:

```bash
python pyproct/clustering/algorithms/spectral/cython/setup.py build_ext --inplace
```

Verify:

```bash
python -c "import pyproct.clustering.algorithms.dbscan.cython.cythonDbscanTools"
python -c "import pyproct.clustering.algorithms.spectral.cython.spectralTools"
```

Install pyProCT (pulling in its ProDy dependency) in editable mode:

```bash
pip install -e .
```

4. Quick start

```bash
python -m pyproct.main config.json
```

Where config.json defines:

  • input structures
  • clustering algorithms
  • evaluation criteria
  • postprocessing actions
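The fragments shown in the following sections all plug into this one config file. The layout below is a hypothetical sketch only: the section names and parameters should be checked against the example scripts shipped with the repository, which define the authoritative schema.

```json
{
  "global": { "control": { "scheduler_type": "Process/Parallel" } },
  "data": { "files": ["decoys.pdb"] },
  "clustering": {
    "generation": { "method": "generate" },
    "algorithms": { "gromos": { "parameters": [{ "cutoff": 6.0 }] } },
    "evaluation": { }
  },
  "postprocess": { "representatives": { } }
}
```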

5. Distance matrices

pyProCT typically works with condensed distance matrices (as in SciPy).

In docking applications, distances usually represent L-RMSD (Å).

Typical observed ranges:

  • min ≈ 0.7 Å
  • median ≈ 50 Å
  • p95 ≈ 80 Å
  • max ≈ 85 Å

This scale is important when choosing clustering parameters.
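As a small illustration of the condensed format, the hypothetical 4×4 RMSD matrix below collapses to the 6-element vector that SciPy functions expect:

```python
import numpy as np
from scipy.spatial.distance import squareform

# Hypothetical symmetric L-RMSD matrix (Å) for 4 structures.
rmsd = np.array([
    [0.0,  2.5, 50.0, 80.0],
    [2.5,  0.0, 48.0, 79.0],
    [50.0, 48.0, 0.0, 30.0],
    [80.0, 79.0, 30.0, 0.0],
])

# The condensed form stores only the upper triangle, row by row:
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3) -> n*(n-1)/2 = 6 entries.
condensed = squareform(rmsd, checks=True)
print(condensed)
```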


6. Clustering algorithms

✅ Supported and tested algorithms

| Algorithm    | Status      | Notes                     |
|--------------|-------------|---------------------------|
| gromos       | ✅ Stable   | Recommended for docking   |
| dbscan       | ✅ Stable   | Parameter sensitive       |
| kmedoids     | ✅ Stable   | Requires K                |
| hierarchical | ✅ Stable   | Cutoff critical           |
| spectral     | ✅ Stable   | Computationally expensive |
| random       | ⚠️ Baseline | For comparison only       |

6.1 GROMOS (recommended)

"gromos": {
  "parameters": [
    { "cutoff": 4.0 },
    { "cutoff": 6.0 },
    { "cutoff": 8.0 }
  ]
}
  • cutoff = maximum RMSD (ร…) to consider two structures neighbors

  • Typical values:

    • 2โ€“4 ร…: very strict
    • 6โ€“8 ร…: flexible docking
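The GROMOS (Daura-style) scheme can be sketched in a few lines. This is a simplified illustration, not pyProCT's implementation: repeatedly take the structure with the most neighbors within the cutoff as a cluster center, assign its neighbors to it, and remove them.

```python
import numpy as np

def gromos_clusters(dist, cutoff):
    """Daura-style leader clustering on a square distance matrix (sketch)."""
    remaining = list(range(dist.shape[0]))
    clusters = []
    while remaining:
        sub = dist[np.ix_(remaining, remaining)]
        # pick the structure with the most neighbors within the cutoff
        center_pos = int((sub <= cutoff).sum(axis=1).argmax())
        member_pos = np.flatnonzero(sub[center_pos] <= cutoff)
        members = [remaining[p] for p in member_pos]
        clusters.append((remaining[center_pos], members))
        remaining = [i for i in remaining if i not in members]
    return clusters

# Hypothetical toy matrix: two tight pairs ~50 Å apart.
d = np.array([[0.0,  2.5, 50.0, 52.0],
              [2.5,  0.0, 48.0, 50.0],
              [50.0, 48.0, 0.0,  3.0],
              [52.0, 50.0, 3.0,  0.0]])
print(gromos_clusters(d, cutoff=4.0))  # two clusters: {0,1} and {2,3}
```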

6.2 DBSCAN

"dbscan": {
  "parameters": [
    { "eps": 10.0, "minpts": 2 },
    { "eps": 15.0, "minpts": 2 },
    { "eps": 20.0, "minpts": 2 }
  ]
}

Interpretation (important):

  • eps = maximum RMSD distance (ร…)
  • minpts = minimum number of neighbors to form a cluster

If eps is too small โ†’ 0 clusters If eps is large โ†’ fewer, larger clusters
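This eps sensitivity is easy to check on a precomputed matrix, assuming scikit-learn is available (it is not a pyProCT dependency; note also that sklearn's min_samples counts the point itself, so its semantics may differ slightly from pyProCT's minpts):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy L-RMSD matrix: two tight pairs ~50 Å apart.
d = np.array([[0.0,  2.5, 50.0, 52.0],
              [2.5,  0.0, 48.0, 50.0],
              [50.0, 48.0, 0.0,  3.0],
              [52.0, 50.0, 3.0,  0.0]])

for eps in (1.0, 10.0):
    labels = DBSCAN(eps=eps, min_samples=2, metric="precomputed").fit_predict(d)
    n_clusters = len(set(labels) - {-1})  # -1 marks noise points
    print(f"eps={eps}: {n_clusters} clusters, labels={labels}")
# eps=1.0 leaves everything as noise; eps=10.0 recovers the two pairs
```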


6.3 K-Medoids

"kmedoids": {
  "parameters": [
    { "k": 5 },
    { "k": 10 },
    { "k": 20 }
  ]
}
  • Requires knowing approximately how many clusters you expect
  • Very stable algorithm
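The stability comes from the alternating structure of the algorithm: assignment and medoid update both reduce the same total distance. A naive sketch (not pyProCT's implementation) on a hypothetical toy matrix:

```python
import numpy as np

def k_medoids(dist, k, n_iter=100, seed=0):
    """Naive alternating k-medoids on a square distance matrix (sketch)."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(dist.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = dist[:, medoids].argmin(axis=1)  # nearest-medoid assignment
        new = []
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:  # keep only non-empty clusters
                within = dist[np.ix_(members, members)].sum(axis=1)
                new.append(members[within.argmin()])  # most central member
        new = np.array(sorted(new))
        if np.array_equal(new, np.sort(medoids)):
            break  # medoids stable -> converged
        medoids = new
    return labels

d = np.array([[0.0,  2.5, 50.0, 52.0],
              [2.5,  0.0, 48.0, 50.0],
              [50.0, 48.0, 0.0,  3.0],
              [52.0, 50.0, 3.0,  0.0]])
print(k_medoids(d, k=2))  # the two tight pairs end up in separate clusters
```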

6.4 Hierarchical clustering

"hierarchical": {
  "parameters": [
    { "method": "average", "cutoff": 6.0 },
    { "method": "average", "cutoff": 8.0 },
    { "method": "average", "cutoff": 10.0 }
  ]
}

Notes:

  • average is usually better than complete for RMSD
  • Very sensitive to cutoff
  • Can generate many singletons if cutoff is small
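The cutoff sensitivity can be reproduced with SciPy directly (a minimal sketch on a hypothetical toy matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical toy L-RMSD matrix: two tight pairs ~50 Å apart.
d = np.array([[0.0,  2.5, 50.0, 52.0],
              [2.5,  0.0, 48.0, 50.0],
              [50.0, 48.0, 0.0,  3.0],
              [52.0, 50.0, 3.0,  0.0]])

Z = linkage(squareform(d), method="average")  # average linkage tree
for cutoff in (1.0, 6.0, 60.0):
    labels = fcluster(Z, t=cutoff, criterion="distance")  # cut at cutoff Å
    print(f"cutoff={cutoff:5.1f} Å -> {labels.max()} clusters")
# a too-small cutoff yields only singletons; a huge one merges everything
```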

6.5 Spectral clustering

"spectral": {
  "parameters": [
    { "max_clusters": 10 },
    { "max_clusters": 20 }
  ],
  "force_sparse": false
}
  • More expensive than other methods
  • Useful for non-convex cluster shapes
  • Requires well-scaled distance matrices
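The scaling requirement comes from the fact that spectral methods work on similarities, so distances must first pass through a kernel whose width matches the data scale. An illustration using scikit-learn (not a pyProCT dependency) on a hypothetical toy matrix:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical toy L-RMSD matrix: two tight pairs ~50 Å apart.
d = np.array([[0.0,  2.5, 50.0, 52.0],
              [2.5,  0.0, 48.0, 50.0],
              [50.0, 48.0, 0.0,  3.0],
              [52.0, 50.0, 3.0,  0.0]])

sigma = np.median(d)                        # kernel width tied to the data scale
affinity = np.exp(-d**2 / (2 * sigma**2))   # distances -> similarities
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print(labels)  # the two tight pairs separate into two clusters
```

A badly chosen sigma (far from the typical distances) flattens the affinity matrix and degrades the eigenvector embedding, which is the practical meaning of "well-scaled" above.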

6.6 Random (baseline)

"random": {
  "parameters": [
    { "num_of_clusters": 2 },
    { "num_of_clusters": 5 }
  ]
}
  • Not a real clustering algorithm
  • Useful as a baseline for evaluation metrics

7. Evaluation criteria

For docking applications, Silhouette and Cohesion are the most informative.

Example:

"evaluation": {
  "evaluation_criteria": {
    "criteria_0": {
      "Silhouette": {
        "action": ">",
        "weight": 1
      }
    }
  },
  "maximum_noise": 30,
  "minimum_cluster_size": 1
}

Notes:

  • Silhouette can be NaN for 1-cluster solutions (this is expected)
  • Some algorithms may generate valid clusterings that are later rejected by evaluation filters
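For intuition, Silhouette can be computed directly from a precomputed distance matrix with scikit-learn (an illustration only; pyProCT computes it internally), using a hypothetical two-cluster labeling:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Hypothetical toy L-RMSD matrix: two tight pairs ~50 Å apart.
d = np.array([[0.0,  2.5, 50.0, 52.0],
              [2.5,  0.0, 48.0, 50.0],
              [50.0, 48.0, 0.0,  3.0],
              [52.0, 50.0, 3.0,  0.0]])

labels = np.array([0, 0, 1, 1])  # hypothetical two-cluster solution
score = silhouette_score(d, labels, metric="precomputed")
print(score)  # well-separated tight pairs score close to 1
```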

8. Postprocessing actions (KEYWORD list)

Valid postprocessing actions:

| KEYWORD           | Description                      |
|-------------------|----------------------------------|
| representatives   | representative structures        |
| clusters          | PDB files per cluster            |
| cluster_stats     | per-cluster statistics           |
| rmsf              | RMSF per cluster                 |
| centers_and_trace | cluster centers and trajectories |
| compression       | redundancy elimination           |

⚠️ Note: pdb_clusters was replaced by clusters.


9. Known limitations

  • DBSCAN may legitimately return zero clusters for some parameters
  • Hierarchical clustering can generate many singletons
  • Spectral clustering is sensitive to matrix scaling
  • Random clustering is not meaningful scientifically
  • Not all "Improductive clustering search" messages indicate a bug

10. Recommended workflow for docking

  1. Start with GROMOS
  2. Add DBSCAN with increasing eps
  3. Use Silhouette as main selection criterion
  4. Inspect cluster representatives visually
  5. Use hierarchical only for exploratory analysis

11. Citation

Original pyProCT paper:

If you use pyProCT or any of its parts, including its documentation, in a scientific article, please consider citing:
J. Chem. Theory Comput., 2014, 10 (8), pp 3236–3243

This fork provides Python 3 compatibility and maintenance fixes, but does not change the scientific methodology.
