URF is a self-supervised version of the traditional supervised random forest algorithm. It facilitates feature selection in protein biophysics for resolving a protein's conformational representation.

msahilgit/Unsupervised-Random-Forest


Unbiased learning of protein conformational representation via unsupervised random forest

Accurate data representation is paramount in molecular dynamics simulations to capture the functionally relevant motions of proteins. Traditional feature selection methods, while effective, often rely on labeled data, limiting their applicability to novel systems. Here, we present unsupervised random forest (URF), a self-supervised adaptation of traditional random forests that identifies functionally critical features without requiring prior labels. URF-selected features highlight key functional regions, enabling the identification of important residues in diverse proteins. By implementing a memory-efficient version, we demonstrate URF's capability to resolve functional states in around 10 diverse systems, including folded and intrinsically disordered proteins, performing on par with or surpassing 16 leading baseline methods. Crucially, URF is guided by an internal metric, the learning coefficient, which automates hyper-parameter optimization, making the method robust and user-friendly. Benchmarking results reveal URF's distinct ability to produce functionally meaningful representations in comparison to previously reported methods, facilitating downstream analyses such as Markov state modeling. The investigation presented here establishes URF as a leading tool for unsupervised representation learning in protein biophysics.
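
The abstract does not spell out how a forest can be trained without labels. For orientation only, one long-established generic approach (not necessarily what the URF module does) is to contrast the real data with a copy in which every feature column has been shuffled independently, and then train an ordinary classifier to tell the two apart. The sketch below uses scikit-learn; the function name `contrast_forest_importances`, its parameters, and the toy data are hypothetical, and the returned out-of-bag accuracy is not the learning coefficient described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def contrast_forest_importances(data, n_trees=200, seed=0):
    """Rank features by how well they separate real data from a shuffled copy.

    Generic illustration only; this is not the URF module's algorithm.
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)

    # Shuffle every column independently: each feature keeps its own
    # distribution, but the correlations between features are destroyed.
    synthetic = np.column_stack([rng.permutation(column) for column in data.T])

    # Label real frames 1 and shuffled frames 0, then train a normal classifier.
    samples = np.vstack([data, synthetic])
    labels = np.concatenate([np.ones(len(data), dtype=int),
                             np.zeros(len(synthetic), dtype=int)])
    forest = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                    random_state=seed)
    forest.fit(samples, labels)

    # Out-of-bag accuracy says how distinguishable real and shuffled data are;
    # the impurity-based importances say which features drive that separation.
    return forest.oob_score_, forest.feature_importances_


# Toy input: features 0 and 1 move together, features 2-4 are independent noise.
rng = np.random.default_rng(1)
shared = rng.normal(size=500)
toy = np.column_stack([shared, shared + 0.05 * rng.normal(size=500),
                       rng.normal(size=(500, 3))])
accuracy, importances = contrast_forest_importances(toy)
print(accuracy, importances)
```

In this toy case the two correlated features should receive the largest importances, because only their joint behaviour distinguishes real frames from shuffled ones.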

Reference

This repository is an implementation of the URF protocol, corresponding to the publication (ref.).

MAIN
├── URF : the unsupervised-random-forest module

├── data : scripts for data estimation from MD trajectories
│ ├── ASH1
│ ├── LJ polymer
│ ├── P450_binding
│ ├── P450_channel1
│ ├── SIC1
│ ├── T4L
│ ├── asyn
│ ├── mopR
│ ├── mopR_ensembles
│ ├── pASH1
│ └── pSIC1

├── scripts : scripts for reproducibility of results
│ ├── 0_python_modules
│ ├── ASH1
│ ├── LJ_polymer
│ ├── P450_binding
│ ├── P450_channel1
│ ├── SIC1
│ ├── T4L
│ ├── asyn
│ ├── baseline
│ │ ├── mopr
│ │ └── t4l
│ ├── functional_regions
│ │ ├── mopr
│ │ ├── t4l
│ │ └── diffnet
│ ├── hyperparameters
│ ├── mopR
│ ├── msm
│ │ ├── mopr
│ │ ├── asyn
│ │ └── vampnet
│ ├── optimization
│ ├── pASH1
│ └── pSIC1

└── usage : guidelines/tutorials for using URF

Dependencies

  • NumPy
  • scikit-learn
  • numba
  • copy
  • tqdm
  • multiprocessing
  • sys
  • fastcluster
  • gc
  • pickle
  • tables (only for certain functions of proximity_matrix.py, off by default)
  • scipy
  • joblib

Installation

conda create --name urf python=3.9
conda activate urf
git clone https://github.com/msahilgit/Unsupervised-Random-Forest
cd Unsupervised-Random-Forest/
pip install -e .
#also see 'alternative.txt' for use without installation

Usage

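from URF.model import unsupervised_random_forest as urf

# data: your feature matrix computed from MD trajectories (see the data/ scripts)
dobj = urf()
dobj.fit(data)
# lc, fimp: presumably the learning coefficient and the feature importances
lc, fimp = dobj.get_output()
# see usage/t{1,2}.ipynb for details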
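
Once feature importances are available, a common next step is to keep only the highest-ranked features before downstream analysis. The following is a minimal sketch of that step, assuming `fimp` is a 1-D array with one importance value per column of `data`; that shape is an assumption, so check the usage notebooks for the actual output format. The helper `select_top_features` is hypothetical and not part of the URF module, and the arrays are stand-ins so the example runs on its own.

```python
import numpy as np


def select_top_features(data, importances, k):
    """Return the indices of the k highest-importance features and the reduced data."""
    data = np.asarray(data)
    importances = np.asarray(importances, dtype=float)
    if importances.ndim != 1 or importances.shape[0] != data.shape[1]:
        raise ValueError("expected one importance value per column of data")
    if not 1 <= k <= importances.shape[0]:
        raise ValueError("k must be between 1 and the number of features")

    # Sort features from most to least important and keep the first k.
    ranked = np.argsort(importances)[::-1]
    # Restore ascending index order so the reduced columns keep their original order.
    top = np.sort(ranked[:k])
    return top, data[:, top]


# Stand-in arrays; in practice `data` and `fimp` would come from the snippet above.
data = np.arange(12, dtype=float).reshape(3, 4)  # 3 frames, 4 features
fimp = np.array([0.1, 0.4, 0.2, 0.3])            # assumed: one value per feature
top, reduced = select_top_features(data, fimp, k=2)
print(top)            # [1 3]
print(reduced.shape)  # (3, 2)
```

The reduced matrix can then be passed to whatever downstream analysis is planned; the choice of `k` is left to the user here.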

Quick Links

Paper Data RF-MD-TICA previous-work
