Accurate data representation is paramount in molecular dynamics simulations to capture the functionally relevant motions of proteins. Traditional feature selection methods, while effective, often rely on labeled data, limiting their applicability to novel systems. Here, we present unsupervised random forest (URF), a self-supervised adaptation of traditional random forests that identifies functionally critical features without requiring prior labels. URF-selected features highlight key functional regions, enabling the identification of important residues in diverse proteins. By implementing a memory-efficient version, we demonstrate URF's capability to resolve functional states in around 10 diverse systems, including folded and intrinsically disordered proteins, performing on par with or surpassing 16 leading baseline methods. Crucially, URF is guided by an internal metric, the learning coefficient, which automates hyper-parameter optimization, making the method robust and user-friendly. Benchmarking results reveal URF's distinct ability to produce functionally meaningful representations in comparison to previously reported methods, facilitating downstream analyses such as Markov state modeling. The investigation presented here establishes URF as a leading tool for unsupervised representation learning in protein biophysics.
This repository is an implementation of the URF protocol, corresponding to the publication (ref.).
MAIN
├── URF : the unsupervised-random-forest module
│
├── data : scripts for data estimation from MD trajectories
│ ├── ASH1
│ ├── LJ polymer
│ ├── P450_binding
│ ├── P450_channel1
│ ├── SIC1
│ ├── T4L
│ ├── asyn
│ ├── mopR
│ ├── mopR_ensembles
│ ├── pASH1
│ └── pSIC1
│
├── scripts : scripts for reproducibility of results
│ ├── 0_python_modules
│ ├── ASH1
│ ├── LJ_polymer
│ ├── P450_binding
│ ├── P450_channel1
│ ├── SIC1
│ ├── T4L
│ ├── asyn
│ ├── baseline
│ │ ├── mopr
│ │ └── t4l
│ ├── functional_regions
│ │ ├── mopr
│ │ ├── t4l
│ │ └── diffnet
│ ├── hyperparameters
│ ├── mopR
│ ├── msm
│ │ ├── mopr
│ │ ├── asyn
│ │ └── vampnet
│ ├── optimization
│ ├── pASH1
│ └── pSIC1
│
└── usage : guidelines/tutorials for using URF
- NumPy
- scikit-learn
- numba
- copy
- tqdm
- multiprocessing
- sys
- fastcluster
- gc
- pickle
- tables (only for certain functions of proximity_matrix.py, off by default)
- scipy
- joblib
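
Of the modules listed above, copy, multiprocessing, sys, gc and pickle ship with the Python standard library, so only the remaining packages need to be installed. As a rough sketch (not part of the repository), the snippet below checks which of those third-party packages can be found in the current environment. The helper name `missing_modules` and the `THIRD_PARTY` list are illustrative assumptions; the import names `sklearn` (for scikit-learn) and `tables` (for PyTables) are the standard ones.

```python
from importlib.util import find_spec

# Import names of the third-party packages from the list above.
# scikit-learn is imported as "sklearn"; "tables" is PyTables (optional here).
THIRD_PARTY = [
    "numpy", "sklearn", "numba", "tqdm",
    "fastcluster", "tables", "scipy", "joblib",
]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


# Report anything that still has to be installed in this environment.
print("missing:", missing_modules(THIRD_PARTY))
```

An empty list means every package was found; `tables` can be ignored if the optional proximity-matrix functions are not used.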
conda create --name urf python=3.9
conda activate urf
git clone https://github.com/msahilgit/Unsupervised-Random-Forest
cd Unsupervised-Random-Forest/
pip install -e .
# also see 'alternative.txt' for use without installation

from URF.model import unsupervised_random_forest as urf
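dobj = urf()                  # model with default settings
dobj.fit(data)                # data: features estimated from MD trajectories (see data/)
lc, fimp = dobj.get_output()  # presumably the learning coefficient and feature importances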
# see usage/t{1,2}.ipynb for details
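
The usage snippet above leaves `data` undefined. The sketch below shows one possible way to build such an input with NumPy, assuming (this is not confirmed by the repository) that `fit` accepts a 2D array of shape (n_frames, n_features). The function `pairwise_distance_features` and the random toy coordinates are hypothetical and only illustrate the shape of the input; the scripts under data/ remain the reference for the features actually used.

```python
import numpy as np


def pairwise_distance_features(coords):
    """Turn coordinates into a (n_frames, n_pairs) matrix of pairwise distances.

    coords: array-like of shape (n_frames, n_atoms, 3), e.g. C-alpha positions.
    """
    coords = np.asarray(coords, dtype=float)
    if coords.ndim != 3 or coords.shape[2] != 3:
        raise ValueError("coords must have shape (n_frames, n_atoms, 3)")
    n_atoms = coords.shape[1]
    # Upper-triangle indices (i < j), so every atom pair appears exactly once.
    i_idx, j_idx = np.triu_indices(n_atoms, k=1)
    diff = coords[:, i_idx, :] - coords[:, j_idx, :]
    return np.linalg.norm(diff, axis=-1)


# Hypothetical toy trajectory: 5 frames of 4 atoms with random positions.
rng = np.random.default_rng(0)
toy_coords = rng.normal(size=(5, 4, 3))
data = pairwise_distance_features(toy_coords)
print(data.shape)  # (5, 6): one column per atom pair

# Under the assumption stated above, this array would play the role of
# `data` in dobj.fit(data).
```

Pairwise distances are used here only because they are invariant to rotation and translation of the molecule; any other per-frame feature set arranged as rows would fit the same pattern.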