MotifAE is a sparse autoencoder with an additional smoothness loss, designed for the unsupervised discovery of functional sequence patterns from protein language model embeddings (ESM2-650M here). The smoothness loss encourages the latent features to capture meaningful, coherent patterns along protein sequences.
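The architecture described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the repository's implementation; in particular, the exact form of the smoothness term (here, a penalty on latent differences between adjacent residues) is an assumption, and the latent dimension is arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over per-residue PLM embeddings.
    d_in: ESM2-650M hidden size (1280); d_latent: overcomplete latent size (illustrative)."""
    def __init__(self, d_in=1280, d_latent=8192):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_latent)
        self.decoder = nn.Linear(d_latent, d_in)

    def forward(self, x):
        z = F.relu(self.encoder(x))   # sparse, non-negative per-residue latents
        x_hat = self.decoder(z)
        return x_hat, z

def motifae_loss(x, x_hat, z, l1_coef=1e-3, smooth_coef=1e-3):
    """Reconstruction + L1 sparsity + smoothness along the sequence axis.
    The smoothness term penalizes latent changes between adjacent residues
    (an assumed form of the paper's smoothness loss)."""
    recon = F.mse_loss(x_hat, x)
    sparsity = z.abs().mean()
    smooth = (z[:, 1:, :] - z[:, :-1, :]).pow(2).mean()
    return recon + l1_coef * sparsity + smooth_coef * smooth

torch.manual_seed(0)
x = torch.randn(2, 50, 1280)          # (batch, seq_len, d_in) per-residue embeddings
model = SparseAutoencoder()
x_hat, z = model(x)
loss = motifae_loss(x, x_hat, z)
```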

For MotifAE training and inference, please set up the environment first:

conda create -n motifae python=3.12
conda activate motifae
pip install fair-esm pandas numpy
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121

Other environments with the necessary dependencies should also work, as long as PyTorch and ESM are properly installed.
The weights for the trained MotifAE model can be downloaded from Zenodo.
Please refer to notebook/ELM_motif.ipynb for instructions on how to:
- Load a trained MotifAE model,
- Extract latent features from protein sequences,
- Compare these latent features with known annotations (e.g., ELM motifs).
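The per-residue workflow above can be sketched as follows. The encoder weights here are random stand-ins for the Zenodo checkpoint, and the annotation is a toy example; the actual loading code and feature names live in the notebook.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-ins for a trained encoder loaded from the Zenodo checkpoint
# (see notebook/ELM_motif.ipynb for the real loading code).
d_in, d_latent, seq_len = 1280, 8192, 60
W_enc = torch.randn(d_latent, d_in) / d_in ** 0.5
b_enc = torch.zeros(d_latent)

# ESM2-650M last-layer embedding for one protein (random stand-in here)
emb = torch.randn(seq_len, d_in)
z = F.relu(emb @ W_enc.T + b_enc)     # (seq_len, d_latent) per-residue latent activations

# Compare one latent feature against a binary per-residue motif annotation
annotation = torch.zeros(seq_len)
annotation[10:18] = 1.0               # toy ELM-style motif span
feat = z[:, 123]                      # activations of one (arbitrarily chosen) latent
# Point-biserial correlation as a simple agreement score
corr = torch.corrcoef(torch.stack([feat, annotation]))[0, 1]
```

With a trained model, latents whose activations track annotated spans like this are candidates for interpretable motif features.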
To use MotifAE-G for aligning latent features with experimental data (for example, domain folding stability), follow the steps below:
- Information on the proteins with stability mutational scanning is provided in data/412pro_info.csv. Please download the corresponding mutation effect data (412pros_ddG_ML.csv) from Zenodo.
- Process these data using the code in gate/1_stability_prepare.ipynb.
- Train MotifAE-G to align latent features with experimental measurements: gate/2_gate_model.ipynb.
- Design proteins with enhanced stability by steering stability-associated latent features: gate/3_protein_design.ipynb.
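The align-then-steer idea behind the last two steps can be sketched with synthetic data. Everything below is illustrative: the latent summaries, ddG values, the sparse linear "gate" readout, and the steering step are assumptions standing in for the notebooks' actual procedure.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy stand-ins: per-variant latent summaries (e.g., pooled MotifAE features)
# and measured folding-stability effects (ddG).
n_variants, d_latent = 256, 64
z = torch.randn(n_variants, d_latent)
true_w = torch.zeros(d_latent)
true_w[:4] = torch.tensor([1.5, -1.0, 0.8, 0.5])
ddg = z @ true_w + 0.1 * torch.randn(n_variants)

# Sparse linear readout aligning latents with experimental measurements
gate = nn.Linear(d_latent, 1)
opt = torch.optim.Adam(gate.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    pred = gate(z).squeeze(-1)
    loss = nn.functional.mse_loss(pred, ddg) + 1e-3 * gate.weight.abs().sum()
    loss.backward()
    opt.step()

with torch.no_grad():
    pred = gate(z).squeeze(-1)

# Steering: push latents along the learned stability-associated direction
# (these steered latents would then be decoded back to embedding/sequence space).
direction = gate.weight.detach().squeeze(0)
z_steered = z + 0.5 * direction
```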
To train MotifAE from scratch, follow the steps below.

Download the dataset of 2.3 million representative proteins from Zenodo. This dataset was derived through structure-based clustering of the AlphaFold structure database.

Generate ESM2-650M last-layer embeddings for these representative proteins:
- Edit model/config.py to specify:
  - The path to your downloaded representative_2.3M_seq.csv file.
  - The output directory where embeddings will be saved.
- Execute the following command to generate embeddings (this takes a few days to finish on one GPU):
  python model/get_esm2_embedding.py
Once the embeddings are ready, you can train the MotifAE model:
- Edit model/config.py to define the paths for saving model checkpoints and log files.
- Run the training script (finishes in one day on one GPU):
  python model/train_esm_sae.py
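For orientation, the training loop run by the script can be sketched as below. The random tensors stand in for the precomputed ESM2-650M embeddings, and the loss coefficients, latent size, and smoothness form are illustrative assumptions, not the repository's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Synthetic stand-in for precomputed embeddings: (n_sequences, seq_len, 1280).
# In the real pipeline these come from the files written by model/get_esm2_embedding.py.
d_in, d_latent = 1280, 4096
data = TensorDataset(torch.randn(64, 32, d_in))
loader = DataLoader(data, batch_size=8, shuffle=True)

encoder = nn.Linear(d_in, d_latent)
decoder = nn.Linear(d_latent, d_in)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

losses = []
for epoch in range(2):                      # a real run uses many more epochs
    for (x,) in loader:
        z = F.relu(encoder(x))              # sparse, non-negative latents
        x_hat = decoder(z)
        loss = (F.mse_loss(x_hat, x)
                + 1e-3 * z.abs().mean()                          # sparsity penalty
                + 1e-3 * (z[:, 1:] - z[:, :-1]).pow(2).mean())   # smoothness (assumed form)
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
```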
MotifAE Reveals Functional Sequence Patterns from Protein Language Model: Unsupervised Discovery and Interpretability Analysis. Chao Hou, Di Liu, Yufeng Shen. bioRxiv, November 2025.