
[New Model] PoET-2 for DMS Zero-Shot Benchmarks#89

Open
timt51 wants to merge 2 commits into OATML-Markslab:main from OpenProteinAI:poet-2-unsupervised

Conversation

timt51 (Contributor) commented Sep 2, 2025

This PR adds a new baseline model, PoET-2, for the DMS (substitutions and indels) and Clinical (substitutions and indels) unsupervised benchmarks.


⚙️ Setup

  1. Download Model Weights & MSAs

    To download the model weights and the Multiple Sequence Alignments (MSAs) required for predictions, run the following commands:

    cd proteingym/baselines/PoET-2
    make download

    This will save the model weights to ~/.cache/ProteinGym/baselines/PoET-2 and the MSAs to ~/.cache/ProteinGym/baselines/PoET. Note that the MSAs are the same as those used for PoET(-1).


🚀 Running Inference

The scoring scripts for each benchmark are detailed in the table below.

| Benchmark | Script Path | Output Directory |
| --- | --- | --- |
| DMS Substitutions | `scripts/scoring_DMS_zero_shot/scoring_PoET_2_substitutions.sh` | `${DMS_output_score_folder_subs}PoET-2` |
| DMS Indels | `scripts/scoring_DMS_zero_shot/scoring_PoET_2_indels.sh` | `${DMS_output_score_folder_indels}PoET-2` |
| Clinical Substitutions | `scripts/scoring_clinical_zero_shot/scoring_PoET_2_substitutions.sh` | `${clinical_output_score_folder_subs}PoET-2` |
| Clinical Indels | `scripts/scoring_clinical_zero_shot/scoring_PoET_2_indels.sh` | `${clinical_output_score_folder_indels}PoET-2` |
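As a sketch, the four runs could be launched sequentially from the repository root (a dry run that only prints each command; drop the `echo`-style indirection to execute — the working-directory convention is an assumption):

```shell
# Dry run: print the launch command for each scoring script listed above.
# Each real run needs GPUs plus the downloaded weights, MSAs, and structure caches.
commands=""
for script in \
    scripts/scoring_DMS_zero_shot/scoring_PoET_2_substitutions.sh \
    scripts/scoring_DMS_zero_shot/scoring_PoET_2_indels.sh \
    scripts/scoring_clinical_zero_shot/scoring_PoET_2_substitutions.sh \
    scripts/scoring_clinical_zero_shot/scoring_PoET_2_indels.sh
do
    commands="${commands}bash ${script}
"
done
printf '%s' "$commands"
```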

These scripts automatically download predicted protein structures from AlphaFoldDB, which can take a significant amount of time. The structures are saved to the following cache directories, requiring ~12 GB for the DMS benchmark and ~310 GB for the Clinical benchmark:

  • DMS: ${PROTEINGYM_CACHE}/baselines/PoET-2/DMS_AF2_structures_cache
  • Clinical: ${PROTEINGYM_CACHE}/baselines/PoET-2/clinical_AF2_structures_cache

To download the structures without running model inference, first run the scripts with the SAMPLE_PROMPTS_ONLY=1 environment variable set.
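For example (a sketch; the SAMPLE_PROMPTS_ONLY variable name is from this PR, while the specific invocation is shown commented out since it requires the full setup):

```shell
# Prefetch AF2 structures without running inference: set the flag, then run any
# scoring script from the table above (actual invocation commented out here).
export SAMPLE_PROMPTS_ONLY=1
# bash scripts/scoring_DMS_zero_shot/scoring_PoET_2_substitutions.sh
echo "SAMPLE_PROMPTS_ONLY=${SAMPLE_PROMPTS_ONLY}"
```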

By default, all scoring scripts will attempt to utilize every GPU available to them.


💡 A Note on GPU Usage

While the scripts default to using all available GPUs, the two benchmarks lend themselves to different parallelization strategies. For the most efficient inference, consider the following:

  • For the DMS benchmark, which contains proteins with a very large number of variants, the default behavior of using all available GPUs is effective for accelerating inference and preventing the run for any single protein from taking an excessively long time.
  • For the Clinical benchmark, which contains a larger number of proteins each with fewer variants, it can be more efficient to parallelize the workflow across proteins. Due to the overhead of launching each script, you may find it faster to run inference for each protein on a single GPU and process multiple proteins in parallel.
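The per-protein strategy for the Clinical benchmark could be sketched as below. Note that `score_one_protein.sh` and the protein IDs are placeholders, not scripts from this PR; a real run would launch each command in the background and `wait` rather than just printing a plan:

```shell
# Hypothetical sketch: round-robin proteins over GPUs, one GPU per run.
NUM_GPUS=4
i=0
plan=""
for protein in PROT_A PROT_B PROT_C PROT_D PROT_E; do
    gpu=$(( i % NUM_GPUS ))   # pin this protein's run to a single GPU
    plan="${plan}CUDA_VISIBLE_DEVICES=${gpu} bash score_one_protein.sh ${protein}
"
    i=$(( i + 1 ))
done
printf '%s' "$plan"   # in practice: launch these in the background, then `wait`
```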

📊 Performance

Using the provided evaluation scripts, we obtained the following performance:

| Benchmark | Metric | Score |
| --- | --- | --- |
| DMS Substitutions | Spearman's $\rho$ | 0.500 |
| DMS Indels | Spearman's $\rho$ | 0.573 |
| Clinical Substitutions | AUROC | 0.932 |
| Clinical Indels (Leaderboard set*) | AUROC | 0.949 |
| Clinical Indels (All variants) | AUROC | 0.945 |

* The set of indels on the public leaderboard is a subset of all available variants in the benchmark's downloadable data.


🔗 Precomputed Scores

Precomputed PoET-2 predictions can be downloaded from the following links. In each file, the score is located in the final column, named PoET-2.

Update on 2026/01/26

Outputs from performance_DMS_benchmarks.py for DMS benchmarks (AFDB v6) are also available here.

JulesGM commented Dec 17, 2025

Hello @pascalnotin,

I was wondering if you had the time to look at this? Integrating this model would be useful. Thanks.

timt51 mentioned this pull request Dec 17, 2025
timt51 (Contributor, Author) commented Dec 17, 2025

I realized that I may have needed to create an accompanying issue, so here's that issue: #93.

pascalnotin (Contributor) commented:

Thank you for the PR @timt51 !
Finally found some time over the holidays to look into this. I'm getting slightly lower numbers on the DMS benchmarks during reproduction: 0.496 for substitutions and 0.565 for indels. I tried running the same assays multiple times on different machines, and the run-to-run variation is negligible on my end. Has anything changed re: the PoET-2 checkpoint and/or the AF2 structures cache? I'm using the PoET MSAs (baselines/PoET/msas/DMS_substitutions) and the AF2 structures cache provided in your makefile.

timt51 (Contributor, Author) commented Jan 6, 2026

Happy New Year!
I don't think anything has changed since I ran this, but perhaps I'm forgetting something. Will double check and get back to you. Are you able to share the predictions you obtained?

timt51 (Contributor, Author) commented Jan 26, 2026

Hi @pascalnotin,

The discrepancy was due to an issue with downloading structures from AFDB. The scoring script previously attempted to download structures from AFDBv4, but most of those links are now broken. I have updated the scoring script to download structures from the latest version of AFDB, AFDBv6.

Deleting the following directories, if present, and then rerunning the scripts should reproduce the performance numbers reported in the PR description to within ~0.001.

  • DMS: ${PROTEINGYM_CACHE}/baselines/PoET-2/DMS_AF2_structures_cache
  • Clinical: ${PROTEINGYM_CACHE}/baselines/PoET-2/clinical_AF2_structures_cache
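A minimal sketch of that cleanup (the default cache root is an assumption based on the ~/.cache/ProteinGym path in the setup section; adjust PROTEINGYM_CACHE if yours differs):

```shell
# Remove the stale AFDBv4 structure caches so the updated scripts re-download
# from AFDBv6 on the next run. The fallback cache root below is an assumption.
PROTEINGYM_CACHE="${PROTEINGYM_CACHE:-$HOME/.cache/ProteinGym}"
rm -rf "${PROTEINGYM_CACHE}/baselines/PoET-2/DMS_AF2_structures_cache"
rm -rf "${PROTEINGYM_CACHE}/baselines/PoET-2/clinical_AF2_structures_cache"
```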

I have also updated the PR description with links to the new scores produced by running the updated scoring script on the DMS benchmarks.

JulesGM commented Feb 23, 2026

@pascalnotin Hello, just wondering if you have had the time to look at this. Thanks!
