
[New Model] PoET-2 for DMS Zero-Shot Benchmarks#89

Open
timt51 wants to merge 2 commits into OATML-Markslab:main from OpenProteinAI:poet-2-unsupervised

Conversation

timt51 (Contributor) commented Sep 2, 2025

This PR adds a new baseline model, PoET-2, for the DMS (substitutions and indels) and Clinical (substitutions and indels) unsupervised benchmarks.


⚙️ Setup

  1. Download Model Weights & MSAs

    To download the model weights and the Multiple Sequence Alignments (MSAs) required for predictions, run the following commands:

    cd proteingym/baselines/PoET-2
    make download

    This will save the model weights to ~/.cache/ProteinGym/baselines/PoET-2 and the MSAs to ~/.cache/ProteinGym/baselines/PoET. Note that the MSAs are the same as those used for PoET(-1).


🚀 Running Inference

The scoring scripts for each benchmark are detailed in the table below.

| Benchmark | Script Path | Output Directory |
| --- | --- | --- |
| DMS Substitutions | `scripts/scoring_DMS_zero_shot/scoring_PoET_2_substitutions.sh` | `${DMS_output_score_folder_subs}PoET-2` |
| DMS Indels | `scripts/scoring_DMS_zero_shot/scoring_PoET_2_indels.sh` | `${DMS_output_score_folder_indels}PoET-2` |
| Clinical Substitutions | `scripts/scoring_clinical_zero_shot/scoring_PoET_2_substitutions.sh` | `${clinical_output_score_folder_subs}PoET-2` |
| Clinical Indels | `scripts/scoring_clinical_zero_shot/scoring_PoET_2_indels.sh` | `${clinical_output_score_folder_indels}PoET-2` |
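As a sketch, the four runs could be launched sequentially from the repository root (a dry run that only prints each command; drop the `echo`-style indirection to execute — the working-directory convention is an assumption):

```shell
# Dry run: print the launch command for each scoring script listed above.
# Each real run needs GPUs plus the downloaded weights, MSAs, and structure caches.
commands=""
for script in \
    scripts/scoring_DMS_zero_shot/scoring_PoET_2_substitutions.sh \
    scripts/scoring_DMS_zero_shot/scoring_PoET_2_indels.sh \
    scripts/scoring_clinical_zero_shot/scoring_PoET_2_substitutions.sh \
    scripts/scoring_clinical_zero_shot/scoring_PoET_2_indels.sh
do
    commands="${commands}bash ${script}
"
done
printf '%s' "$commands"
```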

These scripts automatically download predicted protein structures from AlphaFoldDB, which can take a significant amount of time. The structures are saved to the following cache directories, requiring ~12 GB for the DMS benchmark and ~310 GB for the Clinical benchmark:

  • DMS: ${PROTEINGYM_CACHE}/baselines/PoET-2/DMS_AF2_structures_cache
  • Clinical: ${PROTEINGYM_CACHE}/baselines/PoET-2/clinical_AF2_structures_cache

To download the structures without running model inference, first run the scripts with the SAMPLE_PROMPTS_ONLY=1 environment variable set.
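For example (a sketch; the SAMPLE_PROMPTS_ONLY variable name is from this PR, while the specific invocation is shown commented out since it requires the full setup):

```shell
# Prefetch AF2 structures without running inference: set the flag, then run any
# scoring script from the table above (actual invocation commented out here).
export SAMPLE_PROMPTS_ONLY=1
# bash scripts/scoring_DMS_zero_shot/scoring_PoET_2_substitutions.sh
echo "SAMPLE_PROMPTS_ONLY=${SAMPLE_PROMPTS_ONLY}"
```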

By default, all scoring scripts will attempt to utilize every GPU available to them.


💡 A Note on GPU Usage

While the scripts default to using all available GPUs, the two benchmarks lend themselves to different parallelization strategies. For the most efficient inference, consider the following:

  • For the DMS benchmark, which contains proteins with a very large number of variants, the default behavior of using all available GPUs is effective for accelerating inference and preventing the run for any single protein from taking an excessively long time.
  • For the Clinical benchmark, which contains a larger number of proteins each with fewer variants, it can be more efficient to parallelize the workflow across proteins. Due to the overhead of launching each script, you may find it faster to run inference for each protein on a single GPU and process multiple proteins in parallel.
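The per-protein strategy for the Clinical benchmark could be sketched as below. Note that `score_one_protein.sh` and the protein IDs are placeholders, not scripts from this PR; a real run would launch each command in the background and `wait` rather than just printing a plan:

```shell
# Hypothetical sketch: round-robin proteins over GPUs, one GPU per run.
NUM_GPUS=4
i=0
plan=""
for protein in PROT_A PROT_B PROT_C PROT_D PROT_E; do
    gpu=$(( i % NUM_GPUS ))   # pin this protein's run to a single GPU
    plan="${plan}CUDA_VISIBLE_DEVICES=${gpu} bash score_one_protein.sh ${protein}
"
    i=$(( i + 1 ))
done
printf '%s' "$plan"   # in practice: launch these in the background, then `wait`
```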

📊 Performance

Using the provided evaluation scripts, we obtained the following performance:

| Benchmark | Metric | Score |
| --- | --- | --- |
| DMS Substitutions | Spearman's $\rho$ | 0.500 |
| DMS Indels | Spearman's $\rho$ | 0.573 |
| Clinical Substitutions | AUROC | 0.932 |
| Clinical Indels (Leaderboard set*) | AUROC | 0.949 |
| Clinical Indels (All variants) | AUROC | 0.945 |

* The set of indels on the public leaderboard is a subset of all available variants in the benchmark's downloadable data.


🔗 Precomputed Scores

Precomputed PoET-2 predictions can be downloaded from the following links. In each file, the score is located in the final column, named PoET-2.

Update on 2026/01/26

Outputs from performance_DMS_benchmarks.py for DMS benchmarks (AFDB v6) are also available here.

JulesGM commented Dec 17, 2025

Hello @pascalnotin,

I was wondering if you had the time to look at this? Integrating this model would be useful. Thanks.

timt51 mentioned this pull request Dec 17, 2025
timt51 (Contributor, Author) commented Dec 17, 2025

I realized that I may have needed to create an accompanying issue, so here's that issue: #93.

pascalnotin (Contributor) commented:

Thank you for the PR @timt51 !
Finally found some time over the holidays to look into this. I'm getting slightly lower numbers on the DMS benchmarks during reproduction: 0.496 for substitutions and 0.565 for indels. I tried running the same assays multiple times on different machines, and the run-to-run variation is negligible on my end. Has anything changed re: the PoET-2 checkpoint and/or the AF2 structures cache? I'm using the PoET MSAs (baselines/PoET/msas/DMS_substitutions) and the AF2 structures cache provided in your makefile.

timt51 (Contributor, Author) commented Jan 6, 2026

Happy New Year!
I don't think anything has changed since I ran this, but perhaps I'm forgetting something. Will double check and get back to you. Are you able to share the predictions you obtained?

timt51 (Contributor, Author) commented Jan 26, 2026

Hi @pascalnotin,

The discrepancy was due to an issue with downloading structures from AFDB. The scoring script previously attempted to download structures from AFDBv4, but most of those links are now broken. I have updated the scoring script to download structures from the latest version of AFDB, AFDBv6.

Deleting the following directories, if present, and then rerunning the scripts should reproduce the performance numbers reported in the PR description to within ~0.001.

  • DMS: ${PROTEINGYM_CACHE}/baselines/PoET-2/DMS_AF2_structures_cache
  • Clinical: ${PROTEINGYM_CACHE}/baselines/PoET-2/clinical_AF2_structures_cache
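A minimal sketch of that cleanup (the default cache root is an assumption based on the ~/.cache/ProteinGym path in the setup section; adjust PROTEINGYM_CACHE if yours differs):

```shell
# Remove the stale AFDBv4 structure caches so the updated scripts re-download
# from AFDBv6 on the next run. The fallback cache root below is an assumption.
PROTEINGYM_CACHE="${PROTEINGYM_CACHE:-$HOME/.cache/ProteinGym}"
rm -rf "${PROTEINGYM_CACHE}/baselines/PoET-2/DMS_AF2_structures_cache"
rm -rf "${PROTEINGYM_CACHE}/baselines/PoET-2/clinical_AF2_structures_cache"
```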

I have also updated the PR description with links to the new scores produced by running the updated scoring script on the DMS benchmarks.

JulesGM commented Feb 23, 2026

@pascalnotin Hello, just wondering if you have had the time to look at this. Thanks!
