
Dayhoff

Dayhoff is an Atlas of both protein sequence data and generative language models: a centralized resource that brings together 3.34 billion protein sequences across 1.7 billion clusters of metagenomic and natural protein sequences (GigaRef), 46 million structure-based synthetic sequences (BackboneRef), and 16 million multiple sequence alignments (OpenProteinSet). These models can natively predict zero-shot mutation effects on fitness, scaffold structural motifs by conditioning on evolutionary or structural context, and perform guided generation of novel proteins within specified families. Learning from the metagenomic and structure-based synthetic data in the Dayhoff Atlas increased the cellular expression rates of generated proteins, highlighting the real-world value of expanding the scale, diversity, and novelty of protein sequence data.

The Dayhoff architecture is a hybrid of state-space Mamba layers and Transformer self-attention, interleaved with Mixture-of-Experts modules to maximize capacity while preserving efficiency. It natively handles long contexts, allowing both single sequences and unrolled MSAs to be modeled. Trained with an autoregressive objective in both N→C and C→N directions, Dayhoff supports order-agnostic infilling and scales to billions of parameters.

If you use the code, data, models, or results, please cite our preprint.


Usage

The simplest way to use these models and datasets is via the Hugging Face interface. Alternatively, you can install this package or use our Docker image. Either way, you will need PyTorch, mamba-ssm, causal-conv1d, and flash-attn.

Prerequisites

Requirements:

  • PyTorch: 2.7.1
  • CUDA 12.8 and above

We recommend using uv and creating a clean environment.

uv venv dayhoff 
source dayhoff/bin/activate

In that new environment, install PyTorch 2.7.1.

uv pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128

Now, we need to install mamba-ssm, flash-attn, causal-conv1d, and their prerequisites.

uv pip install wheel packaging
uv pip install --no-build-isolation flash-attn causal-conv1d mamba-ssm
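
As an optional sanity check (not part of the repo), the short snippet below confirms that the CUDA build of PyTorch is active and that the compiled extensions import cleanly:

import torch
import causal_conv1d, flash_attn, mamba_ssm  # compiled CUDA extensions

print(torch.__version__, torch.version.cuda, torch.cuda.is_available())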

To import from HuggingFace, you will need to install these versions:

uv pip install datasets==3.2.0  # for HF datasets
uv pip install transformers==4.51.3
uv pip install huggingface_hub~=0.34.4

Now, you can simply import the models or datasets into your code.

from transformers import SuppressTokensLogitsProcessor
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained('microsoft/Dayhoff-3b-GR-HM-c')
tokenizer = AutoTokenizer.from_pretrained('microsoft/Dayhoff-3b-GR-HM-c',
                                          trust_remote_code=True)

gigaref_clustered_train = load_dataset("microsoft/DayhoffDataset",
                  name="gigaref_no_singletons",
                  split="train")

Installation

Now, we can either install from PyPI:

uv pip install dayhoff

Or, to be able to run the example scripts, clone the repo and install.

git clone https://github.com/microsoft/dayhoff.git
cd dayhoff
uv pip install -e .

Docker

For a fully functional containerized environment without needing to install dependencies manually, you can use the provided Docker image instead:

docker pull samirchar/dayhoff:latest 
docker run -it samirchar/dayhoff:latest
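
To make the host GPUs visible inside the container (assuming the NVIDIA Container Toolkit is installed on the host), add --gpus all to the run command:

docker run --gpus all -it samirchar/dayhoff:latest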

Data and model availability

All Dayhoff models are available on Azure AI Foundry.

Additionally, all Dayhoff models are hosted on Hugging Face 🤗. All datasets used in the paper, with the exception of OpenProteinSet, are available on Hugging Face in three formats: FASTA, Arrow, and JSONL.

GigaRef, BackboneRef, and DayhoffRef are available under a CC BY license.

Datasets

Training datasets

The Dayhoff models were trained on the Dayhoff Atlas, with varying data mixes which include:

UniRef50 (UR50) – dataset from UniProt, clustered at 50% sequence identity; contains only cluster representatives.

  • Splits: train (25 GB), test (26 MB), valid (26 MB)

UniRef90 (UR90) – dataset from UniProt, clustered at 90% sequence identity; contains cluster representatives and members.

  • Splits: train (83 GB), test (90 MB), valid (87 MB)

GigaRef (GR) – 3.34B protein sequences across 1.7B clusters of metagenomic and natural protein sequences. There are two subsets of GigaRef:

  • GigaRef-clusters (GR) - Only includes cluster representatives and members, no singletons
    • Splits: train (433 GB), test (22 MB)
  • GigaRef-singletons (GR-s) - Only includes singletons
    • Splits: train (282 GB)

BackboneRef (BR) – 46M structure-derived synthetic sequences from ca. 240,000 de novo backbones, with three subsets containing 10M sequences each:

  • BackboneRef unfiltered (BRu) – 10M sequences randomly sampled from all 46M designs.
    • Splits: train (3 GB)
  • BackboneRef quality (BRq) – 10M sequences sampled from 127,633 backbones whose average self-consistency RMSD ≤ 2 Å.
    • Splits: train (3 GB)
  • BackboneRef novelty (BRn) – 10M sequences from 138,044 backbones with a max TM-score < 0.5 to any natural structure.
    • Splits: train (3 GB)

OpenProteinSet (HM) – 16 million precomputed MSAs from 16M sequences in UniClust30 and 140,000 PDB chains.

DayhoffRef

Given the potential for generative models to expand the space of proteins and their functions, we used the Dayhoff models to generate DayhoffRef, a PLM-generated database of synthetic protein sequences.

DayhoffRef: a dataset of 16 million synthetic protein sequences generated by the Dayhoff models Dayhoff-3b-UR90, Dayhoff-3b-GR-HM, Dayhoff-3b-GR-HM-c, and Dayhoff-170m-UR50-BRn.

  • Splits: train (5 GB)

Loading datasets in HuggingFace

Below are some examples of how to load the datasets using load_dataset from Hugging Face:

gigaref_clustered_train = load_dataset("microsoft/DayhoffDataset",
                  name="gigaref_no_singletons",
                  split="train")

uniref50_train = load_dataset("microsoft/DayhoffDataset",
                  name="uniref50",
                  split="train")

backboneref_novelty = load_dataset("microsoft/DayhoffDataset",
                  name="backboneref",
                  split="BBR_n")

dayhoffref = load_dataset("microsoft/DayhoffDataset",
                  name="dayhoffref",
                  split="train")

For the largest datasets, consider using streaming=True.
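
For example, a streaming load of the GigaRef training split yields records lazily instead of downloading the full 433 GB split up front. This sketch just uses the standard datasets streaming API:

gigaref_stream = load_dataset("microsoft/DayhoffDataset",
                  name="gigaref_no_singletons",
                  split="train",
                  streaming=True)

# Iterate lazily; .take(n) limits the stream to the first n records.
for record in gigaref_stream.take(5):
    print(record)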

Models

Weights are available for the following models, as described in the paper.

170M parameter models

  • Dayhoff-170m-UR50: A 170M parameter model trained on UniRef50 cluster representatives
  • Dayhoff-170m-UR90: A 170M parameter model trained on UniRef90 members sampled by UniRef50 cluster
  • Dayhoff-170m-GR: A 170M parameter model trained on members sampled from GigaRef clusters
  • Dayhoff-170m-BRu: A 170M parameter model trained on UniRef50 cluster representatives and samples from unfiltered BackboneRef
  • Dayhoff-170m-BRq: A 170M parameter model trained on UniRef50 cluster representatives and samples from quality-filtered BackboneRef
  • Dayhoff-170m-BRn: A 170M parameter model trained on UniRef50 cluster representatives and samples from novelty-filtered BackboneRef

3B parameter models

  • Dayhoff-3b-UR90: A 3B parameter model trained on UniRef90 members sampled by UniRef50 cluster
  • Dayhoff-3b-GR-HM: A 3B parameter model trained on members sampled from GigaRef clusters and homologs from OpenProteinSet
  • Dayhoff-3b-GR-HM-c: A 3B parameter model trained on members sampled from GigaRef clusters and homologs from OpenProteinSet and subsequently cooled using UniRef90 members sampled by UniRef50 cluster and homologs from OpenProteinSet.

Unconditional generation

For most cases, use examples/generate.py to generate new protein sequences. Below is a sample command that generates 10 sequences of at most 100 residues and places them in a fasta file in the generations/ directory:

python examples/generate.py generations/ --model-name Dayhoff-170m-UR50-BBR-n --max-length 100 --n-generations 10 --temp 1.0 --min-p 0.0 --random-seed 1 --gpu 0

Homolog-conditioned generation

examples/generate.py includes an option to pass a fasta file, in which case it performs sequence generation conditioned on the sequences in that file. The order of the conditioning sequences is randomly shuffled for each generation.

python examples/generate.py generations/ --fasta-file example.fasta --model-name Dayhoff-3b-GR-HM-c --max-length 128 --n-generations 10 --temp 1.0 --min-p 0.0 --random-seed 1 --gpu 0

Zero-shot fitness scoring

examples/score.py computes forward and backward average log-likelihoods for every sequence in a fasta file.

python examples/score.py example.fasta output_dir/ --model-name Dayhoff-3b-GR-HM-c  --gpu 0
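
examples/score.py is the reference implementation; the sketch below only illustrates the idea behind the forward-direction score, i.e. the average per-token log-likelihood of a sequence under the causal LM. The tokenization and prompt handling here are assumptions, and the reverse C→N scoring pass is not shown.

import torch
import torch.nn.functional as F

def average_log_likelihood(sequence, model, tokenizer, device="cuda"):
    """Mean per-token log-likelihood of `sequence` under the causal LM (N→C direction)."""
    ids = tokenizer(sequence, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        logits = model(ids).logits
    # Position i predicts token i+1, so shift logits and targets by one.
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.mean().item()

# Reuses the `model` and `tokenizer` loaded in the Usage section, moved to the GPU.
score = average_log_likelihood("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", model.to("cuda"), tokenizer)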

Analysis scripts

The following scripts were used to conduct the analyses described in the paper.

  • Generation
  • Dataset analysis
  • Perplexity
  • Sequence fidelity (via folding and inverse folding)
  • Distributional embedding analysis (via FPD and PNMMD)
  • Pfam annotation
  • DayhoffRef compilation
  • ProteinGym evals
  • Scaffolding (details in README.md in scaffolding/)
  • Evolution-guided generation
  • Cas9 evals

Out-of-Scope Use Cases

This model should not be used to generate anything that is not a protein sequence or a set of homologous protein sequences. It is not meant for natural language or other biological sequences, such as DNA sequences.

Responsible AI Considerations

The intended use of this model is to generate high-quality, realistic protein sequences or sets of homologous protein sequences. Generations can be designed from scratch or conditioned on partial sequences in both N→C and C→N directions.

Risks and limitations: Not all sequences are guaranteed to be realistic. It remains difficult to generate high-quality sequences with no sequence homology to any natural sequence.

The code and datasets released in this repository are provided for research and development use only. They are not intended for use in clinical decision-making or for any other clinical use, and the performance of these models for clinical use has not been established. You bear sole responsibility for any use of these models, data and software, including incorporation into any product intended for clinical use.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
