Code and resources for genomic language models GPN, GPN-MSA, PhyloGPN and GPN-Star.
Install directly from GitHub:
pip install git+https://github.com/songlab-cal/gpn.git
For development (editable install):
git clone https://github.com/songlab-cal/gpn.git
cd gpn
pip install -e .
| Model | Paper | Notes |
|---|---|---|
| GPN | Benegas et al. 2023 | Requires unaligned genomes |
| GPN-MSA | Benegas et al. 2025 | Requires aligned genomes for both training and inference [deprecated in favor of GPN-Star] |
| PhyloGPN | Albors et al. 2025 | Uses an alignment during training, but does not require it for inference or fine-tuning |
| GPN-Star | Ye et al. 2025 | Requires aligned genomes for both training and inference |
A single-sequence genomic language model trained on unaligned genomes. Also known as GPN-SS.
import gpn.model # registers architecture for AutoModel
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-brassicales")
- Play with the model: examples/ss/basic_example.ipynb
- Model implementation: gpn/model.py, gpn/ss
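The notebook above is the canonical walkthrough; as a rough sketch of zero-shot masked nucleotide prediction (assuming the checkpoint bundles a character-level DNA tokenizer with a mask token and no special tokens around the sequence; the toy sequence below is a placeholder):

import torch
import gpn.model  # registers the GPN architecture for AutoModel classes
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_path = "songlab/gpn-brassicales"
tokenizer = AutoTokenizer.from_pretrained(model_path)  # assumed to be bundled with the checkpoint
model = AutoModelForMaskedLM.from_pretrained(model_path)
model.eval()

seq = "ACGT" * 128  # toy sequence; in practice use a real genomic window
input_ids = tokenizer(seq, return_tensors="pt")["input_ids"]
center = input_ids.shape[1] // 2
input_ids[0, center] = tokenizer.mask_token_id  # mask the center position

with torch.no_grad():
    logits = model(input_ids=input_ids).logits
probs = torch.softmax(logits[0, center], dim=-1)
# map indices back to tokens via the tokenizer vocabulary
for token, idx in sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1]):
    print(token, f"{probs[idx].item():.3f}")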
Benegas, Batra and Song "DNA language models are powerful predictors of genome-wide variant effects" PNAS (2023)
- Collection: HuggingFace 🤗
- Pretraining dataset: Arabidopsis and 7 other Brassicales (genomes-brassicales-balanced-v1)
- Models:
- Analysis code:
- Additional resources:
Benegas, Eraslan and Song "Benchmarking DNA sequence models for causal regulatory variant prediction in human genetics" bioRxiv (2025)
- Collection: HuggingFace 🤗
- Pretraining dataset: Animal promoter sequences (gpn-animal-promoter-dataset)
- Models:
- Benchmark datasets:
- Analysis code:
- Additional resources:
- Collection: HuggingFace 🤗
- Finetuning dataset: Sorghum gene expression data from Gene Expression Atlas (gxa-sorghum-v1)
- Models:
- gpn-brassicales-gxa-sorghum-v1 (fine-tuned from gpn-brassicales)
- Analysis code:
1. Create a dataset
Use the Snakemake workflow to create a dataset:
- Can automatically download data from NCBI given a list of accessions, or use your own FASTA files
- Navigate to workflow/make_dataset/, configure config/config.yaml and config/assemblies.tsv (see the sketch after this list), then run: snakemake --cores all
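Purely as illustration, the assemblies table could be generated programmatically; the column names below are hypothetical, and the actual schema is defined by the templates shipped in workflow/make_dataset/config:

# Hypothetical sketch only: check the template in workflow/make_dataset/config
# for the real column names and required fields, and adapt accordingly.
import pandas as pd

assemblies = pd.DataFrame({
    "Assembly Accession": ["GCF_000001735.4"],  # NCBI accession to download automatically
    "Assembly Name": ["TAIR10.1"],
})
assemblies.to_csv("config/assemblies.tsv", sep="\t", index=False)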
2. Train the model
Training features:
- Automatically detects all available GPUs
- Tracks metrics on Weights & Biases
- Implemented encoders: convnet (default), roformer (Transformer), bytenet
- Specify config overrides, e.g. --config_overrides encoder=bytenet,num_hidden_layers=30
- The number of steps you can train without overfitting depends on the size and diversity of your dataset
Example command:
WANDB_PROJECT=your_project torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') -m gpn.ss.run_mlm --do_train --do_eval \
--report_to wandb --prediction_loss_only True --remove_unused_columns False \
--dataset_name results/dataset --tokenizer_name gonzalobenegas/tokenizer-dna-mlm \
--soft_masked_loss_weight_train 0.1 --soft_masked_loss_weight_evaluation 0.0 \
--weight_decay 0.01 --optim adamw_torch \
--dataloader_num_workers 16 --seed 42 \
--save_strategy steps --save_steps 10000 --evaluation_strategy steps \
--eval_steps 10000 --logging_steps 10000 --max_steps 120000 --warmup_steps 1000 \
--learning_rate 1e-3 --lr_scheduler_type constant_with_warmup \
--run_name your_run --output_dir your_output_dir --model_type GPN \
--per_device_train_batch_size 512 --per_device_eval_batch_size 512 --gradient_accumulation_steps 1 --total_batch_size 2048 \
--torch_compile \
--ddp_find_unused_parameters False \
--bf16 --bf16_full_eval
3. Extract embeddings
Input file requires chrom, start, end columns.
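For example, a minimal windows.parquet could be written with pandas (assuming BED-style 0-based, half-open start/end coordinates; chromosome names must match your FASTA):

# Minimal sketch of an input file with the required chrom, start, end columns.
import pandas as pd

windows = pd.DataFrame({
    "chrom": ["1", "1", "2"],
    "start": [0, 100, 0],
    "end": [100, 200, 100],
})
windows.to_parquet("windows.parquet", index=False)  # requires pyarrow or fastparquet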
Example command:
torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') -m gpn.ss.get_embeddings \
windows.parquet genome.fa.gz 100 your_output_dir results.parquet \
--per_device_batch_size 4000 --is_file --dataloader_num_workers 16
4. Variant effect prediction
Input file requires chrom, pos, ref, alt columns.
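For example (assuming 1-based positions with ref/alt given on the forward strand of the supplied genome):

# Minimal sketch of an input file with the required chrom, pos, ref, alt columns.
import pandas as pd

variants = pd.DataFrame({
    "chrom": ["1", "2"],
    "pos": [12345, 67890],
    "ref": ["A", "C"],
    "alt": ["G", "T"],
})
variants.to_parquet("variants.parquet", index=False)  # requires pyarrow or fastparquet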
Example command:
torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') -m gpn.ss.run_vep \
variants.parquet genome.fa.gz 512 your_output_dir results.parquet \
--per_device_batch_size 4000 --is_file --dataloader_num_workers 16
A genomic language model trained on whole-genome alignments across multiple species.
import gpn.model # registers architecture for AutoModel
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-msa-sapiens")
- Play with the model: examples/msa/basic_example.ipynb
- Variant effect prediction: examples/msa/vep.ipynb
- Training (human): examples/msa/training.ipynb
- Model implementation: gpn/model.py, gpn/msa
Benegas, Albors, Aw, Ye and Song "A DNA language model based on multispecies alignment predicts the effects of genome-wide variants" Nature Biotechnology (2025)
- Collection: HuggingFace 🤗
- Pretraining datasets:
- 100-way vertebrate alignment: multiz100way, 89.zarr
- Training regions: gpn-msa-sapiens-dataset
- Models:
- Benchmark datasets (including predictions from all models):
- Analysis code:
- Additional resources:
- See #28, #40, #44
- Another source for plant alignments: https://plantregmap.gao-lab.org/download.php#alignment-conservation
A phylogenetic genomic language model that uses an alignment during training but does not require it for inference or fine-tuning. PhyloGPN is a convolutional neural network, trained on the Zoonomia alignment, that outputs rate matrix parameters for Felsenstein's F81 substitution model. It can be used for transfer learning and zero-shot variant deleteriousness prediction, which is especially useful for sequences not present in reference genomes.
from transformers import AutoModel
model = AutoModel.from_pretrained("songlab/PhyloGPN", trust_remote_code=True)
- Play with the model: examples/phylogpn/basic_example.ipynb
- Model implementation: gpn/phylogpn.py
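For background, in Felsenstein's F81 model the substitution rate from base i to base j (i ≠ j) is proportional to the equilibrium frequency of j. The snippet below is illustrative only (it is not part of the PhyloGPN API) and shows how such a rate matrix is built from equilibrium frequencies:

# Illustrative only: build an F81 rate matrix Q from equilibrium base frequencies pi,
# with q_ij proportional to pi_j for i != j and each row summing to zero.
import numpy as np

def f81_rate_matrix(pi, mu=1.0):
    pi = np.asarray(pi, dtype=float)
    Q = mu * np.tile(pi, (4, 1))            # q_ij = mu * pi_j for i != j
    np.fill_diagonal(Q, 0.0)
    Q[np.diag_indices(4)] = -Q.sum(axis=1)  # rows sum to zero
    return Q

print(f81_rate_matrix([0.3, 0.2, 0.2, 0.3]))  # example frequencies for A, C, G, T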
Albors, Li, Benegas, Ye and Song "A Phylogenetic Approach to Genomic Language Modeling" RECOMB (2025)
- Models:
A phylogeny-aware genomic language model trained on whole-genome alignments across multiple evolutionary timescales.
import gpn.star.model # registers architecture for AutoModel
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-star-hg38-p243-200m")
- Play with the model: examples/star/demo.ipynb
- Model implementation: gpn/model.py, gpn/star
Ye, Benegas, Albors, Li, Prillo, Fields, Clarke and Song "Predicting functional constraints across evolutionary timescales with phylogeny-informed genomic language models" bioRxiv (2025)
- Collection: HuggingFace 🤗
- Pretraining datasets:
- Vertebrate alignment: multiz100way (100 species)
- Mammalian alignment: cactus447way (447 species)
- Models:
- Human (hg38):
- gpn-star-hg38-v100-200m (vertebrate, 200M params)
- gpn-star-hg38-m447-200m (mammalian, 200M params)
- gpn-star-hg38-p243-200m (primate, 200M params)
- Model organisms:
- gpn-star-mm39-v35-85m (mouse, 85M params)
- gpn-star-galGal6-v77-85m (chicken, 85M params)
- gpn-star-dm6-i124-85m (fly, 85M params)
- gpn-star-ce11-n135-25m (worm, 25M params)
- gpn-star-tair10-b18-25m (arabidopsis, 25M params)
- Benchmark datasets (including predictions from all models):
- Included in collection
- Homo sapiens:
- clinvar_vs_benign - Missense variant pathogenicity classification (Pathogenic vs. Benign)
- cosmic - Cancer somatic mutations (COSMIC frequent vs. gnomAD common missense)
- omim_traitgym - Mendelian regulatory variants (pathogenic vs. common)
- ukb_finemapped_coding - UK Biobank fine-mapped coding variants
- ukb_finemapped_nc_traitgym - UK Biobank fine-mapped non-coding variants
- gnomad_balanced - Allele frequency enrichment analysis (rare vs. common variants)
- ldsc - S-LDSC variants and model predictions for heritability analysis
- Mus musculus:
- wmgp_balanced - Wild Mouse Genome Project population allele frequencies
- mmrdb - Mouse Mutant Resource Database pathogenic variants
- Drosophila melanogaster:
- dest - Drosophila Evolution in Space and Time allele frequencies
- flybase_lethal - Experimentally validated lethal mutations
- Caenorhabditis elegans:
- caendr - C. elegans Natural Diversity Resource allele frequencies
- celegans_lethal - 72 experimentally validated lethal SNVs
- Gallus gallus:
- galbase - Chicken population allele frequencies
- Arabidopsis thaliana:
- 1001gp - Population allele frequencies from the 1001 Genomes Project
- Analysis code:
- Model training and main results on variant effect prediction: analysis/gpn-star/train_and_eval
- Complex trait heritability analysis (S-LDSC): analysis/gpn-star/s-ldsc
- Whole-genome alignment processing: analysis/gpn-star/wga_processing
- Model interpretation: analysis/gpn-star/interpretation
- Questions? Open a Discussion for usage questions, ideas, or general help
- Issues? Report bugs or request features via Issues
GPN:
@article{benegas2023dna,
title={DNA language models are powerful predictors of genome-wide variant effects},
author={Benegas, Gonzalo and Batra, Sanjit Singh and Song, Yun S},
journal={Proceedings of the National Academy of Sciences},
volume={120},
number={44},
pages={e2311219120},
year={2023},
publisher={National Acad Sciences}
}
GPN-MSA:
@article{benegas2025dna,
title={A DNA language model based on multispecies alignment predicts the effects of genome-wide variants},
author={Benegas, Gonzalo and Albors, Carlos and Aw, Alan J and Ye, Chengzhong and Song, Yun S},
journal={Nature Biotechnology},
pages={1--6},
year={2025},
publisher={Nature Publishing Group US New York}
}
PhyloGPN:
@inproceedings{albors2025phylogenetic,
title={A Phylogenetic Approach to Genomic Language Modeling},
author={Albors, Carlos and Li, Jianan Canal and Benegas, Gonzalo and Ye, Chengzhong and Song, Yun S},
booktitle={International Conference on Research in Computational Molecular Biology},
pages={99--117},
year={2025},
organization={Springer}
}
GPN-Star:
@article{ye2025predicting,
title={Predicting functional constraints across evolutionary timescales with phylogeny-informed genomic language models},
author={Ye, Chengzhong and Benegas, Gonzalo and Albors, Carlos and Li, Jianan Canal and Prillo, Sebastian and Fields, Peter D and Clarke, Brian and Song, Yun S},
journal={bioRxiv},
pages={2025--09},
year={2025},
publisher={Cold Spring Harbor Laboratory}
}