SnakModel is a 7B-parameter, autoregressive language model specifically designed for Danish. It comes in two variants: an instruction-tuned model and a base version for further fine-tuning. Our models build upon Llama 2, which we continuously pre-train on a diverse collection of Danish corpora comprising 350M documents and 13.6B words, before tuning it on 3.7M Danish instruction-answer pairs.
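Since SnakModel follows the Llama 2 architecture, both variants can be used like any Llama-style causal LM, for example via the Hugging Face transformers library. The sketch below is illustrative only; the model identifier is an assumption, so check the model links listed below for the exact Hub repositories.

```python
# Minimal sketch: load the instruction-tuned variant and generate a Danish answer.
# The Hub identifier below is an assumption; see the "SnakModeller" links for the real ones.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NLPnorth/snakmodel-7b-instruct"  # assumed identifier, may differ
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Hvad hedder den danske hovedstad?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```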
Developers
🧭 NLPnorth research unit at the IT University of Copenhagen, Denmark.
🌊 AAU-NLP research unit at Aalborg University Copenhagen, Denmark.
Mike Zhang*, Max Müller-Eberstein*, Elisa Bassignana, Rob van der Goot.
*equal contribution.
- 💬 SnakModeller:
- SnakModel-7B (base): The base LM trained on Danish text completion + its intermediate checkpoints.
- SnakModel-7B (instruct): An instruction-tuned variant of the base model + its intermediate checkpoints.
- ⚙️ Model Training and Analysis Code:
- Research Paper: "SnakModel: Lessons Learned from Training an Open Danish Large Language Model" (Zhang et al., 2025), presented at NoDaLiDa/Baltic-HLT 2025.
- Codebase: this repository.
- Training Data: Find our data here on HF🤗.
- 🇩🇰 DaKultur: A Danish Cultural Awareness Evaluation:
- Research Paper: "DaKultur: Evaluating the Cultural Awareness of Language Models for Danish with Native Speakers" (Müller-Eberstein et al., 2025), presented at NB-REAL (NoDaLiDa 2025) and C3NLP (NAACL 2025).
- Dataset: Find our cultural evaluation dataset here on HF🤗.
- Web-based LLM Evaluation Interface: coming soon.
This repository contains the code needed to replicate all experiments from the research papers related to SnakModel. This includes:
- analyze/:
    - divergence/: contains scripts to analyze the divergence of embedding and model weights before and after adaptation.
    - leakage/: contains the scripts we used to estimate train/test data leakage.
    - prompt.py: an interactive script that allows for prompting model checkpoints locally.
- evaluate/: contains a script to run ScandEval benchmarking on all relevant checkpoints.
- finetune/: contains a script to run instruction fine-tuning using LoRA (a minimal sketch follows after this list).
- plot/: contains scripts to replicate plots from the SnakModel research papers.
- pretrain/: contains our multi-GPU pre-training scripts adapted from Megatron-LLM, as well as data collection and pre-processing.
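As a rough illustration of what the finetune/ step does, the sketch below wraps a Llama-style base model with LoRA adapters via the peft library. It is not the actual training script; the model identifier, target modules, and hyperparameters are assumptions.

```python
# Minimal LoRA sketch (not the actual finetune/ script): wrap a Llama-style model
# with low-rank adapters so that only a small set of parameters is trained.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_id = "NLPnorth/snakmodel-7b-base"  # assumed identifier, may differ
model = AutoModelForCausalLM.from_pretrained(base_id)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # adapter rank (illustrative)
    lora_alpha=16,                        # scaling factor (illustrative)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama-style models
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# The wrapped model can then be trained on instruction-answer pairs,
# e.g. with the transformers Trainer or trl's SFTTrainer.
```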
For pre-training, please refer to the instructions in the pretrain/ folder. We recommend running pre-training in a separate, Docker-based installation.
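For a containerized setup, something along the following lines could work, assuming a Dockerfile is provided in pretrain/; the image name and mount point are placeholders, and the instructions in pretrain/ remain authoritative.

```bash
# Illustrative only: build a dedicated image and enter it with GPU access.
# Image name and paths are placeholders; see pretrain/ for the actual setup.
docker build -t snakmodel-pretrain ./pretrain
docker run --gpus all -it --rm -v "$(pwd)":/workspace snakmodel-pretrain bash
```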
For instruction tuning and subsequent analyses, simply install the required packages, ideally in a virtual environment.
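For example, a virtual environment can be created and activated with Python's built-in venv module (a minimal sketch; any environment manager works):

```bash
# Create and activate a virtual environment for the fine-tuning and analysis scripts.
python3 -m venv venv
source venv/bin/activate
```

With the environment active, install the dependencies: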
(venv) $ pip install -r requirements.txt

If you find the work in this repository useful, please don't forget to cite:
@inproceedings{snakmodel,
title={{S}nak{M}odel: Lessons Learned from Training an Open Danish Large Language Model},
author={Mike Zhang and Max M{\"u}ller-Eberstein and Elisa Bassignana and Rob van der Goot},
booktitle={The Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies},
year={2025},
url={https://openreview.net/forum?id=YxzfgQGpRQ}
}