SnakModel Logo

SnakModel is a 7B-parameter, autoregressive language model specifically designed for Danish. It is available both as an instruction-tuned variant and as a base version for further fine-tuning. Our models build on Llama 2, which we continually pre-train on a diverse collection of Danish corpora comprising 350M documents and 13.6B words, before tuning it on 3.7M Danish instruction-answer pairs.
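
For a quick local test, the models load with the standard Hugging Face transformers API. The following is a minimal sketch, assuming the instruction-tuned model is published under the NLPnorth organization on the Hugging Face Hub; the exact model ID below is an assumption, and the instruction variant may expect a Llama 2-style prompt template that is omitted here.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NLPnorth/snakmodel-7b-instruct"  # assumed Hub ID; check the NLPnorth organization

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Hvad er hovedstaden i Danmark?"  # "What is the capital of Denmark?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))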

Developers

🧭 NLPnorth research unit at the IT University of Copenhagen, Denmark.
🌊 AAU-NLP research unit at Aalborg University Copenhagen, Denmark.

Mike Zhang*, Max Müller-Eberstein*, Elisa Bassignana, Rob van der Goot.
*equal contribution.

Resources

  • 💬 SnakModeller:
  • ⚙️ Model Training and Analysis Code:
    • Research Paper: "SnakModel: Lessons Learned from Training an Open Danish Large Language Model" (Zhang et al., 2025), presented at NoDaLiDa/Baltic-HLT 2025.
    • Codebase: this repository.
    • Training Data: Find our data here on HF🤗.
  • 🇩🇰 DaKultur: A Danish Cultural Awareness Evaluation:
    • Research Paper: "DaKultur: Evaluating the Cultural Awareness of Language Models for Danish with Native Speakers" (Müller-Eberstein et al., 2025), presented at NB-REAL (NoDaLiDa 2025) and C3NLP (NAACL 2025).
    • Dataset: Find our cultural evaluation dataset here on HF🤗.
    • Web-based LLM Evaluation Interface: coming soon.

Codebase

This repository contains the code to replicate all experiments from the SnakModel research papers. It includes:

  • analyze/:
    • divergence/: contains scripts to analyze the divergence of embedding and model weights before and after adaptation (a simplified sketch follows this list).
    • leakage/: contains the scripts we used to estimate train/test data leakage.
    • prompt.py: an interactive script for prompting model checkpoints locally.
  • evaluate/: contains a script to run ScandEval benchmarking on all relevant checkpoints.
  • finetune/: contains a script to run instruction fine-tuning using LoRA (a minimal LoRA sketch follows this list).
  • plot/: contains scripts to replicate plots from the SnakModel research papers.
  • pretrain/: contains our multi-GPU pre-training scripts adapted from Megatron-LLM. It also includes data collection and pre-processing scripts.
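
The divergence analysis compares embedding and model weights before and after Danish adaptation. A simplified sketch of such a comparison is shown below, using placeholder checkpoint IDs; the actual scripts and metrics in analyze/divergence/ may differ.

import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoint IDs: base Llama 2 ("before") and the adapted model ("after", assumed Hub ID).
before = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
after = AutoModelForCausalLM.from_pretrained("NLPnorth/snakmodel-7b-base")

for (name, p_before), (_, p_after) in zip(
    before.named_parameters(), after.named_parameters()
):
    # Cosine similarity between flattened weight tensors as a simple divergence proxy.
    cos = torch.nn.functional.cosine_similarity(
        p_before.flatten(), p_after.flatten(), dim=0
    )
    print(f"{name}: cosine similarity = {cos.item():.4f}")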

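For the LoRA-based instruction tuning in finetune/, the adapter is configured through the usual PEFT pattern. The following is a minimal sketch with assumed hyperparameters and an assumed base-model ID; the values actually used in finetune/ may differ.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Continually pre-trained base model to attach adapters to (assumed Hub ID).
base_model = AutoModelForCausalLM.from_pretrained("NLPnorth/snakmodel-7b-base")

lora_config = LoraConfig(
    r=16,                                 # low-rank dimension (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,                    # adapter dropout (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable
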
Installation

For pre-training, please refer to the instructions in the pretrain/ folder. We recommend running pre-training in a separate, Docker-based installation.

For instruction tuning and subsequent analyses, simply install the required packages (ideally in a virtual environment):

$ python3 -m venv venv && source venv/bin/activate
(venv) $ pip install -r requirements.txt

Citation

If you find the work in this repository useful, please don't forget to cite:

@inproceedings{snakmodel,
  title={{S}nak{M}odel: Lessons Learned from Training an Open Danish Large Language Model},
  author={Mike Zhang and Max M{\"u}ller-Eberstein and Elisa Bassignana and Rob van der Goot},
  booktitle={The Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies},
  year={2024},
  url={https://openreview.net/forum?id=YxzfgQGpRQ}
}
