SnakModel Logo

SnakModel is a 7B-parameter, autoregressive language model specifically designed for Danish. It is available both as an instruction-tuned variant and as a base version for further fine-tuning. Our models build on Llama 2, which we continually pre-train on a diverse collection of Danish corpora comprising 350M documents and 13.6B words, before tuning it on 3.7M Danish instruction-answer pairs.
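
For a quick local test, the models load with the standard Hugging Face transformers API. The following is a minimal sketch, assuming the instruction-tuned model is published under the NLPnorth organization on the Hugging Face Hub; the exact model ID below is an assumption, and the instruction variant may expect a Llama 2-style prompt template that is omitted here.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NLPnorth/snakmodel-7b-instruct"  # assumed Hub ID; check the NLPnorth organization

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Hvad er hovedstaden i Danmark?"  # "What is the capital of Denmark?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))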

Developers

🧭 NLPnorth research unit at the IT University of Copenhagen, Denmark.
🌊 AAU-NLP research unit at Aalborg University Copenhagen, Denmark.

Mike Zhang*, Max Müller-Eberstein*, Elisa Bassignana, Rob van der Goot.
*equal contribution.

Resources

  • 💬 SnakModeller:
  • ⚙️ Model Training and Analysis Code:
    • Research Paper: "SnakModel: Lessons Learned from Training an Open Danish Large Language Model" (Zhang et al., 2025), presented at NoDaLiDa/Baltic-HLT 2025.
    • Codebase: this repository.
    • Training Data: Find our data here on HF🤗.
  • 🇩🇰 DaKultur: A Danish Cultural Awareness Evaluation:
    • Research Paper: "DaKultur: Evaluating the Cultural Awareness of Language Models for Danish with Native Speakers" (Müller-Eberstein et al., 2025), presented at NB-REAL (NoDaLiDa 2025) and C3NLP (NAACL 2025).
    • Dataset: Find our cultural evaluation dataset here on HF🤗.
    • Web-based LLM Evaluation Interface: coming soon.

Codebase

This repository contains the code to replicate all experiments from the SnakModel research papers. It includes:

  • analyze/:
    • divergence/: contains scripts to analyze the divergence of embedding and model weights before and after adaptation (a simplified sketch follows this list).
    • leakage/: contains the scripts we used to estimate train/test data leakage.
    • prompt.py: an interactive script for prompting model checkpoints locally.
  • evaluate/: contains a script to run ScandEval benchmarking on all relevant checkpoints.
  • finetune/: contains a script to run instruction fine-tuning using LoRA (a minimal LoRA sketch follows this list).
  • plot/: contains scripts to replicate plots from the SnakModel research papers.
  • pretrain/: contains our multi-GPU pre-training scripts adapted from Megatron-LLM. It also includes data collection and pre-processing scripts.
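
The divergence analysis compares embedding and model weights before and after Danish adaptation. A simplified sketch of such a comparison is shown below, using placeholder checkpoint IDs; the actual scripts and metrics in analyze/divergence/ may differ.

import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoint IDs: base Llama 2 ("before") and the adapted model ("after", assumed Hub ID).
before = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
after = AutoModelForCausalLM.from_pretrained("NLPnorth/snakmodel-7b-base")

for (name, p_before), (_, p_after) in zip(
    before.named_parameters(), after.named_parameters()
):
    # Cosine similarity between flattened weight tensors as a simple divergence proxy.
    cos = torch.nn.functional.cosine_similarity(
        p_before.flatten(), p_after.flatten(), dim=0
    )
    print(f"{name}: cosine similarity = {cos.item():.4f}")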

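For the LoRA-based instruction tuning in finetune/, the adapter is configured through the usual PEFT pattern. The following is a minimal sketch with assumed hyperparameters and an assumed base-model ID; the values actually used in finetune/ may differ.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Continually pre-trained base model to attach adapters to (assumed Hub ID).
base_model = AutoModelForCausalLM.from_pretrained("NLPnorth/snakmodel-7b-base")

lora_config = LoraConfig(
    r=16,                                 # low-rank dimension (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,                    # adapter dropout (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable
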
Installation

For pre-training, please refer to the instructions in the pretrain/ folder. We recommend running pre-training in a separate, Docker-based installation.

For instruction tuning and subsequent analyses, simply install the required packages (ideally in a virtual environment):

$ python3 -m venv venv && source venv/bin/activate
(venv) $ pip install -r requirements.txt

Citation

If you find the work in this repository useful, please don't forget to cite:

@inproceedings{snakmodel,
  title={{S}nak{M}odel: Lessons Learned from Training an Open Danish Large Language Model},
  author={Mike Zhang and Max M{\"u}ller-Eberstein and Elisa Bassignana and Rob van der Goot},
  booktitle={The Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies},
  year={2024},
  url={https://openreview.net/forum?id=YxzfgQGpRQ}
}
