- To run all experiments:

```bash
chmod +x scripts/main.sh
uv run scripts/main.sh
```

- To run a model- or decoding-method-specific experiment:

```bash
uv run inference.py \
  --model_card [MODEL_CARD] --input [FACTUAL_QUESTION] --label [LABEL] --dola_layer [DOLA_LAYER_CONFIG] \
  --max_new_tokens [MAX_NEW_TOKENS] --decoding_method [DECODING_METHOD]
```

Each variable can take a value of your choice:
- `MODEL_CARD`: A model card from HuggingFace (small models are not recommended, as the incremental benefits of DoLa may be limited).
- `FACTUAL_QUESTION`: A factual question.
- `LABEL`: A correct answer to the given question.
- `DOLA_LAYER_CONFIG`: Either `high` (use the higher part of the model layers), `low` (use the lower part of the model layers), or `None` (disable DoLa).
- `MAX_NEW_TOKENS`: Maximum number of new tokens allowed. Default = 256.
- `DECODING_METHOD`: The decoding method. Either `greedy` (greedy search) or `sample` (top-p sampling with p = 0.90).
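For example, a complete run of the command above might look like the following (the model card, question, and label are illustrative placeholders; any HuggingFace causal LM card can be substituted):

```bash
uv run inference.py \
  --model_card meta-llama/Llama-2-7b-hf --input "What is the capital of Canada?" --label "Ottawa" \
  --dola_layer high --max_new_tokens 256 --decoding_method greedy
```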
- Selected models perform inference in `src/inference.py`.
- Inference results are stored in the `results_inference` directory by model and decoding method.
- Edit the base script `src/inference.py`.
- Edit the sh script `scripts/main.sh` or add new scripts to the `scripts` directory.
- The default Q&A set for inference is defined in `data/sample_questions.jsonl` by hallucination category:

```json
{"id": 1, "category": "Factual/Entity Errors"}
{"id": 2, "category": "Context Coherence"}
{"id": 3, "category": "Temporal Information"}
{"id": 4, "category": "Complex Reasoning"}
{"id": 5, "category": "Admitting Ignorance"}
```

- You can add, remove, or edit items in `data/sample_questions.jsonl`.
- Inference results are evaluated using auto metrics from HuggingFace and an LLM-as-a-Judge (gpt-5-mini).
- To evaluate all results:

```bash
chmod +x scripts/eval.sh
uv run scripts/eval.sh
```

- Evaluation scores are stored in the `results_eval` directory, separated by model card.
- Edit `src/evaluation.py` to add, remove, or edit any eval metrics (see the sketch below).
- Edit `llm_judge.py` to update the LLM-as-a-judge.
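The actual metric set and judge prompt live in `src/evaluation.py` and `llm_judge.py`; as a rough sketch of the two evaluation paths (the ROUGE metric, prompt wording, and function names below are illustrative assumptions, not the repository's code):

```python
# Illustrative sketch only: the real logic lives in src/evaluation.py and llm_judge.py.
import evaluate                    # HuggingFace's evaluate library
from openai import OpenAI          # assumes OPENAI_API_KEY is set in the environment

def auto_score(prediction: str, label: str) -> dict:
    """Example auto metric: ROUGE between the model answer and the gold label."""
    rouge = evaluate.load("rouge")
    return rouge.compute(predictions=[prediction], references=[label])

def judge_score(question: str, prediction: str, label: str) -> str:
    """Example LLM-as-a-Judge call (gpt-5-mini) returning a 1-5 factuality rating."""
    client = OpenAI()
    prompt = (
        f"Question: {question}\nReference answer: {label}\nModel answer: {prediction}\n"
        "Rate the factual accuracy of the model answer from 1 (hallucinated) to 5 (fully correct). "
        "Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```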
- The following command saves bar graphs (`.jpg` files) in the `results_fig` directory:

```bash
chmod +x scripts/visualize.sh
uv run scripts/visualize.sh
```

- Install all dependencies:

```bash
uv venv
source .venv/bin/activate
uv sync
```

- Add/remove dependencies:

```bash
uv add [PACKAGE]
uv remove [PACKAGE]
```

- Reset the virtual environment:

```bash
rm -rf uv.lock .venv
uv cache clean
uv sync
```

Decoding by Contrasting Layers (DoLa) was first introduced by Chuang et al. (2023) [2] as an inference-time strategy that intervenes in the conditional probability step and enhances the model’s factual knowledge.
The diagram below illustrates how DoLa works:
Figure A. DoLa for a transformer-based LM (Created by Kuriko IWAI)
From a model interpretability perspective, transformer-based language models (LMs) encode lower-level information in the lower (earlier) layers and more semantic information in the higher (later) layers [3], with the topmost layers containing the knowledge neurons that express factual knowledge acquired during pretraining [4].
- Lower layers contain low-level linguistic features, syntax, and local context.
- Higher layers contain high-level semantic features, abstract reasoning, and factual knowledge.
DoLa exploits this modular encoding to amplify factual knowledge through a contrastive decoding approach, in which the conditional probability of the next word is computed from the difference in logits (raw prediction scores) between a higher layer and a lower layer.
In Figure A, greedy search selects “Ottawa” because the last layer (the 32nd layer) of the transformer block predicts the highest conditional probability (72%) for that token.
DoLa, on the other hand, selects “Ottawa” because the adjusted logits, computed as a contrast score between the 32nd and 24th layers, are highest for that token.
This approach emphasizes the factual knowledge of the higher layers while downplaying the knowledge of the lower layers, making the model more factual and reducing hallucinations.
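In practice, DoLa decoding is exposed through the `dola_layers` argument of `generate()` in recent versions of HuggingFace `transformers`, which is presumably what `src/inference.py` builds on. A minimal sketch (the model card and prompt are illustrative, and the argument requires a sufficiently recent `transformers` release):

```python
# Minimal sketch of DoLa decoding with HuggingFace transformers (requires a recent version).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_card = "meta-llama/Llama-2-7b-hf"  # illustrative model card
tokenizer = AutoTokenizer.from_pretrained(model_card)
model = AutoModelForCausalLM.from_pretrained(model_card, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Q: What is the capital of Canada?\nA:", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,          # greedy search
    dola_layers="high",       # contrast against the higher bucket of layers ("low" is the other option)
    repetition_penalty=1.2,   # recommended alongside DoLa to curb repetition
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Setting `dola_layers="low"` contrasts against the lower bucket of layers instead, and omitting the argument disables DoLa entirely, mirroring the `DOLA_LAYER_CONFIG` options above.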
Standard LLMs compute the conditional probability of the next token as:

$$p(x_t = v \mid x_{<t}) = \mathrm{softmax}\bigl(\phi(h_t^{(N)})\bigr)_v, \qquad v \in X \tag{1}$$

where

- $v$ is a specific token drawn from the vocabulary set $X$,
- $x_{<t}$ is the context, i.e. the sequence of all preceding tokens $\{x_1, x_2, \cdots, x_{t-1}\}$,
- $N$ is the final layer (mature layer) of the transformer,
- $h_t^{(N)}$ is the hidden state at step $t$ in the final layer of the transformer with $N$ stacked layers, and
- $\phi(\cdot)$ is the language head (a final linear layer of output size $|X|$) that projects the hidden state $h_t^{(N)}$ into a vector of logits.
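As a toy illustration of Eq. (1), using random weights rather than a trained LM, the language head is simply a linear map from the hidden dimension to the vocabulary size followed by a softmax:

```python
# Toy illustration of Eq. (1) with random weights (not a trained LM).
import torch
import torch.nn as nn

hidden_dim, vocab_size = 16, 100
phi = nn.Linear(hidden_dim, vocab_size, bias=False)  # language head: hidden state -> logits over X

h_t_N = torch.randn(hidden_dim)                       # hidden state of the final (mature) layer N at step t
logits = phi(h_t_N)                                   # raw prediction scores
p_next = torch.softmax(logits, dim=-1)                # Eq. (1): conditional probability over the vocabulary
print(p_next.argmax().item(), p_next.max().item())    # greedy search would pick the argmax token
```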
Instead of the standard Eq. (1), DoLa takes two major steps to compute the next token probability.
First, the prediction distribution of each candidate early layer $j$ is computed by applying the same language head $\phi(\cdot)$ to that layer’s hidden state:

$$q_j(x_t \mid x_{<t}) = \mathrm{softmax}\bigl(\phi(h_t^{(j)})\bigr), \qquad j \in J \tag{2}$$

where $J$ is the set of candidate premature layers (in this repository, the higher or lower bucket of layers chosen by `DOLA_LAYER_CONFIG`).

The premature layer $M$ is then selected dynamically at each decoding step as the candidate layer whose prediction distribution diverges most from that of the mature layer $N$:

$$M = \arg\max_{j \in J} \, \mathrm{JSD}\bigl(q_N(\cdot \mid x_{<t}) \,\|\, q_j(\cdot \mid x_{<t})\bigr) \tag{3}$$

where $\mathrm{JSD}(\cdot \,\|\, \cdot)$ is the Jensen-Shannon divergence between two distributions.

Because DoLa leverages the differences of logits between layers, it expects that a significant difference between the layer $M$ and the mature layer $N$ yields the most informative contrast: knowledge that emerges only in the later layers stands out most strongly against the premature layer that has not yet captured it.
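A minimal numerical sketch of Eqs. (2)-(3), assuming the per-layer hidden states at the current step and a shared language head are already available (the function names, shapes, and random weights below are illustrative, not the library's implementation):

```python
# Sketch of Eqs. (2)-(3): early-exit distributions and JSD-based premature-layer selection.
import torch
import torch.nn.functional as F

def layer_distributions(hidden_states, lm_head):
    """Eq. (2): apply the shared language head to each layer's hidden state at the current step."""
    # hidden_states: list of tensors of shape (hidden_dim,), one per layer (lowest to highest)
    return [F.softmax(lm_head(h), dim=-1) for h in hidden_states]

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability vectors."""
    m = 0.5 * (p + q)
    kl_pm = torch.sum(p * (torch.log(p + eps) - torch.log(m + eps)))
    kl_qm = torch.sum(q * (torch.log(q + eps) - torch.log(m + eps)))
    return 0.5 * (kl_pm + kl_qm)

def select_premature_layer(q_layers, candidate_ids):
    """Eq. (3): pick the candidate layer whose distribution diverges most from the mature (final) layer."""
    q_mature = q_layers[-1]
    return max(candidate_ids, key=lambda j: jsd(q_mature, q_layers[j]))

# Example with random states: 8 layers, candidates drawn from the higher half (like dola_layer = high).
torch.manual_seed(0)
lm_head = torch.nn.Linear(16, 100, bias=False)
states = [torch.randn(16) for _ in range(8)]
q = layer_distributions(states, lm_head)
M = select_premature_layer(q, candidate_ids=range(4, 7))  # exclude the final layer itself
print("selected premature layer:", M)
```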
After selecting the premature layer $M$, DoLa computes the final probability for the next token such that

$$\hat{p}(x_t \mid x_{<t}) = \mathrm{softmax}\bigl(\mathcal{F}\bigl(q_N(x_t), q_M(x_t)\bigr)\bigr)_{x_t} \tag{4}$$

where the contrast operator $\mathcal{F}(\cdot, \cdot)$ is defined as

$$\mathcal{F}\bigl(q_N(x_t), q_M(x_t)\bigr) =
\begin{cases}
\log \dfrac{q_N(x_t)}{q_M(x_t)} & \text{if } x_t \in V_{\text{head}}(x_t \mid x_{<t}) \\
-\infty & \text{otherwise}
\end{cases} \tag{5}$$

where $V_{\text{head}}(x_t \mid x_{<t})$ is the set of candidate tokens, defined with an adaptive plausibility constraint [5] such that:

$$V_{\text{head}}(x_t \mid x_{<t}) = \bigl\{\, x_t \in X : q_N(x_t) \ge \alpha \max_{w} q_N(w) \,\bigr\} \tag{6}$$

where

- $q_N(x_t)$ is the probability of the token $x_t$ in the mature layer $N$ being selected,
- $\alpha \in [0, 1]$ is a confidence threshold (hyperparameter) that defines the lower bound of the probability a candidate token can take, and
- $w$ is any token in the entire vocabulary set $X$.
In other words, Eq. (6) indicates that a token $x_t$ remains a candidate only if the mature layer assigns it a probability of at least $\alpha$ times that of the most likely token; all other tokens are excluded from the contrast by setting their score to $-\infty$.
By computing the log-difference defined in Eq. (5), the model upweights the tokens that the mature layer $N$ predicts highly but the less-informed premature layer $M$ does not.
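Continuing the sketch above, Eqs. (4)-(6) can be written as follows (again an illustrative sketch under the same assumptions, not the library's actual DoLa implementation):

```python
# Sketch of Eqs. (4)-(6): adaptive plausibility constraint and layer-contrast scoring.
import torch
import torch.nn.functional as F

def dola_next_token_distribution(q_mature, q_premature, alpha=0.1, eps=1e-12):
    """Contrast the mature layer N against the premature layer M to get the next-token distribution."""
    # Eq. (6): adaptive plausibility constraint -- keep tokens whose mature-layer probability
    # is at least alpha times the probability of the most likely token.
    head_mask = q_mature >= alpha * q_mature.max()
    # Eq. (5): log-ratio of mature to premature probabilities for plausible tokens, -inf elsewhere.
    scores = torch.full_like(q_mature, float("-inf"))
    scores[head_mask] = torch.log(q_mature[head_mask] + eps) - torch.log(q_premature[head_mask] + eps)
    # Eq. (4): renormalize the contrast scores into a probability distribution.
    return F.softmax(scores, dim=-1)

# Reusing q and M from the previous sketch; greedy search then takes the argmax token.
# p_hat = dola_next_token_distribution(q[-1], q[M], alpha=0.1)
# next_token = p_hat.argmax()
```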
Eq. (3) represents the objective function for dynamically selecting a premature layer $M$ at every decoding step.
On the other hand, DoLa-static runs experiments on all possible early layers using a validation set and picks the one with the best validation performance.
This approach is more intuitive than the dynamic selection, but it has two drawbacks:

- It requires more hyperparameter search runs over layers, and
- The best layers are sensitive to the data distribution, thus requiring in-distribution (ID) validation sets whose samples are drawn from the same underlying probability distribution as the training data.
In common scenarios where perfectly ID validation sets are unavailable, DoLa-static selects different optimal layers when evaluated on different subsets randomly sampled from the original dataset.
Dynamic selection can mitigate these drawbacks by shrinking the search space of the premature layer and making the method more robust without heavily relying on ID validation sets [2].
[1]. Survey of Hallucination in Natural Language Generation (Ji et al., arXiv: 2202.03629)
[2]. DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models (Chuang et al., arXiv: 2309.03883)
[3]. BERT Rediscovers the Classical NLP Pipeline (Tenney et al., ACL 2019)
[4]. Knowledge Neurons in Pretrained Transformers (Dai et al., ACL 2022)
[5]. Contrastive Decoding: Open-ended Text Generation as Optimization (Li et al., ACL 2023)
[6]. CTRL: A Conditional Transformer Language Model for Controllable Generation (Keskar et al., arXiv: 1909.05858)
