📝 Paper (arXiv): https://arxiv.org/abs/2502.00657
🧵 NeurIPS 2025 page: https://neurips.cc/virtual/2025/loc/san-diego/poster/117222
This repository is based on our NeurIPS 2025 paper “LLM Safety Alignment is Divergence Estimation in Disguise” and provides the official implementation of the KL Divergence Optimizer (KLDO) and related experiments.
*(Figure: alignment loss estimating the KL divergence between aligned (preferred/chosen/safe) and unaligned (rejected/harmful) responses.)*
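One standard way to estimate such a divergence is the Donsker–Varadhan variational bound. As a minimal sketch (our illustration here, not necessarily the paper's exact objective), with the implicit reward $r_\theta$ acting as the discriminator:

$$
\mathrm{KL}\left(P_{\text{aligned}} \,\|\, P_{\text{unaligned}}\right) \;\geq\; \mathbb{E}_{y \sim P_{\text{aligned}}}\!\left[r_\theta(x,y)\right] \;-\; \log \mathbb{E}_{y \sim P_{\text{unaligned}}}\!\left[e^{r_\theta(x,y)}\right],
$$

where $r_\theta(x,y) = \beta \log \frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}$ is the implicit reward used by HALO-style trainers. The alignment loss is the negation of the right-hand side, so minimizing it tightens the divergence estimate.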
This repo builds upon the HALOs repo; for the base functionality, refer to the original HALOs repo. We have added our own experiments and trainers to supplement the theory in our paper.
- We have added our own implementations of BCO and KLDO as `BCOTrainer` and `KLTrainer` in `train/trainers.py`; a minimal sketch of the KL-style loss follows below.
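For intuition, here is a hedged sketch of the idea behind `KLTrainer`. The argument names (`policy_chosen_logps`, `ref_chosen_logps`, etc.) are hypothetical, not the repo's exact API; see `train/trainers.py` for the real implementation. Each `*_logps` argument would be the summed token log-probabilities of a batch of responses under the policy or the frozen reference model.

```python
import math
import torch

def kl_alignment_loss(policy_chosen_logps, ref_chosen_logps,
                      policy_rejected_logps, ref_rejected_logps, beta=0.1):
    """Donsker-Varadhan-style estimate of KL(P_chosen || P_rejected),
    using the implicit reward as the discriminator. Illustration only;
    the repo's KLTrainer may differ in details."""
    # Implicit reward per response: beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # DV lower bound: E_chosen[r] - log E_rejected[exp(r)]
    log_mean_exp = torch.logsumexp(rejected_rewards, dim=0) \
        - math.log(rejected_rewards.numel())
    dv_bound = chosen_rewards.mean() - log_mean_exp
    # Training minimizes the negative bound, i.e. maximizes the KL estimate
    return -dv_bound
```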
- To generate compliance-refusal or preference-type datasets, refer to `/dataset_generation/generate.py`. The generated datasets can be read directly from the JSONL files `/dataset_generation/Base_accept_reject.jsonl` and `/dataset_generation/Base_preference.jsonl`, e.g. as in the snippet below.
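For a quick look at the generated data, the files can be loaded line by line as standard JSONL. This snippet makes no assumption about the field names and simply prints the schema:

```python
import json

# One JSON object per line; the same pattern works for Base_preference.jsonl
with open("dataset_generation/Base_accept_reject.jsonl") as f:
    examples = [json.loads(line) for line in f]

print(f"{len(examples)} examples")
print("fields:", sorted(examples[0].keys()))  # inspect the actual schema
```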
- To install the dependencies, clone the repo and run `. install.sh`.
- An example run of the `KLDO` optimizer with the compliance-refusal dataset and default configurations on `Mistralv0.1-7b` can be launched with `bash sample_launch.sh`.
- The script provides a general template: set `loss=kl` for KLDO, or choose any of the losses from `./config/loss/`.
- For other models, choose from any of the models defined in `./config/model/`, or define your own `custom_model_config.yaml` to run inside the script.
- To switch to the preference dataset for training, set `datasets=[pref]` instead of `datasets=[cr]`.
- Once training is done, the model weights are stored in `cache/exp_name/FINAL/`.
- By default, training uses LoRA, so the trained adapter weights need to be merged with the original model weights. To do so, refer to the `merge.sh` script: change the `base`, `lora`, and `out` parameters to the corresponding base model name, LoRA weights directory, and output directory. A Python sketch of the merge step follows below.
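For reference, the merge can also be done directly in Python with PEFT. This is a sketch under the assumption of standard PEFT LoRA checkpoints; the paths are placeholders, and `merge.sh` remains the supported route:

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

lora_dir = "cache/exp_name/FINAL"   # placeholder: your trained LoRA weights
out_dir = "cache/exp_name/merged"   # placeholder: where to write merged weights

# Load base model + adapter, then fold the LoRA weights into the base weights
model = AutoPeftModelForCausalLM.from_pretrained(lora_dir, torch_dtype="auto")
merged = model.merge_and_unload()
merged.save_pretrained(out_dir)

# Keep the tokenizer next to the merged weights so the model dir is
# self-contained (load it from the base model instead if it wasn't saved here)
AutoTokenizer.from_pretrained(lora_dir).save_pretrained(out_dir)
```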
- Once the model weights are merged, they are ready to use in the evaluation experiments:
  - `metrics.sh` can be run with the corresponding parameters to generate separation visualizations, the Bhattacharyya distance, and the silhouette score for the model of interest (a sketch of these metrics follows the list).
  - `/safety_eval/clean_asr_script.sh` can be run with the corresponding parameters to evaluate attack success rates.
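For intuition, here is a generic sketch of the two separation metrics computed over response embeddings. It is illustrative only: `metrics.sh` may compute them differently, and the embedding extraction is out of scope here.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def bhattacharyya_distance(X, Y):
    """Bhattacharyya distance between Gaussians fitted to two embedding sets."""
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    cov_x = np.cov(X, rowvar=False)
    cov_y = np.cov(Y, rowvar=False)
    cov = (cov_x + cov_y) / 2
    diff = mu_x - mu_y
    # Mahalanobis-style mean-separation term
    term1 = diff @ np.linalg.solve(cov, diff) / 8
    # Covariance-overlap term via log-determinants for numerical stability
    _, logdet = np.linalg.slogdet(cov)
    _, logdet_x = np.linalg.slogdet(cov_x)
    _, logdet_y = np.linalg.slogdet(cov_y)
    term2 = 0.5 * (logdet - 0.5 * (logdet_x + logdet_y))
    return term1 + term2

# X, Y: (n_samples, dim) embeddings of aligned vs. unaligned responses
X = np.random.randn(100, 8) + 2.0   # stand-in data for illustration
Y = np.random.randn(100, 8)
print("Bhattacharyya distance:", bhattacharyya_distance(X, Y))

# Silhouette score over the pooled embeddings with class labels
Z = np.vstack([X, Y])
labels = np.array([0] * len(X) + [1] * len(Y))
print("Silhouette score:", silhouette_score(Z, labels))
```

Higher values of both metrics indicate better separation between the aligned and unaligned response clusters.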
- For more custom configurations, such as new models and training parameters, edit the `config.yaml` and `/model/model_name.yaml` files.
If you use KLDO or any part of this repository in your research, please cite:
```bibtex
@article{haldar2025llm,
  title={LLM Safety Alignment is Divergence Estimation in Disguise},
  author={Haldar, Rajdeep and Wang, Ziyi and Song, Qifan and Lin, Guang and Xing, Yue},
  journal={arXiv preprint arXiv:2502.00657},
  year={2025}
}
```