📝 Paper (arXiv): https://arxiv.org/abs/2502.00657
🧵 NeurIPS 2025 page: https://neurips.cc/virtual/2025/loc/san-diego/poster/117222
This repository is based on our NeurIPS 2025 paper “LLM Safety Alignment is Divergence Estimation in Disguise” and provides the official implementation of the KL Divergence Optimizer (KLDO) and related experiments.
*(Figure: alignment loss estimating the KL divergence between aligned (preferred/chosen/safe) and unaligned (rejected/harmful) responses.)*
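One standard way to estimate such a divergence is the Donsker–Varadhan variational bound. As a minimal sketch (our illustration here, not necessarily the paper's exact objective), with the implicit reward $r_\theta$ acting as the discriminator:

$$
\mathrm{KL}\left(P_{\text{aligned}} \,\|\, P_{\text{unaligned}}\right) \;\geq\; \mathbb{E}_{y \sim P_{\text{aligned}}}\!\left[r_\theta(x,y)\right] \;-\; \log \mathbb{E}_{y \sim P_{\text{unaligned}}}\!\left[e^{r_\theta(x,y)}\right],
$$

where $r_\theta(x,y) = \beta \log \frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}$ is the implicit reward used by HALO-style trainers. The alignment loss is the negation of the right-hand side, so minimizing it tightens the divergence estimate.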
This repo builds upon the HALOs repo; for the base functionality, refer to the original HALOs repo. We have added our own experiments and trainers to supplement the theory in our paper.
- We have added our own implementations of BCO and KLDO as `BCOTrainer` and `KLTrainer` in `train/trainers.py`; a minimal sketch of the KL-style loss follows below.
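For intuition, here is a hedged sketch of the idea behind `KLTrainer`. The argument names (`policy_chosen_logps`, `ref_chosen_logps`, etc.) are hypothetical, not the repo's exact API; see `train/trainers.py` for the real implementation. Each `*_logps` argument would be the summed token log-probabilities of a batch of responses under the policy or the frozen reference model.

```python
import math
import torch

def kl_alignment_loss(policy_chosen_logps, ref_chosen_logps,
                      policy_rejected_logps, ref_rejected_logps, beta=0.1):
    """Donsker-Varadhan-style estimate of KL(P_chosen || P_rejected),
    using the implicit reward as the discriminator. Illustration only;
    the repo's KLTrainer may differ in details."""
    # Implicit reward per response: beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # DV lower bound: E_chosen[r] - log E_rejected[exp(r)]
    log_mean_exp = torch.logsumexp(rejected_rewards, dim=0) \
        - math.log(rejected_rewards.numel())
    dv_bound = chosen_rewards.mean() - log_mean_exp
    # Training minimizes the negative bound, i.e. maximizes the KL estimate
    return -dv_bound
```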
- To generate compliance-refusal or preference-type datasets, refer to `/dataset_generation/generate.py`. The generated datasets can be read directly from the JSONL files `/dataset_generation/Base_accept_reject.jsonl` and `/dataset_generation/Base_preference.jsonl`, e.g. as in the snippet below.
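For a quick look at the generated data, the files can be loaded line by line as standard JSONL. This snippet makes no assumption about the field names and simply prints the schema:

```python
import json

# One JSON object per line; the same pattern works for Base_preference.jsonl
with open("dataset_generation/Base_accept_reject.jsonl") as f:
    examples = [json.loads(line) for line in f]

print(f"{len(examples)} examples")
print("fields:", sorted(examples[0].keys()))  # inspect the actual schema
```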
- To install the dependencies, clone the repo and run `. install.sh`.
- An example run of the `KLDO` optimizer with the compliance-refusal dataset and default configurations on `Mistralv0.1-7b` can be launched with `bash sample_launch.sh`.
- The script provides a general template: set `loss=kl` for KLDO, or choose any of the losses from `./config/loss/`.
- For other models, choose from any of the models defined in `./config/model/`, or define your own `custom_model_config.yaml` to run inside the script.
- To switch to the preference dataset for training, set `datasets=[pref]` instead of `datasets=[cr]`.
- Once training is done, the model weights are stored in `cache/exp_name/FINAL/`.
- By default, training uses LoRA, so the trained adapter weights need to be merged with the original model weights. To do so, refer to the `merge.sh` script: change the `base`, `lora`, and `out` parameters to the corresponding base model name, LoRA weights directory, and output directory. A Python sketch of the merge step follows below.
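For reference, the merge can also be done directly in Python with PEFT. This is a sketch under the assumption of standard PEFT LoRA checkpoints; the paths are placeholders, and `merge.sh` remains the supported route:

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

lora_dir = "cache/exp_name/FINAL"   # placeholder: your trained LoRA weights
out_dir = "cache/exp_name/merged"   # placeholder: where to write merged weights

# Load base model + adapter, then fold the LoRA weights into the base weights
model = AutoPeftModelForCausalLM.from_pretrained(lora_dir, torch_dtype="auto")
merged = model.merge_and_unload()
merged.save_pretrained(out_dir)

# Keep the tokenizer next to the merged weights so the model dir is
# self-contained (load it from the base model instead if it wasn't saved here)
AutoTokenizer.from_pretrained(lora_dir).save_pretrained(out_dir)
```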
- Once the model weights are merged, they are ready to use in the evaluation experiments:
  - `metrics.sh` can be run with the corresponding parameters to generate separation visualizations, the Bhattacharyya distance, and the silhouette score for the model of interest (a sketch of these metrics follows the list).
  - `/safety_eval/clean_asr_script.sh` can be run with the corresponding parameters to evaluate attack success rates.
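For intuition, here is a generic sketch of the two separation metrics computed over response embeddings. It is illustrative only: `metrics.sh` may compute them differently, and the embedding extraction is out of scope here.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def bhattacharyya_distance(X, Y):
    """Bhattacharyya distance between Gaussians fitted to two embedding sets."""
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    cov_x = np.cov(X, rowvar=False)
    cov_y = np.cov(Y, rowvar=False)
    cov = (cov_x + cov_y) / 2
    diff = mu_x - mu_y
    # Mahalanobis-style mean-separation term
    term1 = diff @ np.linalg.solve(cov, diff) / 8
    # Covariance-overlap term via log-determinants for numerical stability
    _, logdet = np.linalg.slogdet(cov)
    _, logdet_x = np.linalg.slogdet(cov_x)
    _, logdet_y = np.linalg.slogdet(cov_y)
    term2 = 0.5 * (logdet - 0.5 * (logdet_x + logdet_y))
    return term1 + term2

# X, Y: (n_samples, dim) embeddings of aligned vs. unaligned responses
X = np.random.randn(100, 8) + 2.0   # stand-in data for illustration
Y = np.random.randn(100, 8)
print("Bhattacharyya distance:", bhattacharyya_distance(X, Y))

# Silhouette score over the pooled embeddings with class labels
Z = np.vstack([X, Y])
labels = np.array([0] * len(X) + [1] * len(Y))
print("Silhouette score:", silhouette_score(Z, labels))
```

Higher values of both metrics indicate better separation between the aligned and unaligned response clusters.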
- For more custom configurations, such as new models and training parameters, edit the `config.yaml` and `/model/model_name.yaml` files.
If you use KLDO or any part of this repository in your research, please cite:
```bibtex
@article{haldar2025llm,
  title={LLM Safety Alignment is Divergence Estimation in Disguise},
  author={Haldar, Rajdeep and Wang, Ziyi and Song, Qifan and Lin, Guang and Xing, Yue},
  journal={arXiv preprint arXiv:2502.00657},
  year={2025}
}
```