[EMNLP 2025] The Medium is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure

y-fn/deconfounding-embeddings

Introduction

Embedding-based similarity metrics between text sequences can be affected not only by the content dimensions of interest but also by spurious attributes such as source or language. These document-level confounders pose challenges for many applications, particularly those combining texts from different corpora. We demonstrate that a debiasing algorithm removing information about observed confounders from encoder representations significantly improves similarity and clustering metrics across tasks without degrading out-of-distribution performance.

Erasure Pipeline

The current pipeline generates and saves the following visualizations:

  • K-means clusterings, both before and after LEACE erasure;
  • PCA projections, before and after LEACE erasure;
  • Top-k retrieval results, before and after erasure;
  • Total number of exact pairs, before and after erasure.
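
As a rough illustration of what "before and after LEACE erasure" means, the sketch below applies LEACE in the special case of a single scalar confounder (for example, a binary source label). In that case the update reduces to x' = x - σ (vᵀ(x - μ)) / (σᵀv), where σ is the cross-covariance between embeddings and the confounder and v = Σ_xx⁻¹ σ. This is not the code in this repository: the function names (`fit_scalar_leace`, `erase`, `cross_cov`), the toy data, the one-confounder restriction, and the assumption that the embedding covariance is invertible are all illustrative choices. It uses only the Python standard library so that it stays self-contained.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def solve(A, b):
    """Solve A v = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    v = [0.0] * n
    for i in reversed(range(n)):
        v[i] = (M[i][n] - dot(M[i][i + 1:n], v[i + 1:])) / M[i][i]
    return v

def center(X, z):
    """Return the embedding mean plus mean-centered embeddings and labels."""
    n = len(X)
    mu = [sum(col) / n for col in zip(*X)]
    z_bar = sum(z) / n
    Xc = [[xi - mi for xi, mi in zip(x, mu)] for x in X]
    zc = [zi - z_bar for zi in z]
    return mu, Xc, zc

def cross_cov(X, z):
    """Covariance between each embedding dimension and the confounder z."""
    _, Xc, zc = center(X, z)
    return [dot(col, zc) / len(X) for col in zip(*Xc)]

def fit_scalar_leace(X, z):
    """Fit the eraser for one scalar confounder (assumes invertible covariance)."""
    mu, Xc, _ = center(X, z)
    n = len(X)
    cols = list(zip(*Xc))
    sigma_xx = [[dot(a, b) / n for b in cols] for a in cols]
    sigma_xz = cross_cov(X, z)
    v = solve(sigma_xx, sigma_xz)  # v = sigma_xx^{-1} sigma_xz
    return mu, sigma_xz, v

def erase(x, mu, sigma_xz, v):
    """Remove the linearly decodable confounder signal from one embedding."""
    coef = dot(v, [xi - mi for xi, mi in zip(x, mu)]) / dot(sigma_xz, v)
    return [xi - coef * s for xi, s in zip(x, sigma_xz)]

# Toy data (hypothetical): 3-d "embeddings" whose first and third
# dimensions leak a binary source label z.
random.seed(0)
z = [i % 2 for i in range(200)]
X = [[random.gauss(0, 1) + 2.0 * zi, random.gauss(0, 1), random.gauss(0, 1) - zi]
     for zi in z]

mu, sigma_xz, v = fit_scalar_leace(X, z)
X_erased = [erase(x, mu, sigma_xz, v) for x in X]

before = cross_cov(X, z)
after = cross_cov(X_erased, z)
print("max |cov(x, z)| before:", max(abs(c) for c in before))
print("max |cov(x, z)| after: ", max(abs(c) for c in after))
```

After erasure, the covariance between every embedding dimension and the label is zero up to floating-point error, so no linear classifier can recover the label from the erased embeddings better than a constant predictor. Confounders with several categories (multiple sources or languages) need the general matrix form of LEACE, which this sketch does not cover.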

Data

In addition, we provide all data used in our paper.

Citation

If you find our work helpful, please consider citing us:

@article{fan2025medium,
  title      =   {The Medium Is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure},
  author     =   {Fan, Yu and Tian, Yang and Ravfogel, Shauli and Sachan, Mrinmaya and Ash, Elliott and Hoyle, Alexander},
  journal    =   {arXiv preprint arXiv:2507.01234},
  year       =   {2025}
}
