Repository for the EMNLP 2025 paper "The Medium Is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure"
Embedding-based similarity metrics between text sequences can be affected not only by the content dimensions of interest but also by spurious attributes such as source or language. These document-level confounders pose challenges for many applications, particularly those combining texts from different corpora. We demonstrate that a debiasing algorithm removing information about observed confounders from encoder representations significantly improves similarity and clustering metrics across tasks without degrading out-of-distribution performance.
The current pipeline generates and saves plots for the following visualizations:
- K-means clusterings, both before and after LEACE erasure;
- PCA projections, before and after LEACE erasure;
- Top-k retrieval results, before and after erasure;
- Total number of exact pairs, before and after erasure.
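As a rough illustration of the erasure step behind the before/after comparisons listed above, here is a minimal NumPy sketch of the closed-form LEACE eraser applied to synthetic data. It is not the repository's code: the function names `fit_leace`, `erase`, and `cross_cov`, the tolerance `tol`, and the toy two-source dataset are all assumptions made for illustration.

```python
import numpy as np


def fit_leace(X, Z, tol=1e-10):
    """Fit a LEACE-style eraser; returns (mu, P) with erased = X - (X - mu) @ P.T."""
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    if Z.ndim == 1:
        Z = Z[:, None]
    n = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu
    Zc = Z - Z.mean(axis=0)
    sigma_xx = Xc.T @ Xc / n  # covariance of the embeddings
    sigma_xz = Xc.T @ Zc / n  # cross-covariance with the concept labels

    # Whitening matrix W = sigma_xx^(-1/2) and its pseudoinverse, via eigh.
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > tol * evals.max()
    inv_sqrt = np.where(keep, 1.0 / np.sqrt(np.where(keep, evals, 1.0)), 0.0)
    sqrt = np.sqrt(np.where(keep, evals, 0.0))
    W = (evecs * inv_sqrt) @ evecs.T
    W_pinv = (evecs * sqrt) @ evecs.T

    # Orthogonal projector onto the column space of W @ sigma_xz.
    U, s, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, s > tol * s.max()]
    P = W_pinv @ (U @ U.T) @ W
    return mu, P


def erase(X, mu, P):
    """Remove the linearly available concept information from X."""
    X = np.asarray(X, dtype=float)
    return X - (X - mu) @ P.T


def cross_cov(X, Z):
    """Largest absolute entry of the cross-covariance between X and Z."""
    return np.abs((X - X.mean(0)).T @ (Z - Z.mean(0)) / len(X)).max()


# Toy data (hypothetical): two "sources" that shift one embedding dimension.
rng = np.random.default_rng(0)
n, d = 400, 8
source = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, 0] += 3.0 * source
Z = np.eye(2)[source]  # one-hot encoding of the confounder

mu, P = fit_leace(X, Z)
X_erased = erase(X, mu, P)

print("cross-covariance before:", cross_cov(X, Z))
print("cross-covariance after: ", cross_cov(X_erased, Z))
```

After erasure, the cross-covariance between the embeddings and the source labels should vanish up to numerical precision, which for one-hot labels means the per-source mean embeddings coincide. The same pair of matrices (`X` and `X_erased`) could then be fed to clustering, PCA, or retrieval to produce before/after comparisons.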
In addition, we provide all data used in our paper.
If you find our work helpful, please consider citing us:
@article{fan2025medium,
title = {The Medium Is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure},
author = {Fan, Yu and Tian, Yang and Ravfogel, Shauli and Sachan, Mrinmaya and Ash, Elliott and Hoyle, Alexander},
journal = {arXiv preprint arXiv:2507.01234},
year = {2025}
}