GitHub - y-fn/deconfounding-embeddings: [EMNLP 2025] The Medium is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure

Repository for EMNLP 2025 Paper - The Medium is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure

Introduction

Embedding-based similarity metrics between text sequences can be affected not only by the content dimensions of interest but also by spurious attributes such as source or language. These document-level confounders pose challenges for many applications, particularly those combining texts from different corpora. We demonstrate that a debiasing algorithm removing information about observed confounders from encoder representations significantly improves similarity and clustering metrics across tasks without degrading out-of-distribution performance.

Erasure Pipeline

The current pipeline generates and saves plots for the following visualizations:

K-means clusterings, both before and after LEACE erasure;
PCA projections, before and after LEACE erasure;
Top-k retrieval results, before and after erasure;
Total number of exact pairs, before and after erasure.
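
To make the before/after comparison concrete, here is a minimal, dependency-free sketch of linear erasure of a binary confounder followed by a top-1 retrieval check. It is not the repository's pipeline and not full LEACE: it only projects out the direction between the two group means, which equalizes the group means but skips the whitening step LEACE uses to keep the edit as small as possible. The toy vectors, the `source` labels, and every function name below are assumptions made for illustration.

```python
from math import sqrt

# Illustrative stand-in for LEACE (assumption): removes only the
# mean-difference direction between two confounder groups.


def mean(vectors):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def erase_mean_difference(embeddings, labels):
    """Project every embedding onto the complement of (mu1 - mu0)."""
    mu0 = mean([e for e, z in zip(embeddings, labels) if z == 0])
    mu1 = mean([e for e, z in zip(embeddings, labels) if z == 1])
    diff = [a - b for a, b in zip(mu1, mu0)]
    norm = sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        # Group means already coincide, nothing to remove.
        return [list(e) for e in embeddings]
    unit = [x / norm for x in diff]
    erased = []
    for e in embeddings:
        coeff = sum(a * b for a, b in zip(e, unit))
        erased.append([a - coeff * b for a, b in zip(e, unit)])
    return erased


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def top1(embeddings, i):
    """Index of the document most similar to document i (excluding i)."""
    others = [j for j in range(len(embeddings)) if j != i]
    return max(others, key=lambda j: cosine(embeddings[i], embeddings[j]))


# Toy data (hypothetical): dimension 0 encodes the source, 1-2 the topic.
docs = [
    [3.0, 1.0, 0.0],   # source A, topic X
    [3.0, 0.0, 1.0],   # source A, topic Y
    [-3.0, 1.0, 0.0],  # source B, topic X
    [-3.0, 0.0, 1.0],  # source B, topic Y
]
source = [0, 0, 1, 1]

erased = erase_mean_difference(docs, source)

print("nearest to doc 0 before erasure:", top1(docs, 0))    # 1: same source
print("nearest to doc 0 after erasure:", top1(erased, 0))   # 2: same topic
```

In this toy setup, the nearest neighbor of document 0 switches from a same-source, different-topic document to the same-topic document from the other source, which is the kind of change the top-k retrieval plots are meant to surface.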

Data

In addition, we provide all data used in our paper.

Citation

If you find our work helpful, please consider citing us:

@article{fan2025medium,
  title      =   {The Medium Is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure},
  author     =   {Fan, Yu and Tian, Yang and Ravfogel, Shauli and Sachan, Mrinmaya and Ash, Elliott and Hoyle, Alexander},
  journal    =   {arXiv preprint arXiv:2507.01234},
  year       =   {2025}
}

Repository files

data
CITATION.bib
LICENSE
README.md
cluster_example.py
erasure_CAP.ipynb
style_erasure_pipeline.py
