This repo contains the code for the paper "GuRE:Generative Query REwriter for Legal Passage Retrieval" (Kim et al., 2025), which builds on the LePaRD dataset.
You can download the dataset from the original LePaRD repo.
Please cite the following paper if you use this work:
@inproceedings{kim-etal-2025-gure,
title = "{G}u{RE}:Generative Query {RE}writer for Legal Passage Retrieval",
author = "Kim, Daehui and
Kang, Deokhyung and
Kim, Jonghwi and
Ryu, Sangwon and
Lee, Gary",
editor = "Aletras, Nikolaos and
Chalkidis, Ilias and
Barrett, Leslie and
Goanț{\u{a}}, C{\u{a}}t{\u{a}}lina and
Preoțiuc-Pietro, Daniel and
Spanakis, Gerasimos",
booktitle = "Proceedings of the Natural Legal Language Processing Workshop 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.nllp-1.31/",
pages = "424--438",
ISBN = "979-8-89176-338-8",
abstract = {Legal Passage Retrieval (LPR) systems are crucial as they help practitioners save time when drafting legal arguments. However, it remains an underexplored avenue. One primary reason is the significant vocabulary mismatch between the query and the target passage. To address this, we propose a simple yet effective method, the $\textbf{G}$enerative q$\textbf{u}$ery $\textbf{RE}$writer $\textbf{(GuRE)}$. We leverage the generative capabilities of Large Language Models (LLMs) by training the LLM for query rewriting. $\textit{"Rewritten queries"}$ help retrievers to retrieve target passages by mitigating vocabulary mismatch. Experimental results show that GuRE significantly improves performance in a retriever-agnostic manner, outperforming all baseline methods. Further analysis reveals that different training objectives lead to distinct retrieval behaviors, making GuRE more suitable than direct retriever fine-tuning for real-world applications. Codes are available at github.com/daehuikim/GuRE.}
}

This codebase is built upon the original LePaRD repository. We extend our gratitude to Mahari et al. (2024) for providing the foundational dataset and baseline implementations.
This repository includes several enhancements and additional tools for working with the LePaRD dataset:
Generate Supervised Fine-Tuning (SFT) data from the LePaRD dataset for training language models.
Usage:
cd src/model
./run_make_sft_data.sh

This script processes the testset_top_10000.csv.gz file and generates SFT training data in JSONL format, saved as sft_data_test_10000.jsonl.
Features:
- Flexible data processing with configurable file paths
- Progress tracking during data generation
- Automatic output directory creation
- UTF-8 encoding support for proper text handling
- Comprehensive error handling and file validation
Output Format: Each line in the generated JSONL file contains:
{
"ift_sample": "<instruction>### Preceding Context: {context}\n\n###Legal Passage: {passage}",
"quote": "original quote text"
}

Train a language model using Supervised Fine-Tuning (SFT) with LoRA (Low-Rank Adaptation) on the LePaRD dataset.
Usage:
cd src/model
./run_sft_training.sh

This script uses Saul-7B-Base as the baseline model and trains it on the generated SFT data.
Features:
- LoRA training for efficient fine-tuning
- 4-bit quantization for memory efficiency
- Flash Attention 2 for faster training
- Automatic evaluation during training
- BLEU score computation for text generation quality
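For orientation, below is a minimal sketch of what such a LoRA + 4-bit setup looks like with transformers and peft. The Hugging Face model identifier, target modules, and LoRA hyperparameters are illustrative assumptions, not the exact configuration used by run_sft_training.sh.

```python
# Illustrative sketch of the LoRA + 4-bit setup; see run_sft_training.sh for the real configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "Equall/Saul-7B-Base"  # Hugging Face identifier assumed for Saul-7B-Base

# 4-bit quantization keeps the 7B base model within a single-GPU memory budget.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Only small low-rank adapter matrices are trained; the quantized base model stays frozen.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
))
model.print_trainable_parameters()

# From here, the "ift_sample" strings in sft_data_test_10000.jsonl are tokenized and fed to
# a standard causal-LM fine-tuning loop; BLEU on held-out generations tracks output quality.
```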
Merge the trained LoRA weights with the base model to create a standalone model.
Usage:
cd src/model
./run_merge_lora.sh [lora_model_path] [output_path]

Examples:
# Use default paths
./run_merge_lora.sh
# Specify custom paths
./run_merge_lora.sh ./my_lora_model ./my_merged_model

Features:
- Automatic device mapping for optimal memory usage
- Tokenizer preservation
- Comprehensive error handling
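Conceptually, the merge step folds the low-rank adapter updates back into the base weights. A minimal sketch with peft is shown below; the base model identifier and paths are placeholders, not the script's defaults.

```python
# Illustrative LoRA-merge sketch with peft; paths are placeholders, not the script defaults.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "Equall/Saul-7B-Base"   # assumed Hugging Face identifier for the base model
lora_path = "./my_lora_model"
output_path = "./my_merged_model"

# device_map="auto" lets accelerate spread layers over the available devices.
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, lora_path)

# Fold the low-rank updates into the base weights and drop the adapter wrappers.
merged = model.merge_and_unload()
merged.save_pretrained(output_path)

# Save the tokenizer alongside the merged weights so the output is a standalone model.
AutoTokenizer.from_pretrained(base_model).save_pretrained(output_path)
```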
Generate legal passages using the trained model for inference on test data.
Usage:
cd src/model
./run_gure_inference.sh [model_path] [input_csv] [output_csv]

Examples:
# Use default paths
./run_gure_inference.sh
# Specify custom paths
./run_gure_inference.sh ./my_model ../data/my_input.csv ../data/my_output.csv

Features:
- vLLM-based efficient inference
- Tensor parallelism support
- Configurable sampling parameters
- Automatic output directory creation
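For orientation, here is a minimal sketch of a vLLM generation loop over the input CSV. The prompt template only roughly mirrors the SFT format shown above, and the sampling parameters and the output column name are illustrative assumptions rather than the script's exact settings.

```python
# Illustrative vLLM inference sketch; sampling parameters and the output column are placeholders.
import pandas as pd
from vllm import LLM, SamplingParams

model_path = "./my_merged_model"
df = pd.read_csv("../data/my_input.csv")

# One prompt per preceding context, roughly mirroring the SFT prompt format shown above.
prompts = [
    f"### Preceding Context: {ctx}\n\n###Legal Passage:"
    for ctx in df["destination_context"]
]

# tensor_parallel_size > 1 shards the model across multiple GPUs.
llm = LLM(model=model_path, tensor_parallel_size=1)
params = SamplingParams(temperature=0.0, max_tokens=256)

outputs = llm.generate(prompts, params)
df["generated_passage"] = [out.outputs[0].text.strip() for out in outputs]
df.to_csv("../data/my_output.csv", index=False)
```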
Generate legal passages using OpenAI models with few-shot learning approach.
Usage:
cd src/model
./run_q2d_script.sh [config_type] [openai_key_path]

Examples:
# Basic Q2D processing
./run_q2d_script.sh testset_q2d key.txt
# Q2D with Chain-of-Thought
./run_q2d_script.sh testset_q2dcot key.txt

Features:
- OpenAI API integration with retry logic
- Parallel processing for efficiency
- Few-shot learning with BM25 retrieval
- Support for different configuration types
- Automatic output directory creation
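Conceptually, Q2D prompts an OpenAI model to expand the preceding context into a passage-like rewrite before retrieval. A rough sketch of a single call with simple retry logic is below; the model name, prompt wording, and back-off scheme are illustrative, not the script's configuration.

```python
# Illustrative Q2D sketch with the OpenAI client; model, prompt, and retry policy are placeholders.
import time
from openai import OpenAI

client = OpenAI(api_key=open("key.txt").read().strip())

def rewrite_query(context: str, few_shot_block: str, retries: int = 3) -> str:
    prompt = (
        "Write the legal passage that is most likely quoted after the given context.\n\n"
        f"{few_shot_block}\n"
        f"### Preceding Context: {context}\n###Legal Passage:"
    )
    for attempt in range(retries):
        try:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.0,
            )
            return resp.choices[0].message.content
        except Exception:
            time.sleep(2 ** attempt)  # back off and retry on transient API errors
    raise RuntimeError("OpenAI request failed after retries")
```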
LePaRD is a massive collection of U.S. federal judicial citations to precedent in context. LePaRD builds on millions of expert decisions by extracting quotations to precedents from judicial opinions along with the preceding context. Each row of the dataset corresponds to a quotation to prior case law used in a certain context.
- passage_id: A unique identifier for each passage
- destination_context: The preceding context before the quotation
- quote: The text of the passage that was quoted
- court: The court from which the passage originated
- date: The date when the opinion from which the passage originated was published
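Each split ships as a gzipped CSV with these columns, so a split such as the testset_top_10000.csv.gz file used above can be inspected directly with pandas (assuming the files are stored under data/ as described below):

```python
# Quick inspection of one LePaRD split; assumes the files live under data/ as described below.
import pandas as pd

df = pd.read_csv("data/testset_top_10000.csv.gz")
print(df.columns.tolist())  # passage_id, destination_context, quote, court, date
print(df[["destination_context", "quote"]].head())
```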
Contact Robert Mahari with any questions.
The original data can be downloaded here: https://www.dropbox.com/scl/fo/0pgqxcz0h2l4wta8yyvb3/ABjE8bNAnq3Vm2bBJziclPE?rlkey=zipkfcso0h9je1xne737ims02&st=6mgtpwa0&dl=0

To run the replication package, make sure to store all files in a folder called data.
Requires:
- For BM25: Anserini
- For dense retrieval: SBERT and Faiss
- For classification experiments: transformers
For example, the following should work:
conda create --name lepard python=3.10
conda activate lepard
pip install -r requirements.txt
First, split the data into train, dev and test. The output of this process can also be downloaded here.
python src/model/prepare_data.py

# reformat input files
python src/model/bm25_pipeline.py
# run anserini and bm25 retrieval
path_anserini="/path/to/anserini"
num_labels="10000" # change this to 20000 / 50000 for other experiments
# build index
sh $path_anserini/target/appassembler/bin/IndexCollection -threads 1 -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator -input bm25-files-$num_labels \
-index indexes/index-lepard-passages-$num_labels -storePositions -storeDocvectors -storeRaw
# retrieve passages devset
sh $path_anserini/target/appassembler/bin/SearchMsmarco -hits 10 -threads 1 \
-index indexes/index-lepard-passages-$num_labels \
-queries bm25-files-$num_labels/bm25_input_dev_$num_labels".tsv" \
-output bm25-files-$num_labels/bm25_output_dev.tsv
# retrieve passages testset
sh $path_anserini/target/appassembler/bin/SearchMsmarco -hits 10 -threads 1 \
-index indexes/index-lepard-passages-$num_labels \
-queries bm25-files-$num_labels/bm25_input_test_$num_labels".tsv" \
-output bm25-files-$num_labels/bm25_output_test.tsv
# evaluate
python src/model/evaluate_run.py --dev_predictions bm25-files-$num_labels/bm25_output_dev.tsv --test_predictions bm25-files-$num_labels/bm25_output_test.tsv --experiment bm25
num_labels="10000" # change this to 20000 / 50000 for other experiments
model_name="distilbert-base-uncased"
python src/model/train_classification_models.py --n_labels $num_labels --model_name $model_name # trains default distilbert models and saves model and predictions in folder "finetuned-$model_name-$num_labels"
# evaluate
python src/model/evaluate_run.py --dev_predictions finetuned-distilbert-base-uncased-10000/predictions_devset_10000.json --test_predictions finetuned-distilbert-base-uncased-10000/predictions_testset_10000.json
# for all experiments, change num_labels to 20000 and 50000, and also run with legalbert
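For orientation, the classification baseline treats every candidate passage as a class label and fine-tunes an encoder to predict the quoted passage from the preceding context. The sketch below illustrates the idea with transformers; the training-split file name, sequence length, and other hyperparameters are assumptions, not the settings of train_classification_models.py.

```python
# Illustrative passage-classification sketch; not the exact train_classification_models.py script.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

num_labels = 10000
model_name = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

# Each unique passage_id becomes one class; the preceding context is the input text.
train_df = pd.read_csv("data/train_top_10000.csv.gz")  # assumed file name
label2id = {pid: i for i, pid in enumerate(sorted(train_df["passage_id"].unique()))}

def preprocess(batch):
    enc = tokenizer(batch["destination_context"], truncation=True,
                    padding="max_length", max_length=256)
    enc["labels"] = [label2id[p] for p in batch["passage_id"]]
    return enc

train_ds = Dataset.from_pandas(train_df)
train_ds = train_ds.map(preprocess, batched=True, remove_columns=train_ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-distilbert-base-uncased-10000"),
    train_dataset=train_ds,
)
trainer.train()
```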
# zero-shot
num_labels="10000" # change this to 20000 / 50000 for other experiments
model_name="sentence-transformers/all-mpnet-base-v2"
python src/model/run_inference_sbert.py --model_name $model_name --n_labels $num_labels # creates folder "predictions-sbert" and saves output there (os.path.basename(model_name) + predictions dev/test)
# evaluate
python src/model/evaluate_run.py --dev_predictions predictions-sbert/predictions_devset_all-mpnet-base-v2_$num_labels.json --test_predictions predictions-sbert/predictions_testset_all-mpnet-base-v2_$num_labels.json
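What run_inference_sbert.py does, conceptually: encode the candidate passages and the queries with the SBERT model, then search by similarity. A minimal sketch with sentence-transformers and Faiss follows; the passage-file name, index type, and top-k are illustrative assumptions, not the script's exact setup.

```python
# Illustrative dense-retrieval sketch with SBERT + Faiss; not the exact run_inference_sbert.py script.
import faiss
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

passages = pd.read_csv("data/passages_top_10000.csv.gz")  # assumed file name for the passage pool
queries = pd.read_csv("data/testset_top_10000.csv.gz")

# Normalized embeddings + inner-product search is equivalent to cosine similarity.
p_emb = model.encode(passages["quote"].tolist(), normalize_embeddings=True, show_progress_bar=True)
q_emb = model.encode(queries["destination_context"].tolist(), normalize_embeddings=True)

index = faiss.IndexFlatIP(p_emb.shape[1])
index.add(p_emb)

scores, ids = index.search(q_emb, 10)  # top-10 passage candidates per query
top1_passage_ids = passages.iloc[ids[:, 0]]["passage_id"].tolist()
```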
# fine-tune
num_labels="10000" # change this to 20000 / 50000 for other experiments
python src/model/finetune_sbert.py --n_labels $num_labels # saves model in "sbert-finetuned-MultipleNegativesRankingLoss" + os.path.basename(args.model_name) + "-" + args.n_labels
model_name="sbert-finetuned-MultipleNegativesRankingLossall-distilroberta-v1-10000"
# run inference
python src/model/run_inference_sbert.py --model_name $model_name --n_labels $num_labels # creates folder "predictions-sbert" and saves output there (os.path.basename(model_name) + predictions dev/test)
# evaluate
python src/model/evaluate_run.py --dev_predictions predictions-sbert/predictions_devset_${model_name}_${num_labels}.json --test_predictions predictions-sbert/predictions_testset_${model_name}_${num_labels}.json
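For orientation, finetune_sbert.py relies on MultipleNegativesRankingLoss, which treats each (preceding context, quoted passage) pair as a positive and the other passages in the batch as negatives. A minimal sketch with sentence-transformers is below; the training-split file name, batch size, and epoch count are illustrative assumptions.

```python
# Illustrative SBERT fine-tuning sketch with in-batch negatives; not the exact finetune_sbert.py script.
import pandas as pd
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-distilroberta-v1")

train_df = pd.read_csv("data/train_top_10000.csv.gz")  # assumed file name
train_examples = [
    InputExample(texts=[row.destination_context, row.quote])
    for row in train_df.itertuples()
]

# MultipleNegativesRankingLoss uses the other passages in each batch as negatives.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
model.save("sbert-finetuned-MultipleNegativesRankingLossall-distilroberta-v1-10000")
```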
The training data may be generated from the Case Law Access Project dataset using the precedent_data_extraction.py script.
The results of the paper "GuRE:Generative Query REwriter for Legal Passage Retrieval" are in results/.