GuRE:Generative Query REwriter for Legal Passage Retrieval

This repo contains the code for the paper "GuRE:Generative Query REwriter for Legal Passage Retrieval" (Kim et al., 2025), which builds on the LePaRD dataset.

Dataset

You can download the dataset from the original LePaRD repo.

Reference

Please cite the following paper if you use GuRE:

@inproceedings{kim-etal-2025-gure,
    title = "{G}u{RE}:Generative Query {RE}writer for Legal Passage Retrieval",
    author = "Kim, Daehui  and
      Kang, Deokhyung  and
      Kim, Jonghwi  and
      Ryu, Sangwon  and
      Lee, Gary",
    editor = "Aletras, Nikolaos  and
      Chalkidis, Ilias  and
      Barrett, Leslie  and
      Goanț{\u{a}}, C{\u{a}}t{\u{a}}lina  and
      Preoțiuc-Pietro, Daniel  and
      Spanakis, Gerasimos",
    booktitle = "Proceedings of the Natural Legal Language Processing Workshop 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.nllp-1.31/",
    pages = "424--438",
    ISBN = "979-8-89176-338-8",
    abstract = {Legal Passage Retrieval (LPR) systems are crucial as they help practitioners save time when drafting legal arguments. However, it remains an underexplored avenue. One primary reason is the significant vocabulary mismatch between the query and the target passage. To address this, we propose a simple yet effective method, the $\textbf{G}$enerative q$\textbf{u}$ery $\textbf{RE}$writer $\textbf{(GuRE)}$. We leverage the generative capabilities of Large Language Models (LLMs) by training the LLM for query rewriting. $\textit{"Rewritten queries"}$ help retrievers to retrieve target passages by mitigating vocabulary mismatch. Experimental results show that GuRE significantly improves performance in a retriever-agnostic manner, outperforming all baseline methods. Further analysis reveals that different training objectives lead to distinct retrieval behaviors, making GuRE more suitable than direct retriever fine-tuning for real-world applications. Codes are available at github.com/daehuikim/GuRE.}
}

Acknowledgement

This codebase is built upon the original LePaRD repository. We extend our gratitude to Mahari et al. (2024) for providing the foundational dataset and baseline implementations.

Additional Features

This repository includes several enhancements and additional tools for working with the LePaRD dataset:

SFT Data Generation

Generate Supervised Fine-Tuning (SFT) data from the LePaRD dataset for training language models.

Usage:

cd src/model
./run_make_sft_data.sh

This script processes the testset_top_10000.csv.gz file and generates SFT training data in JSONL format, saved as sft_data_test_10000.jsonl.

Features:

  • Flexible data processing with configurable file paths
  • Progress tracking during data generation
  • Automatic output directory creation
  • UTF-8 encoding support for proper text handling
  • Comprehensive error handling and file validation

Output Format: Each line in the generated JSONL file contains:

{
  "ift_sample": "<instruction>### Preceding Context: {context}\n\n###Legal Passage: {passage}",
  "quote": "original quote text"
}
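For quick inspection, the generated file can be loaded with a few lines of Python. This is a minimal sketch assuming the field names shown above; it is not part of the repository's scripts:

import json

# Load the generated SFT samples (field names as in the format above).
samples = []
with open("sft_data_test_10000.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        samples.append((record["ift_sample"], record["quote"]))

print(f"Loaded {len(samples)} SFT samples")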

SFT Training with LoRA

Train a language model using Supervised Fine-Tuning (SFT) with LoRA (Low-Rank Adaptation) on the LePaRD dataset.

Usage:

cd src/model
./run_sft_training.sh

This script uses Saul-7B-Base as the base model and trains it on the generated SFT data; a minimal configuration sketch follows the feature list below.

Features:

  • LoRA training for efficient fine-tuning
  • 4-bit quantization for memory efficiency
  • Flash Attention 2 for faster training
  • Automatic evaluation during training
  • BLEU score computation for text generation quality
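The sketch below shows how a 4-bit, LoRA-based setup like the one described above can be assembled with transformers and peft. The Hugging Face model ID and all hyperparameters (rank, alpha, target modules) are illustrative assumptions; run_sft_training.sh holds the actual configuration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "Equall/Saul-7B-Base"  # assumed Hugging Face ID for Saul-7B-Base

# 4-bit quantization for memory efficiency, Flash Attention 2 for speed.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LoRA adapter; rank and target modules are illustrative values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable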

LoRA Model Merging

Merge the trained LoRA weights with the base model to create a standalone model; a short merge sketch follows the feature list below.

Usage:

cd src/model
./run_merge_lora.sh [lora_model_path] [output_path]

Examples:

# Use default paths
./run_merge_lora.sh

# Specify custom paths
./run_merge_lora.sh ./my_lora_model ./my_merged_model

Features:

  • Automatic device mapping for optimal memory usage
  • Tokenizer preservation
  • Comprehensive error handling
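In terms of the underlying libraries, the merge amounts to loading the adapter on top of the base model and folding it in. A minimal sketch, assuming the same base model ID as above and the example paths from the usage section:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, attach the LoRA adapter, and fold it into the weights.
base = AutoModelForCausalLM.from_pretrained("Equall/Saul-7B-Base", device_map="auto")
merged = PeftModel.from_pretrained(base, "./my_lora_model").merge_and_unload()

merged.save_pretrained("./my_merged_model")
# Save the tokenizer alongside so the merged model is self-contained.
AutoTokenizer.from_pretrained("Equall/Saul-7B-Base").save_pretrained("./my_merged_model")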

GuRE Inference

Generate legal passages using the trained model for inference on test data; a minimal vLLM sketch follows the feature list below.

Usage:

cd src/model
./run_gure_inference.sh [model_path] [input_csv] [output_csv]

Examples:

# Use default paths
./run_gure_inference.sh

# Specify custom paths
./run_gure_inference.sh ./my_model ../data/my_input.csv ../data/my_output.csv

Features:

  • vLLM-based efficient inference
  • Tensor parallelism support
  • Configurable sampling parameters
  • Automatic output directory creation
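A minimal sketch of the vLLM call, assuming the merged model path from the previous step; the prompt and sampling values are illustrative, not the script's exact settings:

from vllm import LLM, SamplingParams

llm = LLM(model="./my_merged_model", tensor_parallel_size=1)
params = SamplingParams(temperature=0.0, max_tokens=256)

# Prompts follow the SFT template shown earlier; this one is illustrative.
prompts = ["### Preceding Context: ...\n\n###Legal Passage:"]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)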

Q2D (Query-to-Document) Processing

Generate legal passages using OpenAI models with a few-shot learning approach; a minimal request sketch follows the feature list below.

Usage:

cd src/model
./run_q2d_script.sh [config_type] [openai_key_path]

Examples:

# Basic Q2D processing
./run_q2d_script.sh testset_q2d key.txt

# Q2D with Chain-of-Thought
./run_q2d_script.sh testset_q2dcot key.txt

Features:

  • OpenAI API integration with retry logic
  • Parallel processing for efficiency
  • Few-shot learning with BM25 retrieval
  • Support for different configuration types
  • Automatic output directory creation
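The sketch below shows the shape of a single Q2D request with simple retry logic. The model name, prompt wording, and backoff schedule are illustrative assumptions, not the script's exact configuration:

import time
from openai import OpenAI

client = OpenAI(api_key=open("key.txt").read().strip())

def q2d(query: str, retries: int = 3) -> str:
    # Illustrative prompt; the script builds few-shot prompts with BM25-retrieved examples.
    prompt = f"Write the legal passage this context is likely quoting:\n\n{query}"
    for attempt in range(retries):
        try:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # assumed model name
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except Exception:
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError("Q2D request failed after retries")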

Description

LePaRD is a massive collection of U.S. federal judicial citations to precedent in context. LePaRD builds on millions of expert decisions by extracting quotations of precedents from judicial opinions along with the preceding context. Each row of the dataset corresponds to a quotation of prior case law used in a certain context; the fields are listed next, followed by a short loading sketch.

  • passage_id: A unique identifier for each passage
  • destination_context: The preceding context before the quotation
  • quote: The text of the passage that was quoted
  • court: The court from which the passage originated
  • date: The date when the opinion from which the passage originated was published
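A minimal pandas sketch for inspecting these fields, assuming the files are stored in data/ as described below and share the column names above:

import pandas as pd

# Compression is inferred from the .gz extension.
df = pd.read_csv("data/testset_top_10000.csv.gz")
print(df[["passage_id", "destination_context", "quote", "court", "date"]].head())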

Contact Robert Mahari with any questions.

Data

The original data can be downloaded here: https://www.dropbox.com/scl/fo/0pgqxcz0h2l4wta8yyvb3/ABjE8bNAnq3Vm2bBJziclPE?rlkey=zipkfcso0h9je1xne737ims02&st=6mgtpwa0&dl=0

To run the replication package, make sure to store all files in a folder called data.

Installation

Requires Python 3.10 and the packages listed in requirements.txt. For example, the following should work:

conda create --name lepard python=3.10
conda activate lepard
pip install -r requirements.txt

Experiments

First, split the data into train, dev and test. The output of this process can also be downloaded here.

python src/model/prepare_data.py

BM25 experiments

# reformat input files
python src/model/bm25_pipeline.py

# run anserini and bm25 retrieval
path_anserini="/path/to/anserini"
num_labels="10000" # change this to 20000 / 50000 for other experiments

# build index
sh $path_anserini/target/appassembler/bin/IndexCollection -threads 1 -collection JsonCollection \
 -generator DefaultLuceneDocumentGenerator -input bm25-files-$num_labels \
 -index indexes/index-lepard-passages-$num_labels -storePositions -storeDocvectors -storeRaw 

# retrieve passages devset
sh $path_anserini/target/appassembler/bin/SearchMsmarco -hits 10 -threads 1 \
 -index indexes/index-lepard-passages-$num_labels \
 -queries bm25-files-$num_labels/bm25_input_dev_$num_labels".tsv" \
 -output bm25-files-$num_labels/bm25_output_dev.tsv

# retrieve passages testset
sh $path_anserini/target/appassembler/bin/SearchMsmarco -hits 10 -threads 1 \
 -index indexes/index-lepard-passages-$num_labels \
 -queries bm25-files-$num_labels/bm25_input_test_$num_labels".tsv" \
 -output bm25-files-$num_labels/bm25_output_test.tsv

# evaluate
python src/model/evaluate_run.py --dev_predictions bm25-files-$num_labels/bm25_output_dev.tsv --test_predictions bm25-files-$num_labels/bm25_output_test.tsv --experiment bm25

Classification experiments

num_labels="10000" # change this to 20000 / 50000 for other experiments
model_name="distilbert-base-uncased"
python src/model/train_classification_models.py --n_labels $num_labels --model_name $model_name # trains default distilbert models and saves model and predictions in folder "finetuned-$model_name-$num_labels"
# evaluate
python src/model/evaluate_run.py --dev_predictions finetuned-distilbert-base-uncased-10000/predictions_devset_10000.json --test_predictions finetuned-distilbert-base-uncased-10000/predictions_testset_10000.json 
# for all experiments, change num_labels to 20000 and 50000, and also run with legalbert

SBERT experiments

# zero-shot
num_labels="10000" # change this to 20000 / 50000 for other experiments
model_name="sentence-transformers/all-mpnet-base-v2"
python src/model/run_inference_sbert.py --model_name $model_name --n_labels $num_labels # creates folder "predictions-sbert" and saves output there (os.path.basename(model_name) + predictions dev/test)
# evaluate
python src/model/evaluate_run.py --dev_predictions predictions-sbert/predictions_devset_all-mpnet-base-v2_$num_labels.json --test_predictions predictions-sbert/predictions_testset_all-mpnet-base-v2_$num_labels.json
# fine-tune
num_labels="10000" # change this to 20000 / 50000 for other experiments
python src/model/finetune_sbert.py --n_labels $num_labels # saves model in "sbert-finetuned-MultipleNegativesRankingLoss" + os.path.basename(args.model_name) + "-" + args.n_labels
model_name="sbert-finetuned-MultipleNegativesRankingLossall-distilroberta-v1-10000"
# run inference
python src/model/run_inference_sbert.py --model_name $model_name --n_labels $num_labels # creates folder "predictions-sbert" and saves output there (os.path.basename(model_name) + predictions dev/test)
# evaluate
python src/model/evaluate_run.py --dev_predictions predictions-sbert/predictions_devset_${model_name}_${num_labels}.json --test_predictions predictions-sbert/predictions_testset_${model_name}_${num_labels}.json

Data replication

The training data may be generated from the Case Law Access Project dataset using the precedent_data_extraction.py script.

The results of the paper "GuRE:Generative Query REwriter for Legal Passage Retrieval" are in results/.
