This repository provides an implementation of the Pairwise Ranking Prompting (PRP) technique from the paper "Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting".
We provide implementations of the three PRP-based ranking strategies highlighted in the paper:
all_pair
- We enumerate all pairs and perform a global aggregation to generate a score for each document.
- Intuitively, if the LLM consistently prefers `Document-A` over `Document-B`, `Document-A` gets one point.
- When the LLM is unsure and produces conflicting or irrelevant results (for the generation API), each document gets half a point. There may be ties in the aggregated scores, in which case we fall back to the initial ranking.
- `PRP-allpair` is highly insensitive to input ordering. It essentially ranks documents by win ratio.
- The clear drawback is its costly $O(N^2)$ calls to LLM APIs, where $N$ is the number of documents to be ranked for each query.

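The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository's code: `prefer` is a hypothetical stand-in for the LLM call, assumed to return `"A"`, `"B"`, or any other string for an unusable answer, and `overlap_prefer` is a toy word-overlap comparator that exists only to make the example runnable.

```python
from itertools import combinations
from typing import Callable, List

# Hypothetical comparator: returns "A" or "B" for the preferred document,
# or any other string when the answer is unusable.
Prefer = Callable[[str, str, str], str]


def all_pair_rank(query: str, docs: List[str], prefer: Prefer) -> List[int]:
    """Return document indices ordered by aggregated pairwise score."""
    scores = [0.0] * len(docs)
    for i, j in combinations(range(len(docs)), 2):
        # Ask twice with the order swapped to detect position bias.
        first = prefer(query, docs[i], docs[j])
        second = prefer(query, docs[j], docs[i])
        if first == "A" and second == "B":
            scores[i] += 1.0  # consistent preference for docs[i]
        elif first == "B" and second == "A":
            scores[j] += 1.0  # consistent preference for docs[j]
        else:
            scores[i] += 0.5  # conflicting or unusable answers
            scores[j] += 0.5
    # sorted() is stable, so tied documents keep their initial order.
    return sorted(range(len(docs)), key=lambda idx: -scores[idx])


def overlap_prefer(query: str, doc_a: str, doc_b: str) -> str:
    """Toy stand-in for the LLM: prefer the document sharing more query words."""
    words = set(query.lower().split())
    a = len(words & set(doc_a.lower().split()))
    b = len(words & set(doc_b.lower().split()))
    if a == b:
        return "unsure"
    return "A" if a > b else "B"


docs = [
    "the moon affects tides",
    "regular exercise benefits the heart",
    "exercise is good",
]
ranking = all_pair_rank("benefits of regular exercise", docs, overlap_prefer)
print(ranking)  # [1, 2, 0]
```

Because every pair is queried in both orders, this sketch makes $N(N-1)$ comparator calls, which is where the quadratic cost comes from.
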
heapsort
- We use the pairwise preferences from the LLM as a comparator with HeapSort.
- `PRP-heapsort` has lower computational complexity than `PRP-allpair` while also being largely insensitive to input order.
- This method makes $O(N \log N)$ calls to LLM APIs.

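As a rough sketch of how a pairwise preference can drive HeapSort, the function below sorts items best-first using nothing but a two-argument comparator. The name `heapsort_rank` and the `better(a, b)` callback are hypothetical; in practice `better` would wrap an LLM call, whereas here a numeric comparison stands in for it.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def heapsort_rank(items: List[T], better: Callable[[T, T], bool]) -> List[T]:
    """Sort items best-first with heapsort, using only pairwise comparisons."""
    heap = list(items)
    n = len(heap)

    def sift_down(root: int, end: int) -> None:
        # Keep the worst item at the root of heap[root:end].
        while True:
            child = 2 * root + 1
            if child >= end:
                return
            # Pick the worse of the two children.
            if child + 1 < end and better(heap[child], heap[child + 1]):
                child += 1
            # If the root beats that child, the child must move up.
            if better(heap[root], heap[child]):
                heap[root], heap[child] = heap[child], heap[root]
                root = child
            else:
                return

    # Build a heap whose root is the worst item.
    for start in range(n // 2 - 1, -1, -1):
        sift_down(start, n)
    # Repeatedly move the current worst item to the end of the list.
    for end in range(n - 1, 0, -1):
        heap[0], heap[end] = heap[end], heap[0]
        sift_down(0, end)
    return heap


ranked = heapsort_rank([3, 1, 4, 1, 5, 9, 2, 6], lambda a, b: a > b)
print(ranked)  # [9, 6, 5, 4, 3, 2, 1, 1]
```

Each `sift_down` follows a single root-to-leaf path, so the number of comparator calls grows as $O(N \log N)$ rather than quadratically.
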
sliding_k
- We use a sliding window that starts at the bottom of the initial ranking, compares pairs of documents, and swaps document pairs with a stride of 1. One sliding window pass is similar to one pass of BubbleSort.
- Since we want to optimize for the `top-k` results, we perform the sliding window operation `k` times.
- Each pass makes $O(N)$ LLM calls, so the total cost is $O(kN)$.
- `PRP-sliding_k` has favorable time complexity but depends heavily on the input order.

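The sliding window passes described above can be sketched as follows. This is an illustrative version only: `sliding_k_rank` and the `better(a, b)` callback are hypothetical names, and a numeric comparison stands in for the LLM preference.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def sliding_k_rank(items: List[T], better: Callable[[T, T], bool], k: int) -> List[T]:
    """Run k bottom-to-top bubble passes over the initial ranking."""
    ranked = list(items)
    for _ in range(k):
        # One pass: walk from the bottom of the list to the top.
        for i in range(len(ranked) - 1, 0, -1):
            # Swap when the lower-ranked item beats the one above it.
            if better(ranked[i], ranked[i - 1]):
                ranked[i], ranked[i - 1] = ranked[i - 1], ranked[i]
    return ranked


ranked = sliding_k_rank([3, 1, 4, 1, 5, 9, 2, 6], lambda a, b: a > b, k=2)
print(ranked[:2])  # [9, 6]
```

With a consistent comparator, each pass lifts the best remaining item into place, so only the first `k` positions are guaranteed to be correct after `k` passes; the rest of the list stays close to the initial order.
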
Key Features of the PairwiseRankingPrompting implementation:
- Structured Generation with Pydantic Validation: The rankers combine structured generation with Pydantic validation so that model outputs stay well-formed, which supports reliable zero-shot ranking even on smaller LLMs.
- Efficient Sorting Algorithms: The `PairwisePromptingRanker` uses efficient sorting methods (Heapsort and Bubblesort) to reduce the number of LLM calls and speed up inference.
- Evaluation Toolkit: We provide a custom Evaluator and Dataloader for evaluating rankers on standard metrics (NDCG, MAP, Recall, Precision) at various cutoffs. The Dataloader loads and processes datasets using the `ir_datasets` library.

```python
from prp import PairwiseRankingPrompting

# Initialize PRP-based reranker
reranker = PairwiseRankingPrompting(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    api_key="API_KEY",
    base_url="BASE_URL"
)

query = "What are the benefits of regular exercise?"
documents = [
    "Engaging in regular exercise enhances cardiovascular fitness and helps regulate blood pressure.",
    "The Moon is Earth's only natural satellite and affects tides on the planet.",
    "Going to the gym regularly can help build muscle strength and improve mental health.",
]

# Use sliding_k to pick the top 2 items
results_sliding_k = reranker.rerank(
    query, documents, method="sliding_k", top_k=2
)

# Rank all documents with heapsort
results_heapsort = reranker.rerank(
    query, documents, method="heapsort"
)

# Rank all documents with all_pair
results_allpairs = reranker.rerank(
    query, documents, method="all_pair"
)
```

We evaluated the PairwiseRankingPrompting ranker using pipelines built with the Haystack framework.
The evaluation was conducted on the following datasets: FiQA, SciFact, NFCorpus, TREC-19, and TREC-20.
The Mistral, Phi-3, and Llama-3 models were used with the PairwiseRankingPrompting ranker.
The evaluation pipelines can be found in the pipelines directory.
Evaluation Results:
We report the NDCG@10 scores for each dataset and method in the table below:
| Model | Ranker | FiQA | SciFact | NFCorpus | TREC-19 | TREC-20 |
|---|---|---|---|---|---|---|
| Mistral | PRP-sliding_k | 0.4664 | 0.6847 | 0.4261 | 0.7062 | 0.6860 |
| Mistral | PRP-heapsort | 0.4672 | 0.6860 | 0.4311 | 0.7134 | 0.6875 |
| Mistral | PRP-allpair | 0.4676 | 0.6860 | 0.4312 | 0.7186 | 0.6987 |
| Phi-3 | PRP-sliding_k | 0.4704 | 0.6980 | 0.4365 | 0.7202 | 0.7140 |
| Phi-3 | PRP-heapsort | 0.4712 | 0.6990 | 0.4385 | 0.7226 | 0.7154 |
| Phi-3 | PRP-allpair | 0.4714 | 0.7028 | 0.4386 | 0.7228 | 0.7167 |
| Llama-3 | PRP-heapsort | 0.4764 | 0.7765 | 0.4423 | 0.7508 | 0.7637 |
| Llama-3 | PRP-sliding_k | 0.4793 | 0.7852 | 0.4503 | 0.7511 | 0.7642 |
| Llama-3 | PRP-allpair | 0.4992 | 0.7912 | 0.4658 | 0.7623 | 0.7671 |
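- Within each model, `PRP-allpair` achieved the highest NDCG@10 on every dataset, tying with `PRP-heapsort` only for Mistral on SciFact.
- `PRP-sliding_k` and `PRP-heapsort` perform similarly across all datasets, differing by less than 0.01 NDCG@10 in every case.
- `PRP-allpair` with the Llama-3 model performed the best overall, with the highest score on every dataset.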
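
For readers unfamiliar with the metric reported above, the following is a minimal sketch of NDCG@k for a single query. It assumes the common linear-gain formulation, where each relevance label is divided by $\log_2(\text{rank} + 1)$; the repository's Evaluator may use a different variant (for example, exponential gain), and the function names and the `qrels` dictionary shape here are illustrative assumptions.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    # enumerate() is 0-based, so position 1 gets a discount of log2(2) = 1.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: List[str], qrels: Dict[str, float], k: int = 10) -> float:
    """NDCG@k for one query; qrels maps document id to relevance label."""
    gains = [qrels.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    # A query with no relevant documents scores 0 by convention here.
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


qrels = {"d1": 2.0, "d2": 1.0}
perfect = ndcg_at_k(["d1", "d2", "d3"], qrels)
reversed_order = ndcg_at_k(["d3", "d2", "d1"], qrels)
print(perfect, reversed_order)
```

A dataset-level score such as those in the table is typically the mean of this per-query value over all evaluated queries.
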