Benchmarking tool for evaluating text summarization methods on scientific papers.
This project uses uv for package management.
- Clone this repository
- Install dependencies
  ```bash
  cd llm_summarization_benchmark
  uv sync
  uv run spacy download en_core_web_sm
  ```
- Install AlignScore-large
  ```bash
  mkdir -p Output/llm_summarization_benchmark
  cd Output/llm_summarization_benchmark
  wget https://huggingface.co/yzha/AlignScore/resolve/main/AlignScore-large.ckpt
  cd ../..
  ```
- Copy `Resources/example.env` to `Resources/.env` and adjust
- Run
  ```bash
  uv run benchmark
  ```
Hint: Individual LLM config parameters are stored in `/src/llm_apis/config.py`.
The following files must be in place in order to load previous results:
- `Output/llm_summarization_benchmark/benchmark.pkl`
- `Output/llm_apis/cache.json`
Afterwards, simply run the benchmark again; already processed results will be skipped.
Document store in Resources folder, containing ID, title, abstract and reference summaries.
1-N reference summaries can be provided per paper.
Multiple reference summaries improve evaluation robustness and reduce single-annotator bias.
```json
[
  {
    "title": "Paper Title",
    "abstract": "Paper abstract text...",
    "id": "paper_001",
    "summaries": [
      "This paper analyzes ..",
      "The paper investigates .."
    ]
  }
]
```
- Highlight sections of Elsevier and Cell papers, manually extracted and joined by ". ".
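As a quick illustration (not part of the benchmark code), the document store can be loaded and sanity-checked with a few lines of Python; the file name `documents.json` is an assumption, so substitute the actual file in `Resources/`:

```python
import json
from pathlib import Path

# Hypothetical file name - adjust to the actual document store in Resources/.
store_path = Path("Resources/documents.json")

with store_path.open(encoding="utf-8") as fh:
    documents = json.load(fh)

for doc in documents:
    # Each entry must provide an ID, title, abstract and at least one reference summary.
    assert {"id", "title", "abstract", "summaries"}.issubset(doc)
    assert len(doc["summaries"]) >= 1
    print(f'{doc["id"]}: {len(doc["summaries"])} reference summary/summaries')
```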
- Tokenizes sentences (nltk)
- Creates TF-IDF vectors for sentence representation (sklearn)
- Calculates cosine similarities between TF-IDF vectors (sklearn)
- Builds similarity graph with cosine similarities as edge weights (networkx)
- Applies PageRank to rank sentences by importance (networkx)
- Selects the highest-scoring sentences within word count limits while preserving original order (see the sketch below)
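A minimal, self-contained sketch of this TF-IDF/PageRank pipeline (illustrative only, not the benchmark's implementation; the 100-word limit is an assumption):

```python
import networkx as nx
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # required by newer NLTK releases

def textrank_summary(document: str, max_words: int = 100) -> str:
    sentences = nltk.sent_tokenize(document)
    # TF-IDF vectors as sentence representations.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    # Pairwise cosine similarities become the edge weights of the sentence graph.
    similarity = cosine_similarity(tfidf)
    graph = nx.from_numpy_array(similarity)
    scores = nx.pagerank(graph, weight="weight")
    # Pick the top-ranked sentences, then restore original order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, words = [], 0
    for idx in ranked:
        sentence_len = len(sentences[idx].split())
        if words + sentence_len > max_words:
            continue
        chosen.append(idx)
        words += sentence_len
    return " ".join(sentences[i] for i in sorted(chosen))
```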
WARNING: Results might be misleading when gold-standard summaries are (partial) copies of the source document rather than abstractive summaries.
- Calculates word frequency scores
- Ranks sentences by average word frequency, excluding stopwords (nltk)
- Selects the highest-scoring sentences (in original order) within word count limits (see the sketch below)
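A rough sketch of this frequency-based baseline under the same assumptions as above (illustrative only):

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("stopwords", quiet=True)

def frequency_summary(document: str, max_words: int = 100) -> str:
    stop = set(stopwords.words("english"))
    sentences = nltk.sent_tokenize(document)
    content_words = [w.lower() for w in nltk.word_tokenize(document)
                     if w.isalpha() and w.lower() not in stop]
    freq = Counter(content_words)

    def score(sentence: str) -> float:
        # Average frequency of the sentence's content words (stopwords excluded).
        tokens = [w.lower() for w in nltk.word_tokenize(sentence)
                  if w.isalpha() and w.lower() not in stop]
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen, count = [], 0
    for idx in ranked:
        n = len(sentences[idx].split())
        if count + n > max_words:
            continue
        chosen.append(idx)
        count += n
    return " ".join(sentences[i] for i in sorted(chosen))
```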
- Anthropic, Mistral, OpenAI, HuggingFace, Ollama
Each generated summary is evaluated against all available gold-standard reference summaries of a document using the metrics listed below. For each metric, mean/min/max/std and individual counts are computed.
Set of metrics for evaluating summary quality by comparing to reference summaries. wiki | package | publication
- ROUGE-N: N-gram co-occurrence statistics between system and reference summaries.
  - ROUGE-1: Overlap of unigrams (individual words)
  - ROUGE-2: Overlap of bigrams (word pairs)
- ROUGE-L: Longest Common Subsequence (LCS) based statistics that capture sentence-level structure similarity by awarding credit only to in-sequence word matches (see the example below).
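For illustration, ROUGE-1/2/L scores can be computed with the `rouge-score` package; the benchmark's actual implementation and parameters may differ, and the example texts are made up:

```python
from rouge_score import rouge_scorer

# Hypothetical example texts; all scores range from 0 to 1.
reference = "The paper analyzes kidney organoid proteomics."
candidate = "This study analyzes the proteomics of kidney organoids."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f} recall={result.recall:.2f} f1={result.fmeasure:.2f}")
```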
Semantic similarity using BERT embeddings. paper | package
- `roberta-large`: Default model (paper | model)
- `microsoft/deberta-xlarge-mnli`: Proposed as "better model" (paper | model)
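A hedged example using the `bert-score` package (the example texts are made up; the benchmark's configuration may differ):

```python
from bert_score import score

# Hypothetical candidate/reference pair; both lists must have equal length.
candidates = ["This study analyzes the proteomics of kidney organoids."]
references = ["The paper analyzes kidney organoid proteomics."]

# lang="en" selects the default English model (roberta-large);
# pass model_type="microsoft/deberta-xlarge-mnli" to use the proposed alternative.
precision, recall, f1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {f1.mean().item():.3f}")
```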
Matches words via exact matching, stemming, and synonyms, and takes word order into account. Claimed to outperform BLEU. paper | function
N-gram overlaps with brevity penalty. paper | function
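Both metrics are available in NLTK; a small illustrative example (the benchmark's exact calls and smoothing settings may differ):

```python
import nltk
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from nltk.translate.meteor_score import meteor_score

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("wordnet", quiet=True)

reference = nltk.word_tokenize("The paper analyzes kidney organoid proteomics.")
candidate = nltk.word_tokenize("This study analyzes the proteomics of kidney organoids.")

# METEOR expects pre-tokenized input in recent NLTK versions.
print("METEOR:", meteor_score([reference], candidate))
# Smoothing avoids zero scores when higher-order n-grams do not overlap.
print("BLEU:", sentence_bleu([reference], candidate, smoothing_function=SmoothingFunction().method1))
```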
Semantic similarity using sentence transformers. Compares the generated summary directly against the source document (rather than against reference summaries, as the other metrics do). model
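A sketch with the `sentence-transformers` package; the model name `all-MiniLM-L6-v2` is an assumption, so substitute the model actually configured for the benchmark:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical model name - replace with the configured sentence transformer.
model = SentenceTransformer("all-MiniLM-L6-v2")

source_document = "Full abstract or paper text ..."
generated_summary = "This study analyzes the proteomics of kidney organoids."

embeddings = model.encode([source_document, generated_summary], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity to source: {similarity:.3f}")
```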
Factual consistency evaluation using the abstract. paper | modified repository
- Execution Time: Processing time
- Length Compliance Metrics (see the sketch below)
  - Within Bounds: Percentage meeting length constraints
  - Too Short/Long: Violation statistics with percentages
  - Average Length: Mean word count with standard deviation
  - Length Distribution: Detailed statistical analysis
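A minimal sketch of how these statistics can be derived from word counts; the 75-125 word bounds are placeholder assumptions, not the benchmark's actual limits:

```python
import statistics

def length_compliance(summaries: list[str], min_words: int = 75, max_words: int = 125) -> dict:
    """Illustrative computation of length-compliance statistics (bounds are assumptions)."""
    lengths = [len(s.split()) for s in summaries]
    n = len(lengths)
    too_short = sum(1 for length in lengths if length < min_words)
    too_long = sum(1 for length in lengths if length > max_words)
    return {
        "within_bounds_pct": 100 * (n - too_short - too_long) / n,
        "too_short_pct": 100 * too_short / n,
        "too_long_pct": 100 * too_long / n,
        "avg_length": statistics.mean(lengths),
        "std_length": statistics.stdev(lengths) if n > 1 else 0.0,
    }
```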
Systematic evaluation and benchmarking of text summarization methods for biomedical literature: From word-frequency methods to language models. Baumgärtel F, Bono E, Fillinger L, Galou L, Kęska-Izworska K, Walter S, Andorfer P, Kratochwill K, Perco P, Ley M. bioRxiv 2026, doi.org/10.64898/2026.01.09.697335