Benchmarking tool for evaluating text summarization methods on scientific papers.
This project uses uv for package management.
- Clone this repository
- Install dependencies
  ```bash
  cd llm_summarization_benchmark
  uv sync
  uv run spacy download en_core_web_sm
  ```
- Install AlignScore-large
  ```bash
  mkdir -p Output/llm_summarization_benchmark
  cd Output/llm_summarization_benchmark
  wget https://huggingface.co/yzha/AlignScore/resolve/main/AlignScore-large.ckpt
  cd ../..
  ```
- Copy `Resources/example.env` to `Resources/.env` and adjust
- Run
  ```bash
  uv run benchmark
  ```
Hint: Individual LLM config parameters are stored in `/src/llm_apis/config.py`.
The following files must be in place in order to load previous results:
- `Output/llm_summarization_benchmark/benchmark.pkl`
- `Output/llm_apis/cache.json`
Afterwards, simply run the benchmark again; already processed results will be skipped.
Document store in Resources folder, containing ID, title, abstract and reference summaries.
1-N reference summaries can be provided per paper.
Multiple reference summaries improve evaluation robustness and reduce single-annotator bias.
```json
[
  {
    "title": "Paper Title",
    "abstract": "Paper abstract text...",
    "id": "paper_001",
    "summaries": [
      "This paper analyzes ..",
      "The paper investigates .."
    ]
  }
]
```
- Highlight sections of Elsevier and Cell papers, manually extracted and joined by ". ".
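As a quick illustration (not part of the benchmark code), the document store can be loaded and sanity-checked with a few lines of Python; the file name `documents.json` is an assumption, so substitute the actual file in `Resources/`:

```python
import json
from pathlib import Path

# Hypothetical file name - adjust to the actual document store in Resources/.
store_path = Path("Resources/documents.json")

with store_path.open(encoding="utf-8") as fh:
    documents = json.load(fh)

for doc in documents:
    # Each entry must provide an ID, title, abstract and at least one reference summary.
    assert {"id", "title", "abstract", "summaries"}.issubset(doc)
    assert len(doc["summaries"]) >= 1
    print(f'{doc["id"]}: {len(doc["summaries"])} reference summary/summaries')
```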
- Tokenizes sentences (nltk)
- Creates TF-IDF vectors for sentence representation (sklearn)
- Calculates cosine similarities between TF-IDF vectors (sklearn)
- Builds similarity graph with cosine similarities as edge weights (networkx)
- Applies PageRank to rank sentences by importance (networkx)
- Selects the highest-scoring sentences within word count limits while preserving original order (see the sketch below)
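A minimal, self-contained sketch of this TF-IDF/PageRank pipeline (illustrative only, not the benchmark's implementation; the 100-word limit is an assumption):

```python
import networkx as nx
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # required by newer NLTK releases

def textrank_summary(document: str, max_words: int = 100) -> str:
    sentences = nltk.sent_tokenize(document)
    # TF-IDF vectors as sentence representations.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    # Pairwise cosine similarities become the edge weights of the sentence graph.
    similarity = cosine_similarity(tfidf)
    graph = nx.from_numpy_array(similarity)
    scores = nx.pagerank(graph, weight="weight")
    # Pick the top-ranked sentences, then restore original order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, words = [], 0
    for idx in ranked:
        sentence_len = len(sentences[idx].split())
        if words + sentence_len > max_words:
            continue
        chosen.append(idx)
        words += sentence_len
    return " ".join(sentences[i] for i in sorted(chosen))
```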
WARNING: Results might be misleading when gold-standard summaries are (partial) copies of the source document rather than abstractive summaries.
- Calculates word frequency scores
- Ranks sentences by average word frequency, excluding stopwords (nltk)
- Selects the highest-scoring sentences (in original order) within word count limits (see the sketch below)
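A rough sketch of this frequency-based baseline under the same assumptions as above (illustrative only):

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("stopwords", quiet=True)

def frequency_summary(document: str, max_words: int = 100) -> str:
    stop = set(stopwords.words("english"))
    sentences = nltk.sent_tokenize(document)
    content_words = [w.lower() for w in nltk.word_tokenize(document)
                     if w.isalpha() and w.lower() not in stop]
    freq = Counter(content_words)

    def score(sentence: str) -> float:
        # Average frequency of the sentence's content words (stopwords excluded).
        tokens = [w.lower() for w in nltk.word_tokenize(sentence)
                  if w.isalpha() and w.lower() not in stop]
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen, count = [], 0
    for idx in ranked:
        n = len(sentences[idx].split())
        if count + n > max_words:
            continue
        chosen.append(idx)
        count += n
    return " ".join(sentences[i] for i in sorted(chosen))
```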
- Anthropic, Mistral, OpenAI, HuggingFace, Ollama
Each generated summary is evaluated against all available gold-standard reference summaries of a document using the metrics listed below. For each metric, mean/min/max/std and individual counts are computed.
Set of metrics for evaluating summary quality by comparing to reference summaries. wiki | package | publication
- ROUGE-N: N-gram co-occurrence statistics between system and reference summaries.
  - ROUGE-1: Overlap of unigrams (individual words)
  - ROUGE-2: Overlap of bigrams (word pairs)
- ROUGE-L: Longest Common Subsequence (LCS) based statistics that capture sentence-level structure similarity by awarding credit only to in-sequence word matches (see the example below).
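For illustration, ROUGE-1/2/L scores can be computed with the `rouge-score` package; the benchmark's actual implementation and parameters may differ, and the example texts are made up:

```python
from rouge_score import rouge_scorer

# Hypothetical example texts; all scores range from 0 to 1.
reference = "The paper analyzes kidney organoid proteomics."
candidate = "This study analyzes the proteomics of kidney organoids."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f} recall={result.recall:.2f} f1={result.fmeasure:.2f}")
```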
Semantic similarity using BERT embeddings. paper | package
- `roberta-large`: Default model (paper | model)
- `microsoft/deberta-xlarge-mnli`: Proposed as "better model" (paper | model)
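A hedged example using the `bert-score` package (the example texts are made up; the benchmark's configuration may differ):

```python
from bert_score import score

# Hypothetical candidate/reference pair; both lists must have equal length.
candidates = ["This study analyzes the proteomics of kidney organoids."]
references = ["The paper analyzes kidney organoid proteomics."]

# lang="en" selects the default English model (roberta-large);
# pass model_type="microsoft/deberta-xlarge-mnli" to use the proposed alternative.
precision, recall, f1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {f1.mean().item():.3f}")
```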
Matches words via exact matching, stemming, and synonyms, and takes word order into account. Claimed to outperform BLEU. paper | function
N-gram overlaps with brevity penalty. paper | function
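Both metrics are available in NLTK; a small illustrative example (the benchmark's exact calls and smoothing settings may differ):

```python
import nltk
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from nltk.translate.meteor_score import meteor_score

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("wordnet", quiet=True)

reference = nltk.word_tokenize("The paper analyzes kidney organoid proteomics.")
candidate = nltk.word_tokenize("This study analyzes the proteomics of kidney organoids.")

# METEOR expects pre-tokenized input in recent NLTK versions.
print("METEOR:", meteor_score([reference], candidate))
# Smoothing avoids zero scores when higher-order n-grams do not overlap.
print("BLEU:", sentence_bleu([reference], candidate, smoothing_function=SmoothingFunction().method1))
```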
Semantic similarity using sentence transformers. Compares the generated summary directly against the source document (rather than against reference summaries, as the other metrics do). model
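A sketch with the `sentence-transformers` package; the model name `all-MiniLM-L6-v2` is an assumption, so substitute the model actually configured for the benchmark:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical model name - replace with the configured sentence transformer.
model = SentenceTransformer("all-MiniLM-L6-v2")

source_document = "Full abstract or paper text ..."
generated_summary = "This study analyzes the proteomics of kidney organoids."

embeddings = model.encode([source_document, generated_summary], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity to source: {similarity:.3f}")
```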
Factual consistency evaluation using the abstract. paper | modified repository
- Execution Time: Processing time
- Length Compliance Metrics (see the sketch below)
  - Within Bounds: Percentage meeting length constraints
  - Too Short/Long: Violation statistics with percentages
  - Average Length: Mean word count with standard deviation
  - Length Distribution: Detailed statistical analysis
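A minimal sketch of how these statistics can be derived from word counts; the 75-125 word bounds are placeholder assumptions, not the benchmark's actual limits:

```python
import statistics

def length_compliance(summaries: list[str], min_words: int = 75, max_words: int = 125) -> dict:
    """Illustrative computation of length-compliance statistics (bounds are assumptions)."""
    lengths = [len(s.split()) for s in summaries]
    n = len(lengths)
    too_short = sum(1 for length in lengths if length < min_words)
    too_long = sum(1 for length in lengths if length > max_words)
    return {
        "within_bounds_pct": 100 * (n - too_short - too_long) / n,
        "too_short_pct": 100 * too_short / n,
        "too_long_pct": 100 * too_long / n,
        "avg_length": statistics.mean(lengths),
        "std_length": statistics.stdev(lengths) if n > 1 else 0.0,
    }
```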
Systematic evaluation and benchmarking of text summarization methods for biomedical literature: From word-frequency methods to language models. Baumgärtel F, Bono E, Fillinger L, Galou L, Kęska-Izworska K, Walter S, Andorfer P, Kratochwill K, Perco P, Ley M. bioRxiv 2026, doi.org/10.64898/2026.01.09.697335