[A Python package](https://github.com/Yale-LILY/SummEval)
- Kryściński, Wojciech, et al. “Neural text summarization: A critical evaluation.”
- Zhang, Tianyi, et al. “Benchmarking large language models for news summarization.” code
- Kryściński, Wojciech, et al. “Evaluating the factual consistency of abstractive text summarization.” code
- Pagnoni, Artidoro, Vidhisha Balachandran, and Yulia Tsvetkov. “Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics.” code
- Maynez, Joshua, et al. “On faithfulness and factuality in abstractive summarization.”
- Lin, Chin-Yew. “ROUGE: A package for automatic evaluation of summaries.” (see the usage sketch after this list)
- Banerjee, Satanjeev, and Alon Lavie. “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments.”
- Zhang, Tianyi, et al. “BERTScore: Evaluating text generation with BERT.” code
- Zhao, Wei, et al. “MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance.” code
- Fabbri, Alexander R., et al. “SummEval: Re-evaluating summarization evaluation.”
- He, Tingting, et al. “ROUGE-C: A fully automated evaluation method for multi-document summarization.”
- Liu, Yang, et al. “GPTEval: NLG evaluation using GPT-4 with better human alignment.”
- Laban, Philippe, et al. “SummaC: Re-visiting NLI-based models for inconsistency detection in summarization.”
- Gekhman, Zorik, et al. “TrueTeacher: Learning factual consistency evaluation with large language models.”
- Scialom, Thomas, et al. “Answers unite! unsupervised metrics for reinforced summarization models.”
- Durmus, Esin, He He, and Mona Diab. “FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization.”
- Scialom, Thomas, et al. “QuestEval: Summarization asks for fact-based evaluation.”
- Fabbri, Alexander R., et al. “QAFactEval: Improved QA-based factual consistency evaluation for summarization.”
- Böhm, Florian, et al. “Better rewards yield better summaries: Learning to summarise without references.”
- Stiennon, Nisan, et al. “Learning to summarize with human feedback.”
- Wu, Jeff, et al. “Recursively summarizing books with human feedback.”
- Manakul, Potsawee, Adian Liusie, and Mark JF Gales. “SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models.”
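
As a rough illustration of the reference-based metrics in the list above (ROUGE and BERTScore), here is a minimal sketch of how scores are typically computed. It assumes the standalone `rouge-score` and `bert-score` pip packages rather than the SummEval toolkit itself, and the candidate/reference strings are made-up examples.

```python
# Minimal sketch: ROUGE and BERTScore on a single candidate/reference pair.
# Assumes `pip install rouge-score bert-score`; strings below are illustrative.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The cat sat on the mat near the window."
candidate = "A cat was sitting on the mat by the window."

# ROUGE-1 / ROUGE-L F1 between candidate and reference (Lin, 2004).
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print({name: round(s.fmeasure, 3) for name, s in rouge.items()})

# BERTScore precision/recall/F1 (Zhang et al.); downloads a model on first use.
P, R, F1 = bert_score([candidate], [reference], lang="en")
print({"bertscore_f1": round(F1.item(), 3)})
```

In practice these metrics are averaged over a whole test set of (candidate, reference) pairs; toolkits such as SummEval wrap many of them behind a common interface.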