LongTableBench: Benchmarking Long-Context Table Reasoning Across Real-World Formats and Domains

Official repository for LongTableBench (EMNLP'25 paper), a comprehensive benchmark for evaluating long-context reasoning over semi-structured tables across diverse formats, tasks, and domains.


🌟 Overview

LongTableBench is the first multitask benchmark designed to evaluate the reasoning ability of large language models (LLMs) over long-context semi-structured tables. It features diverse tasks, formats, and domains, ensuring comprehensive coverage of real-world reasoning challenges.

Key Features

  • 5,950 QA instances derived from 850 seed questions
  • 7 table formats: Markdown, HTML, JSON, LaTeX, SQL, XML, CSV
  • 18 domains, covering medical, finance, education, entertainment, etc.
  • Context lengths up to 128K tokens
  • Single-turn & multi-turn, single-table & multi-table scenarios
  • Rigorous symbolic verification, cross-model validation, and human review

🧩 Task Overview

LongTableBench includes six carefully designed tasks, evaluating three fundamental dimensions: Structural Complexity, Long-Range Dependencies, and Semantic Integration.

| Task | Abbr. | Description | Primary Challenge |
| --- | --- | --- | --- |
| Exact Matching | EM | Locate and extract exact cell values while maintaining structural awareness. | Structural |
| Basic Conditional Filtering | BCF | Apply simple logical filters (e.g., select rows by condition). | Structural |
| Fuzzy Conditional Manipulation | FCM | Handle sorting, aggregation, and approximate matching under fuzzy conditions. | Long-Range |
| Fact Retrieval | FR | Retrieve or verify facts grounded in tabular evidence. | Long-Range |
| External Knowledge Fusion | EKF | Combine tabular data with external or commonsense knowledge. | Semantic |
| Irregular Numerical Interpretation | INI | Interpret unconventional numerical formats (Roman numerals, date variants, etc.). | Semantic |

Each task can appear in single-turn / multi-turn and single-table / multi-table settings.


📈 Dataset Statistics

  • Total instances: 5,950
  • Length distribution:
    • 40% short (0–8K tokens)
    • 35% medium (8K–32K)
    • 25% long (32K–128K)
  • Task proportion:
    • 35% Structural (EM, BCF)
    • 35% Long-range (FCM, FR)
    • 30% Semantic (EKF, INI)


🧾 Benchmark Results

📋 Complete Results Table

Columns report scores by task (EM–INI plus the task average), by table format (MD–CSV plus the format average and the spread across formats, Range), and by input length (8K / 32K / 128K).
| Models | EM | BCF | FCM | FR | EKF | INI | Avg. | MD | HTML | JSON | SQL | XML | LaTeX | CSV | Avg. | Range | 8K | 32K | 128K |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-3B-Instruct | 24.30 | 14.78 | 13.76 | 2.99 | 11.99 | 9.68 | 13.94 | 14.35 | 13.36 | 11.72 | 16.10 | 13.14 | 12.35 | 15.23 | 13.75 | 31.83 | 21.76 | 11.92 | 8.15 |
| Qwen2.5-3B-Instruct | 17.14 | 7.07 | 9.40 | 33.74 | 36.44 | 18.63 | 20.40 | 21.04 | 20.19 | 22.46 | 24.66 | 21.00 | 20.11 | 19.25 | 21.25 | 25.49 | 27.20 | 18.21 | 15.78 |
| Phi-3-mini-128k-instruct | 28.66 | 9.74 | 8.29 | 33.34 | 19.05 | 17.39 | 20.25 | 20.12 | 21.75 | 22.65 | 20.83 | 20.78 | 19.52 | 16.54 | 20.31 | 30.09 | 25.26 | 21.24 | 14.24 |
| Phi-4-mini-instruct | 28.36 | 6.94 | 7.41 | 22.85 | 21.22 | 13.67 | 17.49 | 18.22 | 17.15 | 19.94 | 17.61 | 19.60 | 15.33 | 13.63 | 17.35 | 36.35 | 21.55 | 19.09 | 11.82 |
| Qwen3-4B | 23.15 | 11.07 | 9.88 | 24.59 | 14.80 | 13.30 | 16.71 | 17.69 | 16.77 | 16.94 | 17.51 | 17.00 | 17.32 | 14.64 | 16.84 | 18.10 | 21.70 | 15.66 | 12.77 |
| Gemma-3-4B-it | 31.60 | 13.40 | 14.29 | 21.98 | 19.08 | 19.97 | 20.56 | 21.98 | 23.24 | 18.92 | 20.87 | 20.30 | 22.29 | 20.74 | 21.19 | 20.37 | 28.67 | 19.95 | 13.07 |
| Llama-3.1-8B-Instruct | 47.73 | 19.09 | 21.29 | 15.11 | 22.41 | 22.97 | 25.69 | 26.94 | 24.68 | 22.07 | 27.51 | 25.29 | 26.56 | 28.83 | 25.98 | 26.04 | 34.66 | 22.66 | 19.74 |
| Qwen2.5-7B-Instruct | 35.85 | 15.18 | 20.58 | 50.78 | 39.11 | 30.50 | 32.05 | 31.93 | 35.30 | 36.89 | 32.43 | 34.74 | 33.72 | 30.55 | 33.65 | 18.82 | 40.84 | 29.35 | 25.95 |
| Qwen2.5-7B-Instruct-1M | 47.52 | 20.13 | 23.79 | 52.77 | 24.88 | 24.65 | 33.61 | 34.88 | 35.29 | 35.77 | 34.44 | 35.64 | 33.73 | 31.53 | 34.47 | 12.31 | 40.55 | 32.43 | 27.83 |
| TableGPT2-7B | 41.10 | 17.68 | 18.80 | 45.34 | 39.82 | 28.64 | 32.26 | 34.46 | 32.52 | 34.10 | 32.40 | 36.73 | 33.33 | 30.48 | 33.43 | 18.70 | 42.15 | 29.11 | 25.50 |
| TableLLM-Qwen2-7B | 18.00 | 4.59 | 5.34 | 10.51 | 6.69 | 5.94 | 8.98 | 10.55 | 8.63 | 8.18 | 10.95 | 7.52 | 9.52 | 8.87 | 9.17 | 37.39 | 13.46 | 8.76 | 4.73 |
| TableLLM-Llama3.1-8B | 27.72 | 1.87 | 2.38 | 11.65 | 5.15 | 7.78 | 10.31 | 11.12 | 10.81 | 10.53 | 10.76 | 10.39 | 10.50 | 7.24 | 10.19 | 38.06 | 11.88 | 12.14 | 6.90 |
| TableLlama | 3.36 | 1.38 | 1.41 | 2.25 | 1.61 | 0.72 | 1.83 | 3.25 | 0.47 | 0.89 | 2.51 | 0.35 | 2.80 | 1.43 | 1.67 | 173.23 | 4.97 | 0.36 | 0.17 |
| Mistral-7B-Instruct-v0.3 | 29.32 | 14.16 | 15.59 | 19.89 | 15.46 | 14.85 | 19.19 | 20.56 | 19.67 | 16.81 | 18.97 | 19.06 | 18.56 | 20.69 | 19.19 | 20.27 | 29.15 | 19.47 | 8.95 |
| Ministral-8B-Instruct-2410 | 23.98 | 9.37 | 14.82 | 25.61 | 12.70 | 14.25 | 17.26 | 18.72 | 17.48 | 14.53 | 18.26 | 15.38 | 17.13 | 19.68 | 17.31 | 29.73 | 32.90 | 17.33 | 1.55 |
| Qwen3-8B | 35.97 | 14.18 | 17.02 | 46.68 | 27.87 | 25.29 | 28.55 | 29.54 | 28.11 | 28.76 | 30.65 | 27.46 | 29.32 | 29.06 | 28.99 | 11.01 | 39.73 | 25.11 | 20.80 |
| GLM-4-9B-Chat | 46.25 | 18.82 | 18.79 | 32.86 | 21.74 | 20.72 | 27.92 | 27.84 | 29.95 | 27.12 | 28.52 | 31.20 | 29.13 | 27.22 | 28.71 | 14.21 | 34.79 | 27.07 | 21.89 |
| GLM-4-9B-Chat-1M | 46.81 | 17.79 | 17.69 | 31.21 | 21.78 | 20.87 | 27.24 | 28.64 | 27.76 | 27.00 | 29.49 | 27.26 | 30.11 | 27.05 | 28.19 | 11.03 | 32.70 | 26.95 | 22.09 |
| GLM-4-9B-0414 | 31.90 | 17.74 | 20.17 | 35.61 | 28.19 | 21.78 | 26.51 | 28.41 | 27.46 | 23.89 | 27.11 | 24.82 | 28.64 | 30.09 | 27.20 | 22.77 | 44.61 | 23.58 | 11.35 |
| Gemma-3-12B-it | 46.34 | 21.51 | 27.48 | 46.77 | 42.90 | 29.93 | 36.90 | 39.40 | 37.47 | 34.02 | 39.84 | 34.67 | 39.61 | 37.95 | 37.57 | 15.49 | 49.83 | 34.71 | 26.16 |
| Mistral-Nemo-Instruct-2407 | 35.32 | 18.02 | 18.30 | 49.10 | 53.69 | 32.40 | 34.99 | 36.29 | 33.47 | 33.73 | 35.36 | 34.82 | 37.20 | 37.19 | 35.44 | 10.52 | 52.13 | 35.00 | 17.85 |
| Qwen2.5-14B-Instruct | 47.26 | 20.17 | 26.62 | 54.34 | 42.86 | 34.93 | 38.25 | 40.98 | 41.64 | 37.09 | 39.23 | 39.93 | 41.77 | 39.32 | 39.99 | 11.71 | 51.75 | 34.58 | 28.43 |
| Qwen2.5-14B-Instruct-1M | 56.51 | 22.16 | 29.39 | 63.84 | 37.37 | 37.15 | 42.37 | 43.93 | 44.94 | 39.14 | 44.65 | 44.27 | 46.20 | 43.80 | 43.85 | 16.12 | 50.98 | 40.11 | 36.03 |
| Qwen3-14B | 53.18 | 22.13 | 25.89 | 58.49 | 52.69 | 40.16 | 42.62 | 46.66 | 45.36 | 38.78 | 46.74 | 45.11 | 45.76 | 42.50 | 44.42 | 17.93 | 54.51 | 40.54 | 32.81 |
| Mistral-Small-3.1-24B-Instruct-2503 | 62.33 | 28.98 | 29.01 | 41.93 | 48.95 | 42.40 | 43.06 | 46.66 | 44.98 | 41.37 | 48.02 | 44.59 | 48.30 | 45.47 | 45.63 | 15.19 | 56.60 | 41.20 | 31.38 |
| Qwen3-30B-A3B | 48.73 | 16.54 | 22.67 | 55.53 | 44.91 | 33.66 | 37.74 | 40.10 | 38.85 | 36.40 | 38.45 | 37.86 | 41.95 | 39.78 | 39.05 | 14.21 | 51.40 | 35.31 | 26.50 |
| Qwen3-32B | 51.48 | 24.62 | 26.92 | 51.37 | 46.19 | 42.67 | 40.72 | 44.27 | 42.78 | 40.79 | 43.28 | 45.57 | 43.64 | 43.82 | 43.45 | 10.99 | 52.09 | 37.79 | 32.29 |
| GLM-4-32B-0414 | 43.71 | 25.43 | 26.39 | 55.39 | 51.14 | 35.24 | 40.48 | 47.21 | 39.62 | 35.61 | 41.88 | 36.39 | 44.51 | 45.79 | 41.57 | 27.91 | 58.09 | 43.04 | 20.31 |
| Llama-3.1-70B-Instruct | 62.58 | 31.90 | 29.15 | 56.69 | 36.26 | 37.67 | 44.35 | 49.85 | 46.14 | 39.74 | 46.16 | 45.18 | 46.16 | 45.78 | 47.99 | 7.73 | 52.62 | 46.84 | 33.57 |
| Llama-3.3-70B-Instruct | 60.33 | 33.37 | 33.47 | 41.31 | 41.67 | 37.13 | 42.74 | 47.88 | 41.49 | 35.83 | 46.87 | 38.57 | 49.11 | 48.72 | 44.07 | 30.12 | 56.65 | 40.96 | 30.60 |
| Qwen2.5-72B-Instruct | 59.43 | 32.15 | 33.72 | 63.55 | 57.17 | 45.96 | 49.94 | 52.63 | 52.16 | 45.10 | 51.74 | 52.57 | 52.23 | 53.53 | 51.42 | 16.40 | 63.97 | 46.51 | 39.36 |
| Mistral-Large-Instruct-2411 | 55.96 | 29.69 | 29.10 | 57.10 | 46.99 | 44.20 | 44.99 | 49.84 | 45.37 | 40.19 | 49.13 | 42.21 | 50.55 | 49.32 | 46.66 | 22.20 | 63.87 | 45.23 | 25.89 |
| DeepSeek-V3 | 69.63 | 44.51 | 43.65 | 66.36 | 57.44 | 54.18 | 57.09 | 61.39 | 62.37 | 53.67 | 62.42 | 58.96 | 61.71 | 59.71 | 60.03 | 14.57 | 70.80 | 54.67 | 45.80 |
| GPT-4o-mini-2024-07-18 | 60.43 | 31.73 | 33.17 | 49.55 | 35.26 | 36.10 | 42.76 | 41.78 | 40.85 | 37.53 | 43.19 | 41.04 | 41.46 | 41.42 | 41.04 | 13.79 | 54.44 | 38.61 | 35.22 |
| Gemini-2.0-flash | 71.19 | 47.51 | 42.16 | 50.05 | 60.76 | 53.39 | 55.03 | 58.66 | 57.69 | 58.14 | 58.95 | 58.20 | 57.33 | 58.44 | 58.20 | 2.79 | 65.13 | 55.25 | 44.72 |
| GPT-4o-2024-08-06 | 78.95 | 52.40 | 49.97 | 72.76 | 68.79 | 62.11 | 65.60 | 67.36 | 64.44 | 58.67 | 65.62 | 60.99 | 65.89 | 66.18 | 64.16 | 13.53 | 78.57 | 63.33 | 54.92 |
| Claude-3.5-sonnet-20241022 | 76.47 | 46.13 | 42.26 | 68.45 | 65.37 | 58.91 | 60.63 | 64.09 | 60.70 | 52.03 | 58.88 | 59.05 | 61.16 | 61.26 | 59.60 | 20.24 | 75.59 | 58.46 | 47.84 |
| Phi-4-mini-reasoning | 3.45 | 2.39 | 1.35 | 11.79 | 3.10 | 4.51 | 4.25 | 5.18 | 3.95 | 3.74 | 4.73 | 5.20 | 3.96 | 3.36 | 4.30 | 42.78 | 9.12 | 2.47 | 1.15 |
| Qwen3-4B-thinking | 30.85 | 7.06 | 8.62 | 21.96 | 24.78 | 19.87 | 18.92 | 18.26 | 17.75 | 19.37 | 21.35 | 22.89 | 20.31 | 16.95 | 19.55 | 30.38 | 28.56 | 15.49 | 12.70 |
| DeepSeek-R1-Distill-Qwen-7B | 7.96 | 4.00 | 4.52 | 13.78 | 13.13 | 6.61 | 8.28 | 8.99 | 5.50 | 5.58 | 7.51 | 5.63 | 8.22 | 7.73 | 7.02 | 49.72 | 17.26 | 5.06 | 2.52 |
| DeepSeek-R1-Distill-Llama-8B | 33.25 | 25.24 | 15.19 | 38.07 | 33.58 | 20.53 | 29.28 | 30.71 | 30.12 | 23.38 | 28.28 | 27.34 | 34.75 | 28.54 | 29.02 | 39.20 | 39.95 | 31.87 | 16.02 |
| Qwen3-8B-thinking | 34.20 | 8.61 | 10.86 | 32.18 | 29.97 | 24.04 | 23.32 | 24.94 | 24.82 | 22.87 | 24.51 | 27.03 | 24.23 | 22.29 | 24.38 | 19.43 | 34.00 | 18.83 | 17.13 |
| GLM-Z1-9B-0414 | 42.27 | 33.97 | 20.53 | 44.88 | 46.19 | 30.39 | 37.89 | 42.52 | 31.79 | 38.41 | 41.63 | 33.86 | 41.49 | 40.38 | 38.58 | 27.81 | 59.57 | 39.25 | 14.84 |
| Qwen3-14B-thinking | 46.66 | 14.54 | 18.38 | 51.10 | 42.99 | 31.72 | 35.31 | 38.05 | 35.64 | 32.39 | 37.88 | 38.37 | 36.80 | 35.64 | 36.39 | 16.43 | 47.94 | 30.40 | 27.60 |
| Qwen3-30B-A3B-thinking | 40.10 | 11.91 | 16.28 | 37.86 | 35.36 | 27.16 | 28.63 | 29.81 | 29.40 | 29.25 | 30.86 | 31.47 | 28.94 | 27.94 | 29.67 | 11.88 | 42.79 | 24.05 | 19.04 |
| QwQ-32B | 59.44 | 40.07 | 28.73 | 59.33 | 58.68 | 45.93 | 50.31 | 54.12 | 54.04 | 47.39 | 53.57 | 48.58 | 53.99 | 52.80 | 52.07 | 12.91 | 65.43 | 49.72 | 35.77 |
| DeepSeek-R1-Distill-Qwen-32B | 50.73 | 39.06 | 25.43 | 52.48 | 57.16 | 44.55 | 45.60 | 52.13 | 46.52 | 41.33 | 49.58 | 44.12 | 49.71 | 50.45 | 47.69 | 22.66 | 66.13 | 47.76 | 22.91 |
| Qwen3-32B-thinking | 48.29 | 14.86 | 17.70 | 48.90 | 41.80 | 34.08 | 34.94 | 38.95 | 36.09 | 30.94 | 36.56 | 38.58 | 35.94 | 36.84 | 36.27 | 22.08 | 48.09 | 30.15 | 26.57 |
| GLM-Z1-32B-0414 | 47.70 | 37.50 | 25.13 | 39.45 | 57.02 | 34.67 | 41.81 | 43.61 | 38.26 | 44.90 | 42.23 | 39.84 | 45.69 | 42.71 | 42.46 | 17.49 | 58.74 | 41.11 | 25.59 |
| DeepSeek-R1-Distill-Llama-70B | 56.02 | 42.10 | 27.57 | 51.87 | 52.61 | 44.58 | 47.17 | 51.20 | 46.90 | 43.42 | 53.20 | 43.82 | 52.08 | 51.07 | 48.81 | 20.03 | 62.88 | 49.64 | 28.98 |
| DeepSeek-R1 | 71.24 | 66.21 | 42.41 | 62.68 | 65.28 | 57.87 | 62.82 | 67.92 | 67.84 | 61.26 | 66.55 | 64.21 | 65.81 | 65.29 | 65.55 | 10.15 | 74.63 | 64.37 | 49.45 |
| Gemini-2.0-flash-thinking-exp-01-21 | 69.95 | 54.74 | 39.76 | 51.48 | 64.26 | 54.04 | 56.98 | 58.61 | 61.44 | 61.25 | 60.69 | 59.59 | 59.47 | 60.13 | 60.17 | 4.70 | 70.16 | 56.05 | 44.74 |

Complete results are available in our paper.


πŸ“ Dataset Format

The dataset is organized as follows:

datasets/
β”œβ”€β”€ tables/          # Table files (by source and length)
└── questions/       # QA files (organized by length & turn type)
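As a quick orientation, here is a minimal sketch that lists the QA files under datasets/questions/, grouped by length bucket. The bucket names (e.g. 8k) follow the path used in the Quick Start example below and are an assumption about the exact layout; adjust the paths if yours differs.

import os

# List the QA files per length bucket under datasets/questions/.
# Bucket names such as "8k" follow the Quick Start example; adjust if needed.
questions_root = "datasets/questions"
for bucket in sorted(os.listdir(questions_root)):
    bucket_dir = os.path.join(questions_root, bucket)
    if os.path.isdir(bucket_dir):
        files = sorted(f for f in os.listdir(bucket_dir) if f.endswith(".json"))
        print(f"{bucket}: {files}")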

Single-Turn Format

{
  "question_id": "Unique ID",
  "db_id": "Database ID",
  "question": "Question text (possibly with external knowledge)",
  "answer": "Ground truth (list/dict format)",
  "highlighted_table": ["Relevant table IDs"],
  "is_multi_table": true,
  "question_type": "EM | BCF | FCM | FR | EKF | INI",
  "db_path": "Path to the source table"
}
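A minimal sketch of loading a single-turn split and tallying it by task type, using the field names from the schema above. The path follows the Quick Start example, and the file is assumed to be a JSON list of such instances.

import json
from collections import Counter

# Path taken from the Quick Start example; the file is assumed to be a JSON list.
with open("datasets/questions/8k/longtablebench_single.json", encoding="utf-8") as f:
    instances = json.load(f)

print(len(instances), "instances")
print(Counter(item["question_type"] for item in instances))  # EM / BCF / FCM / FR / EKF / INI
multi = sum(1 for item in instances if item["is_multi_table"])
print(f"{multi} instances require reasoning over multiple tables")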

Multi-Turn Format

{
  "question_id": "Unique ID",
  "db_id": "Database ID",
  "evidence": "Optional external knowledge",
  "QA": [
    {"round": 1, "question": "Q1", "answer": "A1"},
    {"round": 2, "question": "Q2", "answer": "A2"}
  ],
  "highlighted_table": ["Relevant tables"],
  "is_multi_table": true,
  "question_type": "EM | BCF | FCM | FR | EKF | INI",
  "db_path": "Path to the table"
}
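For multi-turn instances, each round extends the running conversation. Below is a minimal sketch of unrolling one instance into chat messages; the file name is illustrative rather than the exact split name, and during evaluation the model's own replies would take the place of the gold answers.

import json

# Illustrative file name; pick the actual multi-turn split under datasets/questions/.
with open("datasets/questions/8k/longtablebench_multi.json", encoding="utf-8") as f:
    instances = json.load(f)

instance = instances[0]
messages = []
if instance.get("evidence"):
    # Optional external knowledge is provided up front.
    messages.append({"role": "system", "content": instance["evidence"]})
for turn in instance["QA"]:
    messages.append({"role": "user", "content": turn["question"]})
    # In a real run the model's reply would replace the gold answer here.
    messages.append({"role": "assistant", "content": turn["answer"]})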

🚀 Quick Start

Prerequisites

git clone https://github.com/liyaooi/LongTableBench
cd LongTableBench
pip install -r requirements.txt

1. Model Deployment (vLLM)

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --api-key $YOUR_API_KEY \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95 \
  --max_model_len 131072 \
  --trust-remote-code

Note: You can modify pred.py to use other serving frameworks.

2. Running Evaluation

python pred.py \
  --json-path datasets/questions/8k/longtablebench_single.json \
  --tables_path datasets/tables/ \
  --content_type single \
  --table_format markdown \
  --model-path Qwen/Qwen2.5-7B-Instruct \
  --output-path results/ \
  --key $YOUR_API_KEY

Note: You can also refer to pred.sh.

Command Line Options

| Parameter | Description | Default |
| --- | --- | --- |
| --json-path | Input JSON file | Required |
| --tables_path | Directory containing tables | Required |
| --content_type | Turn type: single or multi | Required |
| --table_format | Table format | markdown |
| --model-path | Model ID or path | Required |
| --output-path | Output directory | Required |
| --key | API key | Required |
| --url | Custom model API endpoint | None |
| --num-gpus-total | Total GPUs | 1 |
| --num-gpus-per-model | GPUs per job | 1 |

Model Calling Mechanism:

  • Default implementation uses the openai library for model inference
  • Can be adapted to other frameworks by modifying pred.py
  • Current implementation supports both single-turn and multi-turn conversations
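As a rough illustration of this mechanism (not the exact code in pred.py), the sketch below sends one single-turn question to the vLLM server started above through its OpenAI-compatible endpoint; the local base URL and the simple prompt template are assumptions.

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; adjust base_url if you serve elsewhere.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_API_KEY")

table_text = "..."  # the serialized table(s) in the chosen format, e.g. markdown
question = "..."    # a single-turn question from the dataset

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    temperature=0,  # greedy decoding, matching the evaluation protocol
    messages=[{"role": "user", "content": f"{table_text}\n\n{question}"}],
)
print(response.choices[0].message.content)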

🧪 Evaluation Protocol

  • Metric: F1 score (macro over structured answers)
  • Setting: Zero-shot, greedy decoding (temperature=0)
  • Truncation: Inputs longer than the context window are truncated in the middle (head and tail preserved); see the sketch after this list
  • FR task: Must include evidence (table cell or row reference)
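A minimal sketch of middle truncation under these settings: keep the head and tail of the token sequence and drop the middle when the prompt exceeds the budget. The tokenizer choice and the reserved output budget are illustrative assumptions, not the exact values used in the evaluation code.

from transformers import AutoTokenizer

def truncate_middle(text: str, tokenizer, max_tokens: int) -> str:
    # Keep the first and last portions of the token sequence, dropping the middle.
    ids = tokenizer.encode(text, add_special_tokens=False)
    if len(ids) <= max_tokens:
        return text
    half = max_tokens // 2
    kept = ids[:half] + ids[-(max_tokens - half):]
    return tokenizer.decode(kept)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
long_prompt = "..."  # serialized tables plus the question
prompt = truncate_middle(long_prompt, tokenizer, max_tokens=131072 - 1024)  # reserve room for the answer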

Note: The files in the category directory are required for proper functionality. If you modify their paths, ensure corresponding updates in eval/result_process.py to avoid errors.


📜 License


📚 Citation

If you use this benchmark in your research, please cite:

@inproceedings{li-etal-2025-longtablebench,
    title = "{L}ong{T}able{B}ench: Benchmarking Long-Context Table Reasoning across Real-World Formats and Domains",
    author = "Li, Liyao  and  Tian, Jiaming  and  Chen, Hao  and  Ye, Wentao  and  Ye, Chao  and  Wang, Haobo  and  Wang, Ningtao  and  Fu, Xing  and  Chen, Gang  and  Zhao, Junbo",
    editor = "Christodoulopoulos, Christos  and  Chakraborty, Tanmoy  and  Rose, Carolyn  and  Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.638/",
    pages = "11927--11965",
    ISBN = "979-8-89176-335-7"
}

πŸ“ Notes

  • Our code is under active development and may differ slightly from the version described in the paper.
  • For questions or issues, please open an issue on GitHub.
