Official repository for LongTableBench (Findings of EMNLP 2025), a comprehensive benchmark for evaluating long-context reasoning over semi-structured tables across diverse formats, tasks, and domains.
LongTableBench is the first multitask benchmark designed to evaluate the reasoning ability of large language models (LLMs) over long-context semi-structured tables. It features diverse tasks, formats, and domains, ensuring comprehensive coverage of real-world reasoning challenges.
- 5,950 QA instances derived from 850 seed questions
- 7 table formats: Markdown, HTML, JSON, LaTeX, SQL, XML, CSV
- 18 domains, covering medical, finance, education, entertainment, etc.
- Context lengths up to 128K tokens
- Single-turn & multi-turn, single-table & multi-table scenarios
- Rigorous symbolic verification, cross-model validation, and human review
LongTableBench includes six carefully designed tasks, evaluating three fundamental dimensions: Structural Complexity, Long-Range Dependencies, and Semantic Integration.
| Task | Abbr. | Description | Primary Challenge |
|---|---|---|---|
| Exact Matching | EM | Locate and extract exact cell values while maintaining structural awareness. | Structural |
| Basic Conditional Filtering | BCF | Apply simple logical filters (e.g., select rows by condition). | Structural |
| Fuzzy Conditional Manipulation | FCM | Handle sorting, aggregation, and approximate matching under fuzzy conditions. | Long-Range |
| Fact Retrieval | FR | Retrieve or verify facts grounded in tabular evidence. | Long-Range |
| External Knowledge Fusion | EKF | Combine tabular data with external or commonsense knowledge. | Semantic |
| Irregular Numerical Interpretation | INI | Interpret unconventional numerical formats (Roman numerals, date variants, etc.). | Semantic |
Each task can appear in single-turn / multi-turn and single-table / multi-table settings.
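For a flavor of the INI task, here is a minimal sketch of handling one irregular numerical format it covers — Roman numerals. This is an illustrative helper, not part of the benchmark code:

```python
def roman_to_int(s: str) -> int:
    """Convert a Roman numeral (e.g., 'XIV') to an integer."""
    values = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
    total = 0
    for i, ch in enumerate(s):
        v = values[ch]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        if i + 1 < len(s) and values[s[i + 1]] > v:
            total -= v
        else:
            total += v
    return total

print(roman_to_int("XIV"))    # 14
print(roman_to_int("MCMXC"))  # 1990
```

A model answering an INI question must perform this kind of normalization implicitly while also locating the right cell in a long table.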
- Total instances: 5,950
- Length distribution:
- 40% short (0–8K tokens)
- 35% medium (8K–32K)
- 25% long (32K–128K)
- Task proportion:
- 35% Structural (EM, BCF)
- 35% Long-range (FCM, FR)
- 30% Semantic (EKF, INI)
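The three length buckets can be expressed as a simple lookup. A sketch, assuming 1K = 1,024 tokens (`length_bucket` is our own name, not part of the release):

```python
def length_bucket(num_tokens: int) -> str:
    """Map a context length (in tokens) to the benchmark's length bucket."""
    if num_tokens <= 8 * 1024:
        return "short"   # 0-8K, ~40% of instances
    if num_tokens <= 32 * 1024:
        return "medium"  # 8K-32K, ~35% of instances
    if num_tokens <= 128 * 1024:
        return "long"    # 32K-128K, ~25% of instances
    raise ValueError("exceeds the 128K-token maximum")

print(length_bucket(20_000))  # medium
```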
Complete Results Table (Click to expand)
| Models | EM | BCF | FCM | FR | EKF | INI | Avg. | MD | HTML | JSON | SQL | XML | LaTeX | CSV | Avg. | Range | 8K | 32K | 128K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3.2-3B-Instruct | 24.30 | 14.78 | 13.76 | 2.99 | 11.99 | 9.68 | 13.94 | 14.35 | 13.36 | 11.72 | 16.10 | 13.14 | 12.35 | 15.23 | 13.75 | 31.83 | 21.76 | 11.92 | 8.15 |
| Qwen2.5-3B-Instruct | 17.14 | 7.07 | 9.40 | 33.74 | 36.44 | 18.63 | 20.40 | 21.04 | 20.19 | 22.46 | 24.66 | 21.00 | 20.11 | 19.25 | 21.25 | 25.49 | 27.20 | 18.21 | 15.78 |
| Phi-3-mini-128k-instruct | 28.66 | 9.74 | 8.29 | 33.34 | 19.05 | 17.39 | 20.25 | 20.12 | 21.75 | 22.65 | 20.83 | 20.78 | 19.52 | 16.54 | 20.31 | 30.09 | 25.26 | 21.24 | 14.24 |
| Phi-4-mini-instruct | 28.36 | 6.94 | 7.41 | 22.85 | 21.22 | 13.67 | 17.49 | 18.22 | 17.15 | 19.94 | 17.61 | 19.60 | 15.33 | 13.63 | 17.35 | 36.35 | 21.55 | 19.09 | 11.82 |
| Qwen3-4B | 23.15 | 11.07 | 9.88 | 24.59 | 14.80 | 13.30 | 16.71 | 17.69 | 16.77 | 16.94 | 17.51 | 17.00 | 17.32 | 14.64 | 16.84 | 18.10 | 21.70 | 15.66 | 12.77 |
| Gemma-3-4B-it | 31.60 | 13.40 | 14.29 | 21.98 | 19.08 | 19.97 | 20.56 | 21.98 | 23.24 | 18.92 | 20.87 | 20.30 | 22.29 | 20.74 | 21.19 | 20.37 | 28.67 | 19.95 | 13.07 |
| Llama-3.1-8B-Instruct | 47.73 | 19.09 | 21.29 | 15.11 | 22.41 | 22.97 | 25.69 | 26.94 | 24.68 | 22.07 | 27.51 | 25.29 | 26.56 | 28.83 | 25.98 | 26.04 | 34.66 | 22.66 | 19.74 |
| Qwen2.5-7B-Instruct | 35.85 | 15.18 | 20.58 | 50.78 | 39.11 | 30.50 | 32.05 | 31.93 | 35.30 | 36.89 | 32.43 | 34.74 | 33.72 | 30.55 | 33.65 | 18.82 | 40.84 | 29.35 | 25.95 |
| Qwen2.5-7B-Instruct-1M | 47.52 | 20.13 | 23.79 | 52.77 | 24.88 | 24.65 | 33.61 | 34.88 | 35.29 | 35.77 | 34.44 | 35.64 | 33.73 | 31.53 | 34.47 | 12.31 | 40.55 | 32.43 | 27.83 |
| TableGPT2-7B | 41.10 | 17.68 | 18.80 | 45.34 | 39.82 | 28.64 | 32.26 | 34.46 | 32.52 | 34.10 | 32.40 | 36.73 | 33.33 | 30.48 | 33.43 | 18.70 | 42.15 | 29.11 | 25.50 |
| TableLLM-Qwen2-7B | 18.00 | 4.59 | 5.34 | 10.51 | 6.69 | 5.94 | 8.98 | 10.55 | 8.63 | 8.18 | 10.95 | 7.52 | 9.52 | 8.87 | 9.17 | 37.39 | 13.46 | 8.76 | 4.73 |
| TableLLM-Llama3.1-8B | 27.72 | 1.87 | 2.38 | 11.65 | 5.15 | 7.78 | 10.31 | 11.12 | 10.81 | 10.53 | 10.76 | 10.39 | 10.50 | 7.24 | 10.19 | 38.06 | 11.88 | 12.14 | 6.90 |
| TableLlama | 3.36 | 1.38 | 1.41 | 2.25 | 1.61 | 0.72 | 1.83 | 3.25 | 0.47 | 0.89 | 2.51 | 0.35 | 2.80 | 1.43 | 1.67 | 173.23 | 4.97 | 0.36 | 0.17 |
| Mistral-7B-Instruct-v0.3 | 29.32 | 14.16 | 15.59 | 19.89 | 15.46 | 14.85 | 19.19 | 20.56 | 19.67 | 16.81 | 18.97 | 19.06 | 18.56 | 20.69 | 19.19 | 20.27 | 29.15 | 19.47 | 8.95 |
| Ministral-8B-Instruct-2410 | 23.98 | 9.37 | 14.82 | 25.61 | 12.70 | 14.25 | 17.26 | 18.72 | 17.48 | 14.53 | 18.26 | 15.38 | 17.13 | 19.68 | 17.31 | 29.73 | 32.90 | 17.33 | 1.55 |
| Qwen3-8B | 35.97 | 14.18 | 17.02 | 46.68 | 27.87 | 25.29 | 28.55 | 29.54 | 28.11 | 28.76 | 30.65 | 27.46 | 29.32 | 29.06 | 28.99 | 11.01 | 39.73 | 25.11 | 20.80 |
| GLM-4-9B-Chat | 46.25 | 18.82 | 18.79 | 32.86 | 21.74 | 20.72 | 27.92 | 27.84 | 29.95 | 27.12 | 28.52 | 31.20 | 29.13 | 27.22 | 28.71 | 14.21 | 34.79 | 27.07 | 21.89 |
| GLM-4-9B-Chat-1M | 46.81 | 17.79 | 17.69 | 31.21 | 21.78 | 20.87 | 27.24 | 28.64 | 27.76 | 27.00 | 29.49 | 27.26 | 30.11 | 27.05 | 28.19 | 11.03 | 32.70 | 26.95 | 22.09 |
| GLM-4-9B-0414 | 31.90 | 17.74 | 20.17 | 35.61 | 28.19 | 21.78 | 26.51 | 28.41 | 27.46 | 23.89 | 27.11 | 24.82 | 28.64 | 30.09 | 27.20 | 22.77 | 44.61 | 23.58 | 11.35 |
| Gemma-3-12B-it | 46.34 | 21.51 | 27.48 | 46.77 | 42.90 | 29.93 | 36.90 | 39.40 | 37.47 | 34.02 | 39.84 | 34.67 | 39.61 | 37.95 | 37.57 | 15.49 | 49.83 | 34.71 | 26.16 |
| Mistral-Nemo-Instruct-2407 | 35.32 | 18.02 | 18.30 | 49.10 | 53.69 | 32.40 | 34.99 | 36.29 | 33.47 | 33.73 | 35.36 | 34.82 | 37.20 | 37.19 | 35.44 | 10.52 | 52.13 | 35.00 | 17.85 |
| Qwen2.5-14B-Instruct | 47.26 | 20.17 | 26.62 | 54.34 | 42.86 | 34.93 | 38.25 | 40.98 | 41.64 | 37.09 | 39.23 | 39.93 | 41.77 | 39.32 | 39.99 | 11.71 | 51.75 | 34.58 | 28.43 |
| Qwen2.5-14B-Instruct-1M | 56.51 | 22.16 | 29.39 | 63.84 | 37.37 | 37.15 | 42.37 | 43.93 | 44.94 | 39.14 | 44.65 | 44.27 | 46.20 | 43.80 | 43.85 | 16.12 | 50.98 | 40.11 | 36.03 |
| Qwen3-14B | 53.18 | 22.13 | 25.89 | 58.49 | 52.69 | 40.16 | 42.62 | 46.66 | 45.36 | 38.78 | 46.74 | 45.11 | 45.76 | 42.50 | 44.42 | 17.93 | 54.51 | 40.54 | 32.81 |
| Mistral-Small-3.1-24B-Instruct-2503 | 62.33 | 28.98 | 29.01 | 41.93 | 48.95 | 42.40 | 43.06 | 46.66 | 44.98 | 41.37 | 48.02 | 44.59 | 48.30 | 45.47 | 45.63 | 15.19 | 56.60 | 41.20 | 31.38 |
| Qwen3-30B-A3B | 48.73 | 16.54 | 22.67 | 55.53 | 44.91 | 33.66 | 37.74 | 40.10 | 38.85 | 36.40 | 38.45 | 37.86 | 41.95 | 39.78 | 39.05 | 14.21 | 51.40 | 35.31 | 26.50 |
| Qwen3-32B | 51.48 | 24.62 | 26.92 | 51.37 | 46.19 | 42.67 | 40.72 | 44.27 | 42.78 | 40.79 | 43.28 | 45.57 | 43.64 | 43.82 | 43.45 | 10.99 | 52.09 | 37.79 | 32.29 |
| GLM-4-32B-0414 | 43.71 | 25.43 | 26.39 | 55.39 | 51.14 | 35.24 | 40.48 | 47.21 | 39.62 | 35.61 | 41.88 | 36.39 | 44.51 | 45.79 | 41.57 | 27.91 | 58.09 | 43.04 | 20.31 |
| Llama-3.1-70B-Instruct | 62.58 | 31.90 | 29.15 | 56.69 | 36.26 | 37.67 | 44.35 | 49.85 | 46.14 | 39.74 | 46.16 | 45.18 | 46.16 | 45.78 | 47.99 | 7.73 | 52.62 | 46.84 | 33.57 |
| Llama-3.3-70B-Instruct | 60.33 | 33.37 | 33.47 | 41.31 | 41.67 | 37.13 | 42.74 | 47.88 | 41.49 | 35.83 | 46.87 | 38.57 | 49.11 | 48.72 | 44.07 | 30.12 | 56.65 | 40.96 | 30.60 |
| Qwen2.5-72B-Instruct | 59.43 | 32.15 | 33.72 | 63.55 | 57.17 | 45.96 | 49.94 | 52.63 | 52.16 | 45.10 | 51.74 | 52.57 | 52.23 | 53.53 | 51.42 | 16.40 | 63.97 | 46.51 | 39.36 |
| Mistral-Large-Instruct-2411 | 55.96 | 29.69 | 29.10 | 57.10 | 46.99 | 44.20 | 44.99 | 49.84 | 45.37 | 40.19 | 49.13 | 42.21 | 50.55 | 49.32 | 46.66 | 22.20 | 63.87 | 45.23 | 25.89 |
| DeepSeek-V3 | 69.63 | 44.51 | 43.65 | 66.36 | 57.44 | 54.18 | 57.09 | 61.39 | 62.37 | 53.67 | 62.42 | 58.96 | 61.71 | 59.71 | 60.03 | 14.57 | 70.80 | 54.67 | 45.80 |
| GPT-4o-mini-2024-07-18 | 60.43 | 31.73 | 33.17 | 49.55 | 35.26 | 36.10 | 42.76 | 41.78 | 40.85 | 37.53 | 43.19 | 41.04 | 41.46 | 41.42 | 41.04 | 13.79 | 54.44 | 38.61 | 35.22 |
| Gemini-2.0-flash | 71.19 | 47.51 | 42.16 | 50.05 | 60.76 | 53.39 | 55.03 | 58.66 | 57.69 | 58.14 | 58.95 | 58.20 | 57.33 | 58.44 | 58.20 | 2.79 | 65.13 | 55.25 | 44.72 |
| GPT-4o-2024-08-06 | 78.95 | 52.40 | 49.97 | 72.76 | 68.79 | 62.11 | 65.60 | 67.36 | 64.44 | 58.67 | 65.62 | 60.99 | 65.89 | 66.18 | 64.16 | 13.53 | 78.57 | 63.33 | 54.92 |
| Claude-3.5-sonnet-20241022 | 76.47 | 46.13 | 42.26 | 68.45 | 65.37 | 58.91 | 60.63 | 64.09 | 60.70 | 52.03 | 58.88 | 59.05 | 61.16 | 61.26 | 59.60 | 20.24 | 75.59 | 58.46 | 47.84 |
| Phi-4-mini-reasoning | 3.45 | 2.39 | 1.35 | 11.79 | 3.10 | 4.51 | 4.25 | 5.18 | 3.95 | 3.74 | 4.73 | 5.20 | 3.96 | 3.36 | 4.30 | 42.78 | 9.12 | 2.47 | 1.15 |
| Qwen3-4B-thinking | 30.85 | 7.06 | 8.62 | 21.96 | 24.78 | 19.87 | 18.92 | 18.26 | 17.75 | 19.37 | 21.35 | 22.89 | 20.31 | 16.95 | 19.55 | 30.38 | 28.56 | 15.49 | 12.70 |
| DeepSeek-R1-Distill-Qwen-7B | 7.96 | 4.00 | 4.52 | 13.78 | 13.13 | 6.61 | 8.28 | 8.99 | 5.50 | 5.58 | 7.51 | 5.63 | 8.22 | 7.73 | 7.02 | 49.72 | 17.26 | 5.06 | 2.52 |
| DeepSeek-R1-Distill-Llama-8B | 33.25 | 25.24 | 15.19 | 38.07 | 33.58 | 20.53 | 29.28 | 30.71 | 30.12 | 23.38 | 28.28 | 27.34 | 34.75 | 28.54 | 29.02 | 39.20 | 39.95 | 31.87 | 16.02 |
| Qwen3-8B-thinking | 34.20 | 8.61 | 10.86 | 32.18 | 29.97 | 24.04 | 23.32 | 24.94 | 24.82 | 22.87 | 24.51 | 27.03 | 24.23 | 22.29 | 24.38 | 19.43 | 34.00 | 18.83 | 17.13 |
| GLM-Z1-9B-0414 | 42.27 | 33.97 | 20.53 | 44.88 | 46.19 | 30.39 | 37.89 | 42.52 | 31.79 | 38.41 | 41.63 | 33.86 | 41.49 | 40.38 | 38.58 | 27.81 | 59.57 | 39.25 | 14.84 |
| Qwen3-14B-thinking | 46.66 | 14.54 | 18.38 | 51.10 | 42.99 | 31.72 | 35.31 | 38.05 | 35.64 | 32.39 | 37.88 | 38.37 | 36.80 | 35.64 | 36.39 | 16.43 | 47.94 | 30.40 | 27.60 |
| Qwen3-30B-A3B-thinking | 40.10 | 11.91 | 16.28 | 37.86 | 35.36 | 27.16 | 28.63 | 29.81 | 29.40 | 29.25 | 30.86 | 31.47 | 28.94 | 27.94 | 29.67 | 11.88 | 42.79 | 24.05 | 19.04 |
| QwQ-32B | 59.44 | 40.07 | 28.73 | 59.33 | 58.68 | 45.93 | 50.31 | 54.12 | 54.04 | 47.39 | 53.57 | 48.58 | 53.99 | 52.80 | 52.07 | 12.91 | 65.43 | 49.72 | 35.77 |
| DeepSeek-R1-Distill-Qwen-32B | 50.73 | 39.06 | 25.43 | 52.48 | 57.16 | 44.55 | 45.60 | 52.13 | 46.52 | 41.33 | 49.58 | 44.12 | 49.71 | 50.45 | 47.69 | 22.66 | 66.13 | 47.76 | 22.91 |
| Qwen3-32B-thinking | 48.29 | 14.86 | 17.70 | 48.90 | 41.80 | 34.08 | 34.94 | 38.95 | 36.09 | 30.94 | 36.56 | 38.58 | 35.94 | 36.84 | 36.27 | 22.08 | 48.09 | 30.15 | 26.57 |
| GLM-Z1-32B-0414 | 47.70 | 37.50 | 25.13 | 39.45 | 57.02 | 34.67 | 41.81 | 43.61 | 38.26 | 44.90 | 42.23 | 39.84 | 45.69 | 42.71 | 42.46 | 17.49 | 58.74 | 41.11 | 25.59 |
| DeepSeek-R1-Distill-Llama-70B | 56.02 | 42.10 | 27.57 | 51.87 | 52.61 | 44.58 | 47.17 | 51.20 | 46.90 | 43.42 | 53.20 | 43.82 | 52.08 | 51.07 | 48.81 | 20.03 | 62.88 | 49.64 | 28.98 |
| DeepSeek-R1 | 71.24 | 66.21 | 42.41 | 62.68 | 65.28 | 57.87 | 62.82 | 67.92 | 67.84 | 61.26 | 66.55 | 64.21 | 65.81 | 65.29 | 65.55 | 10.15 | 74.63 | 64.37 | 49.45 |
| Gemini-2.0-flash-thinking-exp-01-21 | 69.95 | 54.74 | 39.76 | 51.48 | 64.26 | 54.04 | 56.98 | 58.61 | 61.44 | 61.25 | 60.69 | 59.59 | 59.47 | 60.13 | 60.17 | 4.70 | 70.16 | 56.05 | 44.74 |
Complete results available in our paper.
The dataset is organized as follows:
```
datasets/
├── tables/      # Table files (by source and length)
└── questions/   # QA files (organized by length & turn type)
```
Single-turn QA format:

```json
{
  "question_id": "Unique ID",
  "db_id": "Database ID",
  "question": "Question text (possibly with external knowledge)",
  "answer": "Ground truth (list/dict format)",
  "highlighted_table": ["Relevant table IDs"],
  "is_multi_table": true,
  "question_type": "EM | BCF | FCM | FR | EKF | INI",
  "db_path": "Path to the source table"
}
```

Multi-turn QA format:

```json
{
  "question_id": "Unique ID",
  "db_id": "Database ID",
  "evidence": "Optional external knowledge",
  "QA": [
    {"round": 1, "question": "Q1", "answer": "A1"},
    {"round": 2, "question": "Q2", "answer": "A2"}
  ],
  "highlighted_table": ["Relevant tables"],
  "is_multi_table": true,
  "question_type": "EM | BCF | FCM | FR | EKF | INI",
  "db_path": "Path to the table"
}
```

Clone the repository and install dependencies:

```bash
git clone https://github.com/liyaooi/LongTableBench
cd LongTableBench
pip install -r requirements.txt
```

Serve a model with vLLM:

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --api-key $YOUR_API_KEY \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --max_model_len 131072 \
    --trust-remote-code
```

Note: You can modify `pred.py` to use other serving frameworks.

Run inference:

```bash
python pred.py \
    --json-path datasets/questions/8k/longtablebench_single.json \
    --tables_path datasets/tables/ \
    --content_type single \
    --table_format markdown \
    --model-path Qwen/Qwen2.5-7B-Instruct \
    --output-path results/ \
    --key $YOUR_API_KEY
```

Note: You can also refer to `pred.sh`.
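Under the hood, `pred.py` sends the serialized table plus the question to the served model through an OpenAI-compatible endpoint. A hedged sketch of how such a request might be assembled (the helper name and prompt wording are ours, not the repository's):

```python
def build_messages(table_text: str, question: str) -> list:
    """Assemble a single-turn chat request for an OpenAI-compatible server."""
    return [
        {"role": "system", "content": "Answer questions about the given table."},
        {"role": "user", "content": f"{table_text}\n\nQuestion: {question}"},
    ]

# Sending the request with the openai client (requires a running vLLM server):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_API_KEY")
# resp = client.chat.completions.create(
#     model="Qwen/Qwen2.5-7B-Instruct",
#     temperature=0,  # greedy decoding, matching the evaluation setting
#     messages=build_messages(table_md, "Who has the highest score?"),
# )
```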
| Parameter | Description | Default |
|---|---|---|
| `--json-path` | Input JSON file | Required |
| `--tables_path` | Directory containing tables | Required |
| `--content_type` | `single` or `multi` turn | Required |
| `--table_format` | Table format | `markdown` |
| `--model-path` | Model ID or path | Required |
| `--output-path` | Output directory | Required |
| `--key` | API key | Required |
| `--url` | Custom model API endpoint | None |
| `--num-gpus-total` | Total GPUs | 1 |
| `--num-gpus-per-model` | GPUs per job | 1 |
Model Calling Mechanism:
- Default implementation uses the `openai` library for model inference
- Can be adapted to other frameworks by modifying `pred.py`
- Current implementation supports both single-turn and multi-turn conversations

Evaluation settings:
- Metric: F1 score (macro over structured answers)
- Setting: Zero-shot, greedy decoding (`temperature=0`)
- Truncation: Middle truncation for inputs > context window
- FR task: Must include evidence (table cell or row reference)
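To illustrate the metric, here is a simplified sketch of a set-based F1 over structured answers, treating an answer as a multiset of cell values. The official scoring lives in `eval/`; this is an approximation for intuition only:

```python
from collections import Counter

def f1_score(pred, gold):
    """Multiset F1 between predicted and gold answer values."""
    if not pred or not gold:
        return float(pred == gold)  # both empty -> 1.0, one empty -> 0.0
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(f1_score(["Alice", "92"], ["Alice", "91"]))  # 0.5
```

Macro averaging would then take the mean of per-instance F1 scores across the benchmark.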
Note: The files in the `category` directory are required for proper functionality. If you modify their paths, update `eval/result_process.py` accordingly to avoid errors.
If you use this benchmark in your research, please cite:
```bibtex
@inproceedings{li-etal-2025-longtablebench,
    title = "{L}ong{T}able{B}ench: Benchmarking Long-Context Table Reasoning across Real-World Formats and Domains",
    author = "Li, Liyao and Tian, Jiaming and Chen, Hao and Ye, Wentao and Ye, Chao and Wang, Haobo and Wang, Ningtao and Fu, Xing and Chen, Gang and Zhao, Junbo",
    editor = "Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.638/",
    pages = "11927--11965",
    ISBN = "979-8-89176-335-7"
}
```

- Our code is under continuous optimization and may differ slightly from the version described in the paper.
- For questions or issues, please open an issue on GitHub.
