From e1d18b52c74a938b01c1bdf0c0475667c43753b5 Mon Sep 17 00:00:00 2001
From: bruce2233 <54824693+bruce2233@users.noreply.github.com>
Date: Sat, 31 Jan 2026 06:37:55 +0000
Subject: [PATCH] Generate WAP paper page

---
 index.html                                      |  19 +
 .../grad-en.html                                |  88 ++++
 .../grad-zh.html                                |  88 ++++
 .../https-arxiv-org-abs-2511-13719/hs-en.html   |  67 +++
 .../https-arxiv-org-abs-2511-13719/hs-zh.html   |  67 +++
 .../https-arxiv-org-abs-2511-13719/index.html   |  97 ++++
 .../https-arxiv-org-abs-2511-13719/script.js    |  77 +++
 .../https-arxiv-org-abs-2511-13719/styles.css   | 457 ++++++++++++++++++
 8 files changed, 960 insertions(+)
 create mode 100644 papers/https-arxiv-org-abs-2511-13719/grad-en.html
 create mode 100644 papers/https-arxiv-org-abs-2511-13719/grad-zh.html
 create mode 100644 papers/https-arxiv-org-abs-2511-13719/hs-en.html
 create mode 100644 papers/https-arxiv-org-abs-2511-13719/hs-zh.html
 create mode 100644 papers/https-arxiv-org-abs-2511-13719/index.html
 create mode 100644 papers/https-arxiv-org-abs-2511-13719/script.js
 create mode 100644 papers/https-arxiv-org-abs-2511-13719/styles.css

diff --git a/index.html b/index.html
index a5082c9..56e7fc2 100644
--- a/index.html
+++ b/index.html
@@ -103,6 +103,25 @@
A technical deep-dive into deep multimodal parsing, adaptive retrieval, and agentic evidence synthesis.
+Current "Deep Research" systems (based on LLMs) are largely restricted to text-based web scraping. In professional and scientific domains, knowledge is dense in highly structured multimodal documents (PDFs/Scans). Standard RAG (Retrieval-Augmented Generation) pipelines fail here because they often "flatten" the structure, losing vital visual semantics like the relationship between a chart's axes or the hierarchical context of a table.
+Doc-Researcher employs a parsing engine that preserves multimodal integrity. It creates multi-granular representations:
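The individual granularities are not enumerated in this summary, so the snippet below is only a minimal sketch, assuming for illustration that the parser keeps page-level views and element-level units (text blocks, tables, figures) that stay linked to their source page instead of being flattened into plain text. All class and field names here are hypothetical, not the authors' schema.

```python
# Hedged sketch (not the authors' implementation): multi-granular views of a
# parsed document, where every element keeps its page, modality, and an optional
# image crop so visual semantics are not flattened away.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Element:
    """One parsed unit: a paragraph, table, figure, or equation."""
    page: int
    modality: str                     # "text" | "table" | "figure" | "equation"
    text: str                         # raw text, or a caption/summary for visual elements
    image_path: Optional[str] = None  # crop of the region, kept for visual retrieval

@dataclass
class MultiGranularDoc:
    doc_id: str
    elements: List[Element] = field(default_factory=list)

    def page_view(self, page: int) -> List[Element]:
        """Coarse granularity: everything on one page, in reading order."""
        return [e for e in self.elements if e.page == page]

    def by_modality(self, modality: str) -> List[Element]:
        """Fine granularity: individual tables/figures, still tied to their pages."""
        return [e for e in self.elements if e.modality == modality]

doc = MultiGranularDoc("report.pdf", [
    Element(3, "text", "Revenue grew 12% year over year ..."),
    Element(3, "table", "Quarterly revenue by region", "page3_table1.png"),
])
print([e.modality for e in doc.page_view(3)])  # ['text', 'table']
```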
+The system uses an architecture that supports three retrieval paradigms:
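The three paradigms are not spelled out in this extract; the sketch below assumes, purely for illustration, a text-embedding path, a vision-embedding path, and a hybrid of the two, with stub scoring functions standing in for real embedding models.

```python
# Hedged sketch of a retrieval dispatcher over three assumed paradigms
# (text-only, vision-only, hybrid). The scorers are stubs, not real models.
from typing import Callable, Dict, List, Tuple

def text_score(query: str, chunk: dict) -> float:
    # stub: word overlap between the query and the chunk's text field
    q = set(query.lower().split())
    return len(q & set(chunk["text"].lower().split())) / (len(q) or 1)

def vision_score(query: str, chunk: dict) -> float:
    # stub: pretend score from a vision-language embedding of the chunk's image
    return 0.5 if chunk.get("image_path") else 0.0

PARADIGMS: Dict[str, Callable[[str, dict], float]] = {
    "text": text_score,
    "vision": vision_score,
    "hybrid": lambda q, c: 0.5 * text_score(q, c) + 0.5 * vision_score(q, c),
}

def retrieve(query: str, chunks: List[dict], paradigm: str, k: int = 3) -> List[Tuple[float, dict]]:
    scorer = PARADIGMS[paradigm]
    ranked = sorted(((scorer(query, c), c) for c in chunks), key=lambda x: -x[0])
    return ranked[:k]

chunks = [{"text": "Quarterly revenue by region", "image_path": "page3_table1.png"},
          {"text": "Methodology and limitations"}]
print(retrieve("revenue by region", chunks, "hybrid", k=1))
```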
+Unlike single-pass retrieval, Doc-Researcher uses an agentic loop:
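A minimal sketch of such an iterative loop follows, assuming hypothetical `retrieve` and `llm` callables; the stopping rule and prompts are illustrative, not the paper's exact control flow.

```python
# Hedged sketch of an agentic retrieval loop: gather evidence, let the model
# judge sufficiency, refine the query, and only then synthesize an answer.
# `retrieve` and `llm` are hypothetical stand-ins for a real retriever and LLM.
def deep_research(question, retrieve, llm, max_steps=5):
    evidence, query = [], question
    for _ in range(max_steps):
        evidence.extend(retrieve(query))
        verdict = llm(
            f"Question: {question}\nEvidence so far: {evidence}\n"
            "Reply SUFFICIENT if this answers the question, "
            "otherwise reply with a refined follow-up query."
        )
        if verdict.strip().upper().startswith("SUFFICIENT"):
            break
        query = verdict  # retrieve again with the model's refined query
    return llm(
        f"Using only the evidence below, answer the question.\n"
        f"Question: {question}\nEvidence: {evidence}"
    )
```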
+To evaluate these capabilities, the authors introduced M4DocBench (Multi-modal, Multi-hop, Multi-document, and Multi-turn). It consists of 158 expert-level questions spanning 304 documents. This benchmark requires the model to "connect the dots" across multiple files and modalities.
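As a rough illustration of how such a benchmark might be consumed, the snippet below defines a hypothetical item schema (a question, its multiple source documents, and a gold answer) plus a naive exact-match accuracy loop; the real M4DocBench schema and judging protocol are not described in this summary.

```python
# Hedged sketch: a hypothetical M4DocBench-style item and a naive exact-match
# scorer. Field names and the judging rule are assumptions for illustration only.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class BenchItem:
    question: str
    documents: List[str]   # each question can span several files
    gold_answer: str

def accuracy(system: Callable[[str, List[str]], str], items: List[BenchItem]) -> float:
    correct = sum(
        1 for it in items
        if system(it.question, it.documents).strip().lower() == it.gold_answer.strip().lower()
    )
    return correct / len(items)
```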
+A technical deep-dive: deep multimodal parsing, adaptive retrieval, and agentic evidence synthesis.
+Current "Deep Research" systems (e.g., LLM-based ones) are largely limited to text-based web data. In professional domains, core knowledge often lives in highly structured multimodal documents (PDFs/scans). Traditional RAG (Retrieval-Augmented Generation) pipelines typically fail in this setting because they "flatten" the document, losing key visual semantics such as chart axes, visual hierarchy, and nested table relationships.
+Doc-Researcher adopts a parsing engine that preserves multimodal integrity. It builds a multi-granular representation hierarchy:
+Doc-Researcher supports three retrieval paradigms:
+Unlike single-pass retrieval, Doc-Researcher introduces an agentic loop:
+To comprehensively evaluate these capabilities, the authors propose M4DocBench (Multi-modal, Multi-hop, Multi-document, and Multi-turn). It contains 158 expert-annotated, high-difficulty questions covering 304 complex documents. The benchmark requires the model to "connect the dots" across files and modalities.
+Most AI systems only "read" text. Doc-Researcher is a new system that actually understands charts, tables, and layouts like a human expert does.
+Imagine asking an AI to analyze a 50-page financial report or a scientific paper. Most current AIs can grab the text, but they get confused by complex layout diagrams, math equations, or data hidden in tables. They treat everything like a flat block of words, missing the "visual language" of the document.
+The researchers created a three-step brain for the AI:
+The team created a new test called M4DocBench. It has 158 very hard questions that require "jumping" between different documents and looking at pictures to find the answer.
+If you want to see the specific technical architecture and deep data science behind this, check out the Graduate version.
+ View Graduate Version (EN)
+Most AI systems can only "read text." Doc-Researcher, by contrast, can read charts, tables, and document layouts the way a human expert does.
+Imagine being asked to analyze a 50-page financial report or a scientific paper. Most AIs can only extract the text; when they run into complex structural diagrams, math equations, or data hidden in tables, they get confused. Because the image and layout information is lost, the AI cannot pull deep knowledge out of truly professional documents.
+The researchers built three key components for the AI:
+The team created a new test called M4DocBench, with 158 very difficult questions that require the AI to "jump" between multiple documents and look at images to answer.
+If you want to understand the specific architecture and deeper data science behind this, check out the Graduate version.
 View Graduate Version (Chinese)
+A groundbreaking system that solves complex research queries by deeply parsing multimodal documents (figures, tables, charts) and using iterative agent workflows.