19 changes: 19 additions & 0 deletions index.html
@@ -103,6 +103,25 @@ <h2>快速入口</h2>

<div id="paper-list" class="paper-grid">
<a class="paper-card" href="/attention-is-all-you-need" data-title="Attention Is All You Need 注意力机制即你所需" data-tags="transformer attention machine translation encoder-decoder self-attention" data-arxiv="1706.03762">
<a class="paper-card" href="/https-arxiv-org-abs-2511-13719" data-title="Doc-Researcher: A Unified System for Multimodal Document Parsing and Deep Research" data-tags="multimodal-parsing hybrid-retrieval deep-research multi-agent benchmarks M4DocBench" data-arxiv="2510.21603">
<h3>
<span class="lang" data-lang="en">Doc-Researcher</span>
<span class="lang" data-lang="zh" lang="zh-Hans">Doc-Researcher</span>
</h3>
<div class="level" data-level="hs">
<span class="lang" data-lang="en">AI expert that reads charts, tables, and layouts like a human for complex documents.</span>
<span class="lang" data-lang="zh" lang="zh-Hans">像人类专家一样阅读图表、表格与布局,解决复杂文档研究任务。</span>
</div>
<div class="level" data-level="grad">
<span class="lang" data-lang="en">Deep multimodal parsing + hybrid retrieval paradigms + iterative multi-agent workflows; 50.6% on M4DocBench.</span>
<span class="lang" data-lang="zh" lang="zh-Hans">深度多模态解析 + 混合检索范式 + 迭代多代理流;在 M4DocBench 取得 50.6% 准确率。</span>
</div>
<div class="pill-row">
<span class="pill">arXiv 2510.21603</span>
<span class="pill">Multimodal</span>
<span class="pill">Agents</span>
</div>
</a>
<a class="paper-card" href="/attention-is-all-you-need" data-title="Attention Is All You Need 注意力机制即你所需" data-tags="transformer attention machine translation encoder-decoder self-attention" data-arxiv="1706.03762">
<h3>
<span class="lang" data-lang="en">Attention Is All You Need</span>
<span class="lang" data-lang="zh" lang="zh-Hans">注意力机制即你所需</span>
88 changes: 88 additions & 0 deletions papers/https-arxiv-org-abs-2511-13719/grad-en.html
@@ -0,0 +1,88 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Doc-Researcher (Grad-EN) | WAP</title>
<link rel="stylesheet" href="/papers/https-arxiv-org-abs-2511-13719/styles.css" />
<link href="https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;600;700&display=swap" rel="stylesheet" />
</head>
<body>
<div class="backdrop" aria-hidden="true"></div>
<div class="page">
<header class="hero reveal">
<div class="eyebrow">GRADUATE EDITION / ENGLISH</div>
<h1>Doc-Researcher: Overcoming the Multimodal Processing Bottleneck</h1>
<p class="subtitle">A technical deep-dive into deep multimodal parsing, adaptive retrieval, and agentic evidence synthesis.</p>
</header>

<nav class="section-nav reveal">
<a href="#motivation">Motivation</a>
<a href="#parsing">Deep Parsing</a>
<a href="#retrieval">Retrieval Architecture</a>
<a href="#agents">Agent Workflows</a>
<a href="#bench">M4DocBench</a>
<a href="#results">Results</a>
</nav>

<section id="motivation" class="chapter reveal" data-section>
<h2>Motivation & Problem Statement</h2>
<p>Current "Deep Research" systems (based on LLMs) are largely restricted to text-based web scraping. In professional and scientific domains, knowledge is dense in <strong>highly structured multimodal documents</strong> (PDFs/Scans). Standard RAG (Retrieval-Augmented Generation) pipelines fail here because they often "flatten" the structure, losing vital visual semantics like the relationship between a chart's axes or the hierarchical context of a table.</p>
</section>

<section id="parsing" class="chapter reveal" data-section>
<h2>I. Deep Multimodal Parsing</h2>
<p>Doc-Researcher employs a parsing engine that preserves <strong>multimodal integrity</strong>. It creates multi-granular representations:</p>
<ul>
<li><strong>Chunk-level:</strong> Captures local context including equations and inline symbols.</li>
<li><strong>Block-level:</strong> Respects logical visual boundaries (e.g., a specific figure with its caption).</li>
<li><strong>Document-level:</strong> Maintains layout hierarchy and global semantics.</li>
</ul>
<div class="callout">Key Innovation: The system maps visual elements to text descriptions while keeping the original pixel features for vision-centric retrieval.</div>
</section>

<section id="retrieval" class="chapter reveal" data-section>
<h2>II. Systematic Hybrid Retrieval</h2>
<p>The system utilizes an architecture that supports three paradigms:</p>
<ol>
<li><strong>Text-only:</strong> Standard semantic search on text chunks.</li>
<li><strong>Vision-only:</strong> Directly retrieving document segments based on visual similarity.</li>
<li><strong>Hybrid:</strong> Combining text and vision signals with <em>dynamic granularity selection</em>—choosing between fine-grained chunks or broader document context based on query ambiguity.</li>
</ol>
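<p>A minimal sketch of the hybrid paradigm with dynamic granularity selection (ours, not the paper's implementation); the retriever functions and the ambiguity test are stubs supplied by the caller.</p>
<pre><code># Illustrative sketch of hybrid retrieval with dynamic granularity selection.
def hybrid_retrieve(query, retrieve_text, retrieve_vision, is_ambiguous, k=10):
    text_hits = retrieve_text(query, k)      # paradigm 1: semantic search over text chunks
    vision_hits = retrieve_vision(query, k)  # paradigm 2: visual similarity over blocks/pages

    # Interleave the two ranked lists, keeping the first occurrence of each hit.
    fused, seen = [], set()
    for hit in (h for pair in zip(text_hits, vision_hits) for h in pair):
        if hit not in seen:
            seen.add(hit)
            fused.append(hit)

    # Dynamic granularity: ambiguous queries pull broader document context,
    # specific queries stay at the fine-grained chunk level.
    granularity = "document" if is_ambiguous(query) else "chunk"
    return granularity, fused[:k]
</code></pre>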
</section>

<section id="agents" class="chapter reveal" data-section>
<h2>III. Iterative Multi-Agent Workflows</h2>
<p>Unlike single-pass retrieval, Doc-Researcher uses an agentic loop:</p>
<ul>
<li><strong>Planner:</strong> Decomposes complex, multi-hop queries into sub-tasks.</li>
<li><strong>Searcher:</strong> Executes the hybrid retrieval to find candidates.</li>
<li><strong>Refiner:</strong> Evaluates retrieved evidence and decides if more searching is needed (iterative accumulation).</li>
<li><strong>Synthesizer:</strong> Integrates multimodal evidence to form a final, cited answer.</li>
</ul>
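<p>The loop can be sketched as below (ours, assuming each role is a callable provided by the caller rather than the paper's actual agents).</p>
<pre><code># Illustrative Planner / Searcher / Refiner / Synthesizer loop (not the paper's code).
def research(question, planner, searcher, refiner, synthesizer, max_rounds=5):
    sub_tasks = planner(question)                       # Planner: decompose the multi-hop query
    evidence = []
    for _ in range(max_rounds):
        for task in sub_tasks:
            evidence.extend(searcher(task))             # Searcher: hybrid retrieval per sub-task
        done, follow_ups = refiner(question, evidence)  # Refiner: is the evidence sufficient?
        if done:
            break
        sub_tasks = follow_ups                          # iterative evidence accumulation
    return synthesizer(question, evidence)              # Synthesizer: cited, multimodal answer
</code></pre>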
</section>

<section id="bench" class="chapter reveal" data-section>
<h2>M4DocBench & Evaluation</h2>
<p>To evaluate these capabilities, the authors introduced <strong>M4DocBench</strong> (Multi-modal, Multi-hop, Multi-document, and Multi-turn). It consists of 158 expert-level questions spanning 304 documents. This benchmark requires the model to "connect the dots" across multiple files and modalities.</p>
</section>

<section id="results" class="chapter reveal" data-section>
<h2>Experimental Outcomes</h2>
<div class="highlight-row">
<div class="highlight-card">
<strong>Direct Comparison</strong>
<div>50.6% accuracy vs. ~15% for state-of-the-art baselines (3.4x improvement).</div>
</div>
<div class="highlight-card">
<strong>Ablation</strong>
<div>Removing the "Visual Semantics" component caused the largest performance drop, indicating that layout information matters.</div>
</div>
</div>
</section>

<footer class="footer">WAP - Academic rigor for deep documents.</footer>
</div>
<script src="/papers/https-arxiv-org-abs-2511-13719/script.js"></script>
</body>
</html>
88 changes: 88 additions & 0 deletions papers/https-arxiv-org-abs-2511-13719/grad-zh.html
@@ -0,0 +1,88 @@
<!doctype html>
<html lang="zh-CN">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Doc-Researcher (研究生版) | WAP</title>
<link rel="stylesheet" href="/papers/https-arxiv-org-abs-2511-13719/styles.css" />
<link href="https://fonts.googleapis.com/css2?family=Noto+Sans+SC:wght@300;400;500;700&display=swap" rel="stylesheet" />
</head>
<body>
<div class="backdrop" aria-hidden="true"></div>
<div class="page">
<header class="hero reveal">
<div class="eyebrow">学术 / 研究生版</div>
<h1>Doc-Researcher:破解复杂文档多模态处理的瓶颈</h1>
<p class="subtitle">技术深潜:深度多模态解析、自适应检索与代理式证据合成。</p>
</header>

<nav class="section-nav reveal">
<a href="#motivation">动机</a>
<a href="#parsing">深度解析</a>
<a href="#retrieval">检索架构</a>
<a href="#agents">代理流</a>
<a href="#bench">评测体系</a>
<a href="#results">实验结果</a>
</nav>

<section id="motivation" class="chapter reveal" data-section>
<h2>研究动机与问题定义</h2>
<p>当前的“深度研究 (Deep Research)”系统(如基于 LLM 的系统)主要局限于文本类 Web 数据。在专业领域,核心知识往往以<strong>高度结构化的多模态文档</strong>(PDF/扫描件)形式存在。传统的 RAG(检索增强生成)流程在这种场景下通常会失效,因为它们将文档“扁平化”,丢失了图表轴线、视觉层次或表格嵌套关系等关键视觉语义。</p>
</section>

<section id="parsing" class="chapter reveal" data-section>
<h2>一、深度多模态解析引擎</h2>
<p>Doc-Researcher 采用了一种能够保持<strong>多模态完整性</strong>的解析引擎。它建立了多层级的表示体系:</p>
<ul>
<li><strong>块级 (Chunk-level):</strong> 捕捉局部上下文,包括行内公式和数学符号。</li>
<li><strong>模块级 (Block-level):</strong> 遵循逻辑视觉边界(例如带有标题的特定图表)。</li>
<li><strong>文档级 (Document-level):</strong> 维护全局的排版结构与语义。</li>
</ul>
<div class="callout">核心创新:该系统将视觉元素映射到文本描述,同时保留原始像素特征,用于视觉中心路径的检索。</div>
</section>

<section id="retrieval" class="chapter reveal" data-section>
<h2>二、系统化的混合检索架构</h2>
<p>Doc-Researcher 支持三种检索范式:</p>
<ol>
<li><strong>纯文本检索 (Text-only):</strong> 对文本块执行标准语义搜索。</li>
<li><strong>纯视觉检索 (Vision-only):</strong> 基于视觉相似度直接检索文档区域。</li>
<li><strong>混合检索 (Hybrid):</strong> 结合文本与视觉信号,并具备<em>动态粒度选择</em>能力——根据查询的模糊性在细粒度块或宏观文档上下文中自动切换。</li>
</ol>
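<p>以下为我们给出的示意性草图(并非论文源码):混合检索与动态粒度选择的一种可能写法,其中检索函数与模糊性判断均为调用方提供的假设性桩函数。</p>
<pre><code># 混合检索 + 动态粒度选择的示意性草图。
def hybrid_retrieve(query, retrieve_text, retrieve_vision, is_ambiguous, k=10):
    text_hits = retrieve_text(query, k)      # 范式一:对文本块做语义检索
    vision_hits = retrieve_vision(query, k)  # 范式二:对模块/页面图像做视觉相似度检索

    # 交错合并两个排序列表,只保留每个候选的首次出现。
    fused, seen = [], set()
    for hit in (h for pair in zip(text_hits, vision_hits) for h in pair):
        if hit not in seen:
            seen.add(hit)
            fused.append(hit)

    # 动态粒度:模糊查询返回更宏观的文档上下文,具体查询停留在细粒度块级。
    granularity = "document" if is_ambiguous(query) else "chunk"
    return granularity, fused[:k]
</code></pre>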
</section>

<section id="agents" class="chapter reveal" data-section>
<h2>三、迭代多智能体工作流</h2>
<p>不同于单次检索,Doc-Researcher 引入了代理循环:</p>
<ul>
<li><strong>规划者 (Planner):</strong> 将复杂的多跳查询拆分为子任务。</li>
<li><strong>搜寻者 (Searcher):</strong> 执行混合检索寻找候选证据。</li>
<li><strong>精炼者 (Refiner):</strong> 评估检索证据,决定是否需要继续搜索(迭代式累计)。</li>
<li><strong>合成者 (Synthesizer):</strong> 整合多模态证据,生成带有引用的最终答案。</li>
</ul>
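<p>该循环的一个极简示意如下(由我们给出,假设每个角色均为调用方提供的可调用对象,并非论文源码)。</p>
<pre><code># 规划者 / 搜寻者 / 精炼者 / 合成者循环的示意性草图。
def research(question, planner, searcher, refiner, synthesizer, max_rounds=5):
    sub_tasks = planner(question)                       # 规划者:拆解多跳查询
    evidence = []
    for _ in range(max_rounds):
        for task in sub_tasks:
            evidence.extend(searcher(task))             # 搜寻者:对每个子任务执行混合检索
        done, follow_ups = refiner(question, evidence)  # 精炼者:证据是否已经充分?
        if done:
            break
        sub_tasks = follow_ups                          # 迭代式证据累积
    return synthesizer(question, evidence)              # 合成者:生成带引用的多模态答案
</code></pre>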
</section>

<section id="bench" class="chapter reveal" data-section>
<h2>M4DocBench 高难度评测</h2>
<p>为了全面评估上述能力,作者提出了 <strong>M4DocBench</strong>(多模态、多跳、多文档、多轮对话)。它包含由专家标注的 158 个高难度问题,涉及 304 份复杂文档。该基准要求模型能够跨文件、跨模态“连接线索”。</p>
</section>

<section id="results" class="chapter reveal" data-section>
<h2>实验表现</h2>
<div class="highlight-row">
<div class="highlight-card">
<strong>直接对比</strong>
<div>Doc-Researcher 准确率达到 50.6%,约为目前最先进基线系统(~15%)的 3.4 倍。</div>
</div>
<div class="highlight-card">
<strong>消融实验</strong>
<div>移除“视觉语义”组件导致性能跌幅最大,证明了布局信息在文档理解中的核心地位。</div>
</div>
</div>
</section>

<footer class="footer">WAP - 为深度文档研究提供严谨洞察。</footer>
</div>
<script src="/papers/https-arxiv-org-abs-2511-13719/script.js"></script>
</body>
</html>
67 changes: 67 additions & 0 deletions papers/https-arxiv-org-abs-2511-13719/hs-en.html
@@ -0,0 +1,67 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Doc-Researcher (HS-EN) | WAP</title>
<link rel="stylesheet" href="/papers/https-arxiv-org-abs-2511-13719/styles.css" />
<link href="https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;600;700&display=swap" rel="stylesheet" />
</head>
<body>
<div class="backdrop" aria-hidden="true"></div>
<div class="page">
<header class="hero reveal">
<div class="eyebrow">HIGH SCHOOL EDITION / ENGLISH</div>
<h1>How AI Reads Complex Documents: Doc-Researcher</h1>
<p class="subtitle">Most AI systems only "read" text. Doc-Researcher is a new system that actually understands charts, tables, and layouts like a human expert does.</p>
</header>

<nav class="section-nav reveal">
<a href="#problem">The Problem</a>
<a href="#how">How it Works</a>
<a href="#test">Testing It</a>
<a href="#grad">Go Deeper</a>
</nav>

<section id="problem" class="chapter reveal" data-section>
<h2>The "Wall" for Traditional AI</h2>
<p>Imagine asking an AI to analyze a 50-page financial report or a scientific paper. Most current AIs can grab the text, but they get confused by complex layout diagrams, math equations, or data hidden in tables. They treat everything like a flat block of words, missing the "visual language" of the document.</p>
<div class="callout">The Gap: AI has been "blind" to the visual structure and multimodal data (images + text) inside documents.</div>
</section>

<section id="#how" class="chapter reveal" data-section>
<h2>The Doc-Researcher Solution</h2>
<p>The researchers created a three-step brain for the AI:</p>
<div class="highlight-row">
<div class="highlight-card">
<strong>1. Smart Parsing</strong>
<div>It doesn't just copy text; it sees where every chart and table is, preserving its meaning.</div>
</div>
<div class="highlight-card">
<strong>2. Hybrid Search</strong>
<div>It can search by text description or by visual appearance, picking whichever route best turns up the evidence.</div>
</div>
<div class="highlight-card">
<strong>3. Teamwork Agents</strong>
<div>Instead of one try, it uses several "AI agents" that brainstorm, look for more clues, and combine them into a final answer.</div>
</div>
</div>
</section>

<section id="test" class="chapter reveal" data-section>
<h2>Real-World Results</h2>
<p>The team created a new test called <strong>M4DocBench</strong>. It has 158 very hard questions that require "jumping" between different documents and looking at pictures to find the answer.</p>
<div class="callout">Doc-Researcher got 50.6% accuracy, which is 3.4 times better than previous top-tier AI systems!</div>
</section>

<section id="grad" class="chapter reveal" data-section>
<h2>Curious about the math and logic?</h2>
<p>If you want to see the specific technical architecture and deep data science behind this, check out the Graduate version.</p>
<a href="/papers/https-arxiv-org-abs-2511-13719/grad-en.html" class="btn primary">View Graduate Version (EN)</a>
</section>

<footer class="footer">WAP - Simplified paper insights.</footer>
</div>
<script src="/papers/https-arxiv-org-abs-2511-13719/script.js"></script>
</body>
</html>
67 changes: 67 additions & 0 deletions papers/https-arxiv-org-abs-2511-13719/hs-zh.html
@@ -0,0 +1,67 @@
<!doctype html>
<html lang="zh-CN">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Doc-Researcher (高中版) | WAP</title>
<link rel="stylesheet" href="/papers/https-arxiv-org-abs-2511-13719/styles.css" />
<link href="https://fonts.googleapis.com/css2?family=Noto+Sans+SC:wght@300;400;500;700&display=swap" rel="stylesheet" />
</head>
<body>
<div class="backdrop" aria-hidden="true"></div>
<div class="page">
<header class="hero reveal">
<div class="eyebrow">科普 / 高中版</div>
<h1>AI 如何阅读复杂的文档:Doc-Researcher 详解</h1>
<p class="subtitle">大多数 AI 系统只能“读文字”。Doc-Researcher 却像人类专家一样,能够读懂图表、表格和文档布局。</p>
</header>

<nav class="section-nav reveal">
<a href="#problem">现状与挑战</a>
<a href="#how">它是如何工作的</a>
<a href="#test">测试结果</a>
<a href="#grad">深入研究</a>
</nav>

<section id="problem" class="chapter reveal" data-section>
<h2>传统 AI 的“盲区”</h2>
<p>想象一下,让 AI 分析一份 50 页的财务报告或一篇科学论文。大多数 AI 只能提取其中的文本,但当遇到复杂的结构图、数学公式或隐藏在表格中的数据时,它们就会感到困惑。由于丢失了图片和布局信息,AI 无法从真正专业的文档中获取深层知识。</p>
<div class="callout">关键缺失:AI 以前由于无法“看懂”图片的视觉结构,导致在处理复杂文档时存在巨大盲区。</div>
</section>

<section id="how" class="chapter reveal" data-section>
<h2>Doc-Researcher 的解决方案</h2>
<p>研究人员为 AI 打造了三个关键组件:</p>
<div class="highlight-row">
<div class="highlight-card">
<strong>1. 深度多模态解析</strong>
<div>它不只是复制文字,而是会识别每个图表和表格的位置,保存它们的视觉含义。</div>
</div>
<div class="highlight-card">
<strong>2. 混合式搜索</strong>
<div>它既可以通过文字描述来搜索,也可以通过视觉特征来寻找证据,从而选择最佳路径。</div>
</div>
<div class="highlight-card">
<strong>3. 迭代协作流</strong>
<div>它使用多个“AI 智能体”进行团队协作:有的负责拆解问题,有的负责寻找证据,最后合并成完整答案。</div>
</div>
</div>
</section>

<section id="test" class="chapter reveal" data-section>
<h2>真实表现如何?</h2>
<p>团队创建了一个名为 <strong>M4DocBench</strong> 的新测试,包含 158 个非常困难的问题,这些问题需要 AI 在多个文档之间“跳转”并查看图片才能回答。</p>
<div class="callout">Doc-Researcher 的准确率达到了 50.6%,比之前最先进的 AI 系统提高了 3.4 倍!</div>
</section>

<section id="grad" class="chapter reveal" data-section>
<h2>想要了解更深层的逻辑?</h2>
<p>如果你想了解这背后的具体架构和深层数据科学,请查看研究生版本。</p>
<a href="/papers/https-arxiv-org-abs-2511-13719/grad-zh.html" class="btn primary">查看研究生版本 (中文)</a>
</section>

<footer class="footer">WAP - 让科学论文通俗易懂。</footer>
</div>
<script src="/papers/https-arxiv-org-abs-2511-13719/script.js"></script>
</body>
</html>