GRADUATE EDITION / ENGLISH
Doc-Researcher: Overcoming the Multimodal Processing Bottleneck
A technical deep-dive into deep multimodal parsing, adaptive retrieval, and agentic evidence synthesis.
Motivation & Problem Statement
Current "Deep Research" systems built on LLMs are largely restricted to scraping text from the web. In professional and scientific domains, however, knowledge is concentrated in densely structured multimodal documents (PDFs and scans). Standard RAG (Retrieval-Augmented Generation) pipelines fail here because they "flatten" this structure, losing vital visual semantics such as the relationship between a chart's axes or the hierarchical context of a table.
I. Deep Multimodal Parsing
Doc-Researcher employs a parsing engine that preserves multimodal integrity. It creates multi-granular representations:
Chunk-level: Captures local context including equations and inline symbols.
Block-level: Respects logical visual boundaries (e.g., a specific figure with its caption).
Document-level: Maintains layout hierarchy and global semantics.
Key Innovation: The system maps visual elements to text descriptions while keeping the original pixel features for vision-centric retrieval.
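
Below is a minimal sketch of how such multi-granular representations might be organized. All names here (Chunk, Block, ParsedDocument, pixel_features) are illustrative assumptions, not the paper's actual data model.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Chunk:
    """Chunk-level: local context such as a paragraph, an equation, or inline symbols."""
    text: str
    page: int

@dataclass
class Block:
    """Block-level: a logical visual unit, e.g. a figure together with its caption."""
    kind: str                        # "figure" | "table" | "text" (assumed taxonomy)
    caption: Optional[str]           # text description mapped from the visual element
    pixel_features: Optional[bytes]  # original pixels kept for vision-centric retrieval
    chunks: list[Chunk] = field(default_factory=list)

@dataclass
class ParsedDocument:
    """Document-level: layout hierarchy and global semantics."""
    title: str
    blocks: list[Block] = field(default_factory=list)

    def all_chunks(self) -> list[Chunk]:
        # Fine-grained view used for chunk-level retrieval.
        return [c for b in self.blocks for c in b.chunks]
```

Note how a Block carries both a text description and the raw pixels, mirroring the key innovation above: text for semantic matching, pixels for vision-centric retrieval.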
II. Systematic Hybrid Retrieval
The system uses a retrieval architecture that supports three paradigms:
Text-only: Standard semantic search on text chunks.
Vision-only: Directly retrieving document segments based on visual similarity.
Hybrid: Combining text and vision signals with dynamic granularity selection, choosing between fine-grained chunks and broader document context based on query ambiguity.
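
A minimal sketch of what hybrid retrieval with dynamic granularity selection could look like, reusing the Chunk/Block sketch from Section I. The toy embedding functions, the fusion weight alpha, and the query-length ambiguity heuristic are all illustrative assumptions, not the system's actual scoring.

```python
import numpy as np

# Toy stand-ins for whatever text/vision embedding models the real
# system uses: hashed bag-of-words and hashed byte histograms.
def embed_text(s: str) -> np.ndarray:
    v = np.zeros(64)
    for tok in s.lower().split():
        v[hash(tok) % 64] += 1.0
    return v

def embed_image(pixels: bytes) -> np.ndarray:
    v = np.zeros(64)
    for byte in pixels[:1024]:
        v[byte % 64] += 1.0
    return v

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def hybrid_search(query: str, chunks, blocks, alpha: float = 0.5, k: int = 5):
    """Fuse text and vision similarity, picking granularity by query ambiguity."""
    q = embed_text(query)
    # Crude heuristic (assumption): short, underspecified queries retrieve
    # broader block-level context; specific queries retrieve fine-grained chunks.
    if len(query.split()) < 6:
        scored = [
            (alpha * cosine(q, embed_text(b.caption or ""))
             + (1 - alpha) * cosine(q, embed_image(b.pixel_features)), b)
            for b in blocks if b.pixel_features is not None
        ]
    else:
        scored = [(cosine(q, embed_text(c.text)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]
```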
III. Iterative Multi-Agent Workflows
Unlike single-pass retrieval, Doc-Researcher uses an agentic loop:
Planner: Decomposes complex, multi-hop queries into sub-tasks.
Searcher: Executes the hybrid retrieval to find candidates.
Refiner: Evaluates retrieved evidence and decides if more searching is needed (iterative accumulation).
Synthesizer: Integrates multimodal evidence to form a final, cited answer.
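
A minimal sketch of this agentic loop under stated assumptions: `llm` is a stand-in for whatever chat-completion client backs each agent, hybrid_search is the sketch from Section II, and the prompts and stopping rule are illustrative, not the paper's.

```python
def deep_research(question: str, corpus, llm, max_rounds: int = 4) -> str:
    """Iterative loop: Planner -> Searcher -> Refiner -> (repeat) -> Synthesizer."""
    # Planner: decompose the complex, multi-hop query into sub-tasks.
    sub_tasks = llm(f"Decompose into sub-questions:\n{question}").splitlines()
    evidence = []
    for _ in range(max_rounds):
        # Searcher: run hybrid retrieval for each open sub-task.
        for task in sub_tasks:
            evidence.extend(hybrid_search(task, corpus.chunks, corpus.blocks))
        # Refiner: judge whether the evidence suffices; if not, emit follow-ups.
        verdict = llm(
            f"Question: {question}\nEvidence so far: {evidence}\n"
            "Reply DONE if sufficient, else list the missing sub-questions."
        )
        if verdict.strip() == "DONE":
            break
        sub_tasks = verdict.splitlines()  # iterative accumulation of new leads
    # Synthesizer: integrate the multimodal evidence into a final, cited answer.
    return llm(f"Write a cited answer.\nQuestion: {question}\nEvidence: {evidence}")
```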
M4DocBench & Evaluation
To evaluate these capabilities, the authors introduced M4DocBench (Multi-modal, Multi-hop, Multi-document, and Multi-turn). It consists of 158 expert-level questions spanning 304 documents. This benchmark requires the model to "connect the dots" across multiple files and modalities.
Experimental Outcomes
Direct Comparison

50.6% accuracy vs. ~15% for state-of-the-art baselines (a 3.4x improvement).
Ablation

Removing the "Visual Semantics" component caused the largest performance drop, indicating that preserving visual layout matters.

Most AI systems only "read" text. Doc-Researcher is a new system that actually understands charts, tables, and layouts the way a human expert does.
The "Wall" for Traditional AI
Imagine asking an AI to analyze a 50-page financial report or a scientific paper. Most current AIs can grab the text, but they get confused by complex layouts, diagrams, math equations, and data hidden in tables. They treat everything like a flat block of words, missing the "visual language" of the document.
The Gap: AI has been "blind" to the visual structure and multimodal data (images + text) inside documents.
The Doc-Researcher Solution
The researchers created a three-step brain for the AI:
1. Smart Parsing

It doesn't just copy text; it sees where every chart and table is, preserving their meaning.

2. Hybrid Search

It can look for things by text description or by visual appearance, picking the best way to find evidence.

3. Teamwork Agents

Instead of one try, it uses several "AI agents" that brainstorm, look for more clues, and combine them into a final answer.
Real-World Results
The team created a new test called M4DocBench. It has 158 very hard questions that require "jumping" between different documents and looking at pictures to find the answer.
Doc-Researcher got 50.6% accuracy, which is 3.4 times better than previous top-tier AI systems!
Curious about the math and logic?
If you want to see the specific technical architecture and deep data science behind this, check out the Graduate version.

Doc-Researcher: A Unified System for Multimodal Document Parsing and Deep Research

A groundbreaking system that solves complex research queries by deeply parsing multimodal documents (figures, tables, charts) and using iterative agent workflows.