GRADUATE EDITION / ENGLISH
Doc-Researcher: Overcoming the Multimodal Processing Bottleneck
A technical deep-dive into deep multimodal parsing, adaptive retrieval, and agentic evidence synthesis.
Motivation & Problem Statement
Current "Deep Research" systems (based on LLMs) are largely restricted to text-based web scraping. In professional and scientific domains, knowledge is dense in highly structured multimodal documents (PDFs/Scans). Standard RAG (Retrieval-Augmented Generation) pipelines fail here because they often "flatten" the structure, losing vital visual semantics like the relationship between a chart's axes or the hierarchical context of a table.
I. Deep Multimodal Parsing
Doc-Researcher employs a parsing engine that preserves multimodal integrity. It creates multi-granular representations:
Chunk-level: Captures local context including equations and inline symbols.

Block-level: Respects logical visual boundaries (e.g., a specific figure with its caption).

Document-level: Maintains layout hierarchy and global semantics.
Key Innovation: The system maps visual elements to text descriptions while keeping the original pixel features for vision-centric retrieval.
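
To make the three granularities concrete, here is a minimal sketch of how such multi-granular representations could be organized. The class and field names (Chunk, Block, Document, pixels) are illustrative assumptions, not the authors' actual API; the point is that each block keeps both a text description and the original pixel features, as described above.

# Illustrative sketch only; names are assumptions, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Chunk-level: local text context (sentences, equations, inline symbols)."""
    text: str
    page: int
    embedding: list[float] | None = None  # text embedding for semantic search

@dataclass
class Block:
    """Block-level: a logical visual unit, e.g. a figure plus its caption."""
    kind: str                    # "figure" | "table" | "paragraph" | ...
    caption: str                 # text description mapped from the visual element
    pixels: bytes | None = None  # original image crop kept for vision retrieval
    chunks: list[Chunk] = field(default_factory=list)

@dataclass
class Document:
    """Document-level: layout hierarchy and global semantics."""
    doc_id: str
    blocks: list[Block] = field(default_factory=list)

    def all_chunks(self) -> list[Chunk]:
        return [c for b in self.blocks for c in b.chunks]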
II. Systematic Hybrid Retrieval
The retrieval layer supports three paradigms:

Text-only: Standard semantic search on text chunks.

Vision-only: Directly retrieving document segments based on visual similarity.

Hybrid: Combining text and vision signals with dynamic granularity selection, choosing between fine-grained chunks or broader document context based on query ambiguity.
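
A hedged sketch of what hybrid scoring with dynamic granularity selection could look like. The fusion weight alpha, the cosine scoring, and the binary ambiguous flag are assumptions for illustration; the paper's actual scoring functions and granularity policy may differ.

# Minimal sketch under the assumptions stated above.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def hybrid_retrieve(query_text_emb, query_vis_emb, corpus,
                    alpha=0.5, k=5, ambiguous=False):
    """Fuse text and vision similarity; fall back to coarser document-level
    candidates when the query is ambiguous, finer chunk-level otherwise."""
    granularity = "document" if ambiguous else "chunk"
    candidates = [item for item in corpus if item["granularity"] == granularity]
    scored = []
    for item in candidates:
        s_text = cosine(query_text_emb, item["text_emb"])
        s_vis = cosine(query_vis_emb, item["vis_emb"])
        scored.append((alpha * s_text + (1 - alpha) * s_vis, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]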
III. Iterative Multi-Agent Workflows
Unlike single-pass retrieval, Doc-Researcher uses an agentic loop:

Planner: Decomposes complex, multi-hop queries into sub-tasks.

Searcher: Executes the hybrid retrieval to find candidates.

Refiner: Evaluates retrieved evidence and decides if more searching is needed (iterative accumulation).

Synthesizer: Integrates multimodal evidence to form a final, cited answer.
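
The loop itself is easy to sketch. The agent interfaces below (decompose, retrieve, evaluate, answer) are hypothetical; only the Planner, Searcher, Refiner, Synthesizer control flow mirrors the description above.

# Control-flow sketch; agent objects and method names are assumptions.
def doc_research(query, planner, searcher, refiner, synthesizer, max_rounds=4):
    evidence = []
    sub_tasks = planner.decompose(query)               # multi-hop query -> sub-tasks
    for _ in range(max_rounds):
        for task in sub_tasks:
            evidence.extend(searcher.retrieve(task))   # hybrid retrieval per sub-task
        verdict = refiner.evaluate(query, evidence)    # is the evidence sufficient?
        if verdict.sufficient:
            break
        sub_tasks = verdict.follow_up_tasks            # iterative accumulation
    return synthesizer.answer(query, evidence)         # final, cited multimodal answer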
M4DocBench & Evaluation
To evaluate these capabilities, the authors introduced M4DocBench (Multi-modal, Multi-hop, Multi-document, and Multi-turn). It consists of 158 expert-level questions spanning 304 documents. This benchmark requires the model to "connect the dots" across multiple files and modalities.
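
For intuition, here is what a single M4DocBench-style item might look like. The field names and the sample question are hypothetical; only the headline numbers (158 questions, 304 documents) and the four "M"s come from the paper.

# Hypothetical item schema, for illustration only.
example_item = {
    "question": "Across the two annual reports, which segment's revenue trend "
                "matches the growth chart in the strategy deck?",
    "documents": ["report_2022.pdf", "report_2023.pdf", "strategy_deck.pdf"],
    "hops": 3,                          # multi-hop: evidence spans several files
    "modalities": ["table", "chart", "text"],
    "turn": 1,                          # multi-turn: follow-ups build on earlier answers
    "answer": "...",                    # gold answer with evidence citations
}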
Experimental Outcomes
Direct Comparison
50.6% accuracy vs. ~15% for state-of-the-art baselines (a 3.4x improvement).

Ablation
Removing the "Visual Semantics" component caused the largest performance drop, showing that preserving layout and visual structure is what drives the gains.
Doc-Researcher: A Unified System for Multimodal Document Parsing and Deep Research

A groundbreaking system that solves complex research queries by deeply parsing multimodal documents (figures, tables, charts) and using iterative agent workflows.

Most AI systems only "read" text. Doc-Researcher is a new system that actually understands charts, tables, and layouts the way a human expert does.
The "Wall" for Traditional AI
Imagine asking an AI to analyze a 50-page financial report or a scientific paper. Most current AIs can grab the text, but they get confused by complex layout diagrams, math equations, or data hidden in tables. They treat everything like a flat block of words, missing the "visual language" of the document.

The Gap: AI has been "blind" to the visual structure and multimodal data (images + text) inside documents.
The Doc-Researcher Solution
The researchers created a three-step brain for the AI:

1. Smart Parsing
It doesn't just copy text; it sees where every chart and table is, preserving its meaning.

2. Hybrid Search
It can look for things by text descriptions or by visual appearance, picking the best way to find evidence.

3. Teamwork Agents
Instead of one try, it uses several "AI agents" that brainstorm, look for more clues, and combine them into a final answer.
Real-World Results
The team created a new test called M4DocBench. It has 158 very hard questions that require "jumping" between different documents and looking at pictures to find the answer.

Doc-Researcher got 50.6% accuracy, which is 3.4 times better than previous top-tier AI systems!
Curious about the math and logic?
If you want to see the specific technical architecture and deep data science behind this, check out the Graduate version.