Fix #186: MinerU 2.7.0+ caused filepath errors with new default backend #188
+46
−12
Description
MinerU 2.7.0+ writes output from the hybrid-auto-engine backend to a hybrid_auto/ subdirectory, but RAG-Anything hardcoded auto/ paths, which caused file-not-found errors. This PR also updates the backend options to match the MinerU 2.7.1+ CLI specification.
Related Issues
Addresses issue #186
Changes Made
Core Fix: Dynamic Directory Detection
_read_output_files() now scans subdirectories for *_content_list.json marker files instead of hardcoding paths
Automatically detects output in auto/, hybrid_auto/, vlm/, or any future backend directories
Falls back to method-based path construction when subdirectory scan finds nothing
Backend Mapping Enhancement
Added hybrid-* backend detection to map method="auto" → method="hybrid_auto"
Extends existing vlm-* backend pattern for consistency
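The mapping described above can be sketched roughly as follows. The function name `resolve_output_method` is hypothetical and not taken from the RAG-Anything code base, and the `vlm-*` branch returning `"vlm"` is an assumption based on the `vlm/` directory mentioned earlier; the actual implementation may differ.

```python
def resolve_output_method(backend: str, method: str = "auto") -> str:
    """Return the output subdirectory name expected for a backend.

    Illustrative only: the name and exact rules are assumptions,
    not the project's real API.
    """
    if backend.startswith("vlm-"):
        # Assumption: VLM backends write to vlm/ regardless of method.
        return "vlm"
    if backend.startswith("hybrid-") and method == "auto":
        # MinerU 2.7.0+ uses hybrid_auto/ for hybrid backends.
        return "hybrid_auto"
    # All other combinations keep the method name as the directory name.
    return method
```

With this sketch, `resolve_output_method("hybrid-auto-engine")` yields `"hybrid_auto"`, while `resolve_output_method("pipeline")` still yields `"auto"`.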
Backend Options Update
Updated all backend references to match MinerU 2.7.1+ CLI specification
Replaced outdated backend names (vlm-transformers, vlm-sglang-engine, vlm-sglang-client)
Added current backend options: pipeline, hybrid-auto-engine, hybrid-http-client, vlm-auto-engine, vlm-http-client
Updated CLI argument parser, README.md, README_zh.md, and all vlm_url references
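As a rough illustration of the updated parser, the current backend names could be exposed as `argparse` choices. The flag name `--backend`, the `build_parser` helper, and the absence of a default are assumptions made for this sketch; only the list of backend names comes from the description above.

```python
import argparse

# Backend names listed in this PR for MinerU 2.7.1+.
BACKEND_CHOICES = [
    "pipeline",
    "hybrid-auto-engine",
    "hybrid-http-client",
    "vlm-auto-engine",
    "vlm-http-client",
]


def build_parser() -> argparse.ArgumentParser:
    """Build a minimal parser (hypothetical helper, not the project's own)."""
    parser = argparse.ArgumentParser(description="Illustrative backend selection")
    parser.add_argument(
        "--backend",
        choices=BACKEND_CHOICES,
        default=None,  # Assumption: leave the default to MinerU itself.
        help="MinerU parsing backend",
    )
    return parser
```

Passing one of the removed names, such as `vlm-transformers`, makes `argparse` exit with a usage error in this sketch.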
Documentation
Added hybrid-auto-engine and other current backends to CLI choices and README backend options
Inline comments document MinerU 2.7.0+ directory naming conventions
Before (fails with hybrid-auto-engine)
file_stem_subdir = output_dir / file_stem
if file_stem_subdir.exists():
    md_file = file_stem_subdir / method / f"{file_stem}.md"  # Assumes method matches directory name
After (works with any backend)
if file_stem_subdir.exists():
    found = False
    for subdir in file_stem_subdir.iterdir():
        if (subdir / f"{file_stem}_content_list.json").exists():
            md_file = subdir / f"{file_stem}.md"  # Uses actual directory
            found = True
            break
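The excerpt does not show the fallback to method-based path construction. A self-contained sketch of the scan plus fallback might look like the following; the function name `locate_markdown` and its signature are hypothetical, and the `is_dir()` guard and sorted iteration are additions for this example rather than part of the PR.

```python
from pathlib import Path
from typing import Optional


def locate_markdown(output_dir: Path, file_stem: str, method: str = "auto") -> Optional[Path]:
    """Find the Markdown output for file_stem under output_dir (illustrative)."""
    file_stem_subdir = output_dir / file_stem
    if not file_stem_subdir.exists():
        return None

    # Scan backend subdirectories (auto/, hybrid_auto/, vlm/, ...) for the marker file.
    for subdir in sorted(file_stem_subdir.iterdir()):
        marker = subdir / f"{file_stem}_content_list.json"
        if subdir.is_dir() and marker.exists():
            return subdir / f"{file_stem}.md"

    # Fallback: build the path from the method name, as the old code did.
    return file_stem_subdir / method / f"{file_stem}.md"
```

Sorting the directory listing keeps the result deterministic if more than one backend directory happens to contain a marker file.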
Checklist
Changes tested locally
Code reviewed
Documentation updated (if necessary)
Unit tests added (if applicable)