@hanlianlu

Description
MinerU 2.7.0+ writes output from the hybrid-auto-engine backend to a hybrid_auto/ subdirectory, but RAG-Anything hardcoded auto/ paths, which caused file-not-found errors. This PR detects the output directory dynamically and updates the backend options to match the MinerU 2.7.1+ CLI specification.

Related Issues
Addresses issue #186

Changes Made
Core Fix: Dynamic Directory Detection

_read_output_files() now scans subdirectories for *_content_list.json marker files instead of hardcoding paths
Automatically detects output in auto/, hybrid_auto/, vlm/, or any future backend directories
Falls back to method-based path construction when the subdirectory scan finds nothing

Backend Mapping Enhancement

Added hybrid-* backend detection to map method="auto" → method="hybrid_auto"
Extends existing vlm-* backend pattern for consistency
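The mapping above could be sketched roughly as follows. This is an illustration, not the code from this PR: the helper name `resolve_output_method` is hypothetical, and the `vlm-*` to `"vlm"` branch is an assumption inferred from the `vlm/` directory name listed earlier. Only the `method="auto"` case for `hybrid-*` backends is described in this PR, so other methods are returned unchanged here.

```python
def resolve_output_method(backend: str, method: str = "auto") -> str:
    """Guess the MinerU output subdirectory name for a backend.

    Hypothetical helper for illustration only.
    """
    # Assumed from the existing vlm-* pattern: VLM backends write to vlm/.
    if backend.startswith("vlm-"):
        return "vlm"
    # From this PR: hybrid-* backends with method="auto" write to hybrid_auto/.
    if backend.startswith("hybrid-") and method == "auto":
        return "hybrid_auto"
    # Everything else (e.g. pipeline) keeps the requested method name.
    return method
```

Because the directory scan takes precedence, a mapping like this would only matter on the fallback path, when no `*_content_list.json` marker file is found.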
Backend Options Update

Updated all backend references to match MinerU 2.7.1+ CLI specification
Replaced outdated backend names (vlm-transformers, vlm-sglang-engine, vlm-sglang-client)
Added current backend options: pipeline, hybrid-auto-engine, hybrid-http-client, vlm-auto-engine, vlm-http-client
Updated CLI argument parser, README.md, README_zh.md, and all vlm_url references
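As a rough sketch of what the updated CLI choices might look like, assuming an `argparse`-based parser: the backend names are the ones listed in this PR, while the flag spellings (`--backend`, `--vlm_url`), the default value, and the function name `build_parser` are assumptions made for illustration.

```python
import argparse

# Backend names as listed in this PR for MinerU 2.7.1+.
MINERU_BACKENDS = [
    "pipeline",
    "hybrid-auto-engine",
    "hybrid-http-client",
    "vlm-auto-engine",
    "vlm-http-client",
]


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser (hypothetical, not the project's actual CLI)."""
    parser = argparse.ArgumentParser(description="Illustrative MinerU parser CLI")
    parser.add_argument(
        "--backend",
        choices=MINERU_BACKENDS,
        default="pipeline",  # assumed default for this sketch
        help="MinerU parsing backend",
    )
    parser.add_argument(
        "--vlm_url",
        default=None,
        help="Server URL, relevant to the *-http-client backends",
    )
    return parser
```

With `choices` restricted this way, an outdated name such as `vlm-sglang-client` is rejected at argument-parsing time rather than failing later inside MinerU.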
Documentation

Added hybrid-auto-engine and other current backends to CLI choices and README backend options
Inline comments document MinerU 2.7.0+ directory naming conventions

Before (fails with hybrid-auto-engine)

file_stem_subdir = output_dir / file_stem
if file_stem_subdir.exists():
    md_file = file_stem_subdir / method / f"{file_stem}.md"  # Assumes method matches directory name

After (works with any backend)

if file_stem_subdir.exists():
    found = False
    for subdir in file_stem_subdir.iterdir():
        if (subdir / f"{file_stem}_content_list.json").exists():
            md_file = subdir / f"{file_stem}.md"  # Uses actual directory
            found = True
            break
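The excerpt above omits the fallback branch. A self-contained sketch of the scan plus fallback, under stated assumptions, might look like this. The function name `find_markdown_output` and its signature are hypothetical; the actual `_read_output_files()` may be structured differently.

```python
from pathlib import Path
from typing import Optional


def find_markdown_output(
    output_dir: Path, file_stem: str, method: str = "auto"
) -> Optional[Path]:
    """Locate the expected Markdown path for a parsed file (hypothetical helper)."""
    file_stem_subdir = output_dir / file_stem
    if not file_stem_subdir.is_dir():
        return None

    # Scan backend subdirectories (auto/, hybrid_auto/, vlm/, ...) for the
    # *_content_list.json marker file instead of assuming a directory name.
    for subdir in sorted(file_stem_subdir.iterdir()):
        if subdir.is_dir() and (subdir / f"{file_stem}_content_list.json").exists():
            return subdir / f"{file_stem}.md"

    # Fallback: build the path from the method name, as before the fix.
    return file_stem_subdir / method / f"{file_stem}.md"
```

Sorting the directory listing is a choice made for this sketch so the result is deterministic if more than one backend directory contains a marker file.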
Checklist
Changes tested locally
Code reviewed
Documentation updated (if necessary)
Unit tests added (if applicable)

Copilot AI and others added 12 commits January 8, 2026 20:40
- Modified _read_output_files() to scan subdirectories for actual output
- Added backend mapping for hybrid-* backends to use hybrid_auto method
- Maintains backward compatibility with pipeline, vlm, and other backends
- Successfully tested with pipeline, hybrid_auto, vlm, and custom backends

Co-authored-by: hanlianlu <7419836+hanlianlu@users.noreply.github.com>
- Updated README.md and README_zh.md with hybrid-auto-engine backend option
- Added hybrid-auto-engine to CLI argument choices
- Enhanced code comments explaining backend-to-directory mapping
- Documents MinerU 2.7.0+ backend directory naming conventions

Co-authored-by: hanlianlu <7419836+hanlianlu@users.noreply.github.com>
- Replace outdated backend names (vlm-transformers, vlm-sglang-engine, vlm-sglang-client)
- Add current MinerU 2.7.1+ backend options (hybrid-http-client, vlm-auto-engine, vlm-http-client)
- Update vlm_url references to use vlm-http-client instead of vlm-sglang-client
- Changes made in parser.py, README.md, and README_zh.md

Co-authored-by: hanlianlu <7419836+hanlianlu@users.noreply.github.com>
…-issue

Fix MinerU hybrid-auto-engine output directory detection and update backend options
Update backend choices to match MinerU 2.7.1 CLI specification
Co-authored-by: hanlianlu <7419836+hanlianlu@users.noreply.github.com>