GML-2011 Pymupdf4llm integration #18

prinskumar-tigergraph · 2025-11-19T06:46:00Z

PR Type

Enhancement, Documentation, Other

Description

Graph lifecycle and ingestion REST APIs
ECC rebuild tracking and status endpoint
PDF extraction via pymupdf4llm with tg:// images
UI setup, uploads, and cloud downloads

Diagram Walkthrough

flowchart LR
  UI["UI (Setup, Uploads)"]
  APIRouter["UI Router (graph, ingest, uploads, cloud)"]
  ECC["ECC Service (task tracking, status)"]
  PDF["pymupdf4llm PDF → Markdown"]
  MD["MarkdownProcessor (describe/replace to tg://)"]
  Agent["agent_graph (tg:// → /ui/image_vertex)"]

  UI -- "calls" --> APIRouter
  APIRouter -- "rebuild request" --> ECC
  APIRouter -- "create/init/ingest" --> "TigerGraph"
  PDF -- "markdown + images" --> MD
  MD -- "markdown with tg:// refs" --> Agent
  Agent -- "serve images" --> "/ui/image_vertex"

File Walkthrough

Relevant files

Enhancement

13 files

ui.py `Add graph APIs, ingestion, uploads, and cloud download`	+746/-6
text_extractors.py Use pymupdf4llm; emit `tg://` image references	+128/-118
image_data_extractor.py `Path-based LLM vision; remove legacy saver`	+31/-131
markdown_parsing.py `New Markdown image parser and replacer`	+63/-0
main.py `Track rebuild tasks; add rebuild status endpoint`	+81/-6
supportai.py `Rename forceupdate handler; add create graph API`	+45/-2
agent_graph.py Convert `tg://` image links to UI endpoint	+14/-9
inquiryai.py Convert `tg://` image links in responses	+5/-5
ModeToggle.tsx `Add Setup button and logout confirmation dialog`	+42/-17
Bot.tsx `Refresh graph list on focus/navigation`	+29/-3
confirm-dialog.tsx `New reusable confirmation dialog component`	+64/-0
useConfirm.tsx `New hook to request user confirmation`	+44/-0
main.tsx Register `/setup` route and layout integration	+5/-0

Documentation

6 files

workers.py `Update comments for markdown image handling`	+1/-1
single_chunker.py `Clarify preserving markdown image references`	+2/-2
ecc_util.py `Note single chunker preserves image references`	+1/-1
README.md `Update OpenAI/Bedrock config and prompts path`	+30/-21
chatbot_response.txt `Instruct preserving markdown image references`	+1/-1
pymupdf4llm-AGPL-3.0.txt `Add pymupdf4llm AGPL-3.0 license file`	+661/-0

Configuration changes

2 files

vite.config.ts Proxy `/{graph}/graphrag/*` to backend	+2/-1
nginx.conf Increase upload size and timeouts for `/ui`	+7/-2

Dependencies

1 files

requirements.txt `Replace PyMuPDF with pymupdf4llm; add multipart`	+3/-1

Additional files

5 files

chatbot_response.txt	+1/-1
docker-compose.yml	+0/-1
nginx.conf	+0/-32
nginx.conf	+0/-32
Setup.tsx	+1862/-0

tg-pr-agent · 2025-11-19T06:48:47Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 Security concerns Path traversal: Upload and deletion endpoints in graphrag/app/routers/ui.py use user-provided filenames directly in filesystem paths (e.g., os.path.join(upload_dir, file.filename) and os.path.join(upload_dir, filename)). This allows attackers to traverse directories or overwrite arbitrary files. Sanitize via os.path.basename, restrict allowed characters, and reject paths containing path separators. Potential command injection in GSQL: Endpoints constructing GSQL with f"CREATE GRAPH {graphname}()" accept unvalidated path params. Validate graph names against a strict regex (e.g., ^[A-Za-z_][A-Za-z0-9_]*$) or use APIs that avoid string interpolation. Sensitive information exposure in logs: Cloud download handlers may log exceptions that include credentials (e.g., gcs_credentials_json parsing errors, SDK ClientError). Ensure logs redact secrets and remove full trace logs containing credentials (logger.debug_pii). Minor info exposure: init_graph returns host_name which might reveal internal topology. Consider omitting or gating behind debug.
⚡ Recommended focus areas for review Possible Issue Missing import for httpx in the rebuild status endpoint will cause a NameError at runtime when calling the ECC status URL. response = httpx.get( ecc_status_url, headers={"Authorization": f"Basic {auth}"}, timeout=10.0 ) Security Unsanitized filenames in upload and delete endpoints allow path traversal (using user-supplied filenames directly in os.path.join). Sanitize with os.path.basename, validate allowed characters, and reject paths containing separators. Review uploads, single-file delete, and bulk delete handlers. existing_files = [] for file in files: file_path = os.path.join(upload_dir, file.filename) if os.path.exists(file_path): existing_files.append(file.filename) if existing_files: return { "status": "conflict", "message": "Some files already exist. Set overwrite=true to replace them.", "existing_files": existing_files, } # Save uploaded files uploaded_files = [] total_size = 0 for file in files: file_path = os.path.join(upload_dir, file.filename) # Write file to disk with open(file_path, "wb") as f: content = await file.read() f.write(content) file_size = len(content) total_size += file_size uploaded_files.append({ "filename": file.filename, "size": file_size, "path": file_path, }) logger.info(f"Uploaded file {file.filename} ({file_size} bytes) for graph {graphname}") return { "status": "success", "message": f"Successfully uploaded {len(uploaded_files)} file(s)", "graphname": graphname, "uploaded_files": uploaded_files, "total_size": total_size, } except Exception as e: exc = traceback.format_exc() logger.error(f"Error uploading files for graph {graphname}: {e}") logger.debug_pii(f"Upload error trace:\n{exc}") raise HTTPException(status_code=500, detail=f"Error uploading files: {str(e)}") @router.delete(route_prefix + "/{graphname}/uploads") async def clear_uploaded_files( graphname: str, creds: Annotated[tuple[list[str], HTTPBasicCredentials], Depends(ui_basic_auth)], filename: str \| None = None, ): """ Clear uploaded files for a specific graphname. Parameters: - graphname: The graph name whose files to clear - filename: If provided, only delete this specific file. Otherwise, delete all files. """ try: upload_dir = os.path.join("uploads", graphname) if not os.path.exists(upload_dir): return { "status": "success", "message": f"No files found for graph {graphname}", "deleted_files": [], } deleted_files = [] if filename: # Delete specific file file_path = os.path.join(upload_dir, filename) if os.path.exists(file_path) and os.path.isfile(file_path): os.remove(file_path) deleted_files.append(filename) logger.info(f"Deleted file {filename} for graph {graphname}") else: raise HTTPException(status_code=404, detail=f"File {filename} not found") else: Possible Issue Small image filter currently does nothing (pass) and continues processing, likely unintended. Either remove the check or explicitly skip/return to avoid processing tiny images. if pil_image.width < 100 or pil_image.height < 100: pass description = describe_image_with_llm(str(Path(file_path).absolute())) description_lower = description.lower()

…df4llm integration

…ead-safe)

…ctors.py

graphrag/app/supportai/supportai.py

chengbiao-jin · 2025-12-04T23:39:12Z

graphrag-ui/src/pages/Setup.tsx

-                  <CloudLightning className="h-4 w-4 mr-2" />
-                  Use Amazon BDA
+                <TabsTrigger value="s3">
+                  <CloudCog className="h-4 w-4 mr-2" />


@prinskumar-tigergraph can you please double check if your branch is based on latest main?

prinskumar-tigergraph force-pushed the pymupdf4llm_integration branch from e7c7554 to 7c602b1 Compare November 20, 2025 16:40

prins-agivant added 7 commits November 26, 2025 20:39

Add S3 Bedrock BDA ingestion support with user confirmation and pymup…

06e5ec0

…df4llm integration

Update README for OpenAI and Bedrock config, add pymupdf4llm license

788fe2a

Update README for OpenAI and Bedrock config, add pymupdf4llm license

58a86d1

Fix prompt_path to use ./common/prompts/ for OpenAI and Bedrock

c20aff8

bug fixes

5a0f87c

Fix PDF extractions

3bfe5c1

Fix PDF extraction threading issue: add lock for pymupdf4llm (not thr…

a660bb7

…ead-safe)

prinskumar-tigergraph force-pushed the pymupdf4llm_integration branch from c2d8495 to a660bb7 Compare November 26, 2025 15:11

prins-agivant added 2 commits December 1, 2025 17:08

Merge latest main and consolidate markdown_parsing.py into text_extra…

214b2e1

…ctors.py

Merge latest main and consolidate markdown_parsing.py into text_extra…

dd57724

…ctors.py

prinskumar-tigergraph changed the title ~~GML-2011: Pymupdf4llm integration~~ GML-2011 Pymupdf4llm integration Dec 1, 2025

chengbiao-jin reviewed Dec 1, 2025

View reviewed changes

graphrag/app/supportai/supportai.py Outdated Show resolved Hide resolved

Change folder_path to data_path for consistency

ac2f0c5

chengbiao-jin approved these changes Dec 3, 2025

View reviewed changes

chengbiao-jin reviewed Dec 4, 2025

View reviewed changes

chengbiao-jin closed this Dec 5, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

GML-2011 Pymupdf4llm integration #18

GML-2011 Pymupdf4llm integration #18

prinskumar-tigergraph commented Nov 19, 2025 •

edited by tg-pr-agent bot

Loading

Uh oh!

tg-pr-agent bot commented Nov 19, 2025

Uh oh!

Uh oh!

chengbiao-jin Dec 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

GML-2011 Pymupdf4llm integration #18

GML-2011 Pymupdf4llm integration #18

Conversation

prinskumar-tigergraph commented Nov 19, 2025 • edited by tg-pr-agent bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Type

Description

Diagram Walkthrough

File Walkthrough

Uh oh!

tg-pr-agent bot commented Nov 19, 2025

PR Reviewer Guide 🔍

Uh oh!

Uh oh!

chengbiao-jin Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

prinskumar-tigergraph commented Nov 19, 2025 •

edited by tg-pr-agent bot

Loading