Skip to content

Conversation

@prinskumar-tigergraph
Copy link
Collaborator

@prinskumar-tigergraph prinskumar-tigergraph commented Nov 24, 2025

User description

Added Local temp folder to support processing files so that it wont run out of memory
Ui changes accordingly


PR Type

Enhancement, Documentation, Bug fix


Description

  • Temp storage for server ingestion review

  • UI to review/delete, optional direct ingest

  • PDF extraction via pymupdf4llm with captions

  • Add S3/Bedrock ingest config and docs


Diagram Walkthrough

flowchart LR
  UI["Setup UI (Upload/Cloud/S3)"]
  API["New ingestion temp endpoints"]
  Temp["uploads/ingestion_temp/<graph>/<session>/"]
  Ingest["Ingest reads JSONs and loads"]
  TG["TigerGraph"]
  PDF["PDF file"]
  MuPDF["pymupdf4llm to_markdown()"]
  MD["Markdown with tg:// images"]
  LLM["Describe images via LLM"]

  UI -- "create_ingest" --> API
  API -- "save docs as JSON" --> Temp
  UI -- "review/delete" --> API
  UI -- "Run final ingest" --> Ingest
  Ingest --> TG

  PDF -- "convert" --> MuPDF
  MuPDF --> MD
  MD -- "insert captions" --> LLM
  LLM --> MD
Loading

File Walkthrough

Relevant files
Enhancement
6 files
text_extractors.py
Switch PDF extraction to pymupdf4llm with temp folders     
+147/-126
supportai.py
Save server-processed docs to temp JSON and ingest             
+73/-15 
ui.py
Add endpoints to list and delete ingestion temp files       
+137/-0 
image_data_extractor.py
Simplify image description API; remove legacy saver           
+31/-132
markdown_parsing.py
New MarkdownProcessor for image refs and captions               
+63/-0   
Setup.tsx
UI review workflow, direct ingest toggle, S3 Bedrock         
+502/-214
Documentation
2 files
pymupdf4llm-AGPL-3.0.txt
Add AGPL-3.0 license for pymupdf4llm dependency                   
+661/-0 
README.md
Update setup, provider configs, and prompt paths                 
+43/-35 
Dependencies
1 files
requirements.txt
Add pymupdf4llm, remove direct PyMuPDF pin                             
+2/-1     

@prinskumar-tigergraph prinskumar-tigergraph changed the title Local temp files GML-2011: Local temp files to process files Nov 24, 2025
@tg-pr-agent tg-pr-agent bot changed the title GML-2011: Local temp files to process files Local temp files Nov 24, 2025
@tg-pr-agent
Copy link

tg-pr-agent bot commented Nov 24, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 Security concerns

Directory traversal:
The ingestion temp delete endpoint constructs a file path using an unvalidated 'filename' and 'session_id' (os.path.join(session_dir, filename)) and then deletes it if it exists. Attackers can supply values like '../../../...' to delete files outside the intended folder. Mitigations: resolve to absolute paths (os.path.abspath), verify with os.path.commonpath that the target path is within the session_dir, reject any filename containing path separators, and optionally whitelist allowed filename patterns (e.g., r'^[\w-.]+.json$').

⚡ Recommended focus areas for review

Path Traversal Risk

The delete endpoint joins user-supplied 'filename' with the session directory without validating or normalizing the path. This can allow directory traversal and deletion outside the intended session folder. Add path normalization and enforce that the resolved path stays within the session directory; also reject filenames containing path separators.

if filename:
    # Delete specific file
    file_path = os.path.join(session_dir, filename)
    if os.path.exists(file_path) and os.path.isfile(file_path):
        os.remove(file_path)
        deleted_files.append(filename)
        logger.info(f"Deleted temp file {filename} from session {session_id}")

        # If session folder is now empty, remove it
        if not os.listdir(session_dir):
            os.rmdir(session_dir)
            logger.info(f"Removed empty session folder {session_id}")
    else:
        raise HTTPException(status_code=404, detail=f"File {filename} not found")
Possible Issue

Small image filter no longer works as intended. The 'if pil_image.width < 100 or pil_image.height < 100:' branch just 'pass'es, so small images are still processed instead of being skipped, changing prior behavior and potentially adding noise and cost.

if pil_image.width < 100 or pil_image.height < 100:
    pass

description = describe_image_with_llm(str(Path(file_path).absolute()))
description_lower = description.lower()
logo_indicators = ['logo:', 'icon:', 'logo', 'icon', 'branding',
Doc Type Classification

Defaulting to 'markdown' for unknown file extensions may misclassify unsupported files (e.g., .json, .bin) and cause downstream processing issues. Consider handling '.md' explicitly and returning an empty string or a safe default for unsupported extensions.

if extension in ['.html', '.htm']:
    return 'html'
elif extension in ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.webp']:
    return 'image'
else:
    return 'markdown'

@chengbiao-jin
Copy link
Collaborator

@prinskumar-tigergraph please rebase to latest main.

@prinskumar-tigergraph prinskumar-tigergraph changed the title Local temp files GML-2011 Local temp files Dec 1, 2025
for idx, doc_data in enumerate(doc_entries):
# Use file_counter for unique naming across all files
counter_val = next(file_counter) if file_counter else idx
doc_filename = f"doc_{counter_val}_{doc_data.get('doc_id', 'unknown')}.json"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given we're already write text/image to separate json files, we can consider using the old jsonl format to include everything in a single jsonl file, which can be loaded directly with conn.runLoadingJobWithFile() without needing extra processing. @prinskumar-tigergraph

@prinskumar-tigergraph
Copy link
Collaborator Author

closing this PR, created the separate branch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants