JSON File Ingestion – Handling Metadata and Chunking #369

@troublesprouter

Description

When uploading a JSON file, I need Verba to ingest the structured metadata while still generating chunks automatically. The current behavior is unclear: the "chunks" field seems to have to be predefined, even though Verba generates chunks for PDFs automatically.

Expected Behavior:

  • Verba should recognize metadata fields without requiring predefined chunks.
  • The "content" field should be processed as document text.
  • Chunking should be handled automatically based on Verba’s settings.
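To make the expected behavior concrete, here is a minimal sketch of what I have in mind. It is not Verba's implementation: `split_record`, `chunk_words`, and `ingest` are hypothetical helpers, and the word-window chunker with its `size` and `overlap` parameters is only a stand-in for whatever chunker is selected in Verba's settings.

```python
from typing import Any


def split_record(record: dict[str, Any], content_key: str = "content") -> tuple[str, dict[str, Any]]:
    """Separate the document text from every other top-level field."""
    if not isinstance(record.get(content_key), str):
        raise ValueError(f"expected a string under {content_key!r}")
    metadata = {k: v for k, v in record.items() if k != content_key}
    return record[content_key], metadata


def chunk_words(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into windows of `size` words; neighbours share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def ingest(record: dict[str, Any], size: int = 100, overlap: int = 20) -> list[dict[str, Any]]:
    """Hypothetical end result: generated chunks, each carrying the document metadata."""
    text, metadata = split_record(record)
    return [
        {"text": chunk, "chunk_index": i, "metadata": metadata}
        for i, chunk in enumerate(chunk_words(text, size, overlap))
    ]
```

The point is only that no "chunks" field appears in the input: everything except "content" is treated as metadata, and the chunks are derived from the text.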

Actual Behavior:

  • The "chunks" field appears necessary, even though I want Verba to generate them dynamically.
  • The expected metadata structure is unclear: which fields should be included for proper indexing?

Example JSON File:

{
  "year": 1995,
  "number": "50",
  "title": "Circular Nº 50, del 13 de Diciembre de 1995 (modificada) (aclarada / complementada)",
  "materia": "Crédito Tributario por inversiones en provincias de Arica y Parinacota",
  "url": "https://www.sii.cl/documentos/circulares/1995/circu50.pdf",
  "sin_efecto": false,
  "downloaded_filename": "circu50.pdf",
  "saved_filename": "circular_1995_50_2.pdf",
  "content": "Modificada por Circular Nº 45, del 3 de septiembre de 2008 \n\nModificadas por Circular Nº 64, del 6 de noviembre de 1996 \n\nComplementada por Circular Nº 64, del 6 de noviembre de 1996 \n\nCIRCULAR Nº 50, DEL 13 DE D ETC ETC etc",
  "modificada": true,
  "aclarada_complementada": true
}

Installation

  • pip install goldenverba

If you installed via pip, please specify the version:

Weaviate Deployment

  • Local Deployment

Steps to Reproduce

  1. Go to the dashboard.
  2. Upload a JSON file with structured metadata.
  3. The metadata doesn't load, nor does the document title.

Additional Context

  • Do I need to structure metadata differently for proper indexing?
  • Should Verba automatically generate chunks even when metadata is present?
  • If so, how should metadata fields be formatted?
  • Is there a recommended JSON structure for structured documents without manually defining chunks?
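Until there is an answer, one workaround I am considering is to convert each JSON file to plain text before uploading. This is only a sketch under assumptions: `record_to_text` and `convert_file` are hypothetical helpers, I am assuming a plain-text upload is chunked automatically the way PDFs are, and I do not expect Verba to parse the header as structured metadata. It would only keep the field values inside the searchable text.

```python
import json
from pathlib import Path


def record_to_text(record: dict, content_key: str = "content") -> str:
    """Render metadata as 'key: value' lines, then a blank line, then the text."""
    header = [
        # json.dumps keeps JSON literals (false, numbers) and non-ASCII text readable
        f"{key}: {json.dumps(value, ensure_ascii=False)}"
        for key, value in record.items()
        if key != content_key
    ]
    return "\n".join(header) + "\n\n" + str(record.get(content_key, ""))


def convert_file(src: str, dst: str) -> None:
    """Read one JSON document from `src` and write its text rendering to `dst`."""
    record = json.loads(Path(src).read_text(encoding="utf-8"))
    Path(dst).write_text(record_to_text(record), encoding="utf-8")
```

For example, `convert_file("circular_1995_50_2.json", "circular_1995_50_2.txt")` (file names are illustrative) would produce a text file that starts with `year: 1995` and ends with the circular's content.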

@thomashacker Any guidance on this?

Labels: enhancement
