Inconsistent Metadata: extraction_metadata returns unusable table ID for content related to Table chunks (expected chunk_id) #65

@PrudhviGudla

Description

I am encountering an issue when using the extract() method on documents containing tables.

When validating the extraction output against the source chunks (for visual grounding/bounding boxes), the extraction_metadata usually provides a chunk_id that corresponds to an ID in the chunks list from the parse output.

However, if the extracted field comes from a Table, the extraction_metadata references an ID (likely an internal table_id) that does not exist in the parsed chunks list. This makes it impossible to link the extracted table content back to its location in the original document.

To Reproduce:

  1. Parse a document containing a table using ade.parse().
  2. Define a schema that extracts data specifically from that table.
  3. Run ade.extract().
  4. Inspect the extraction_metadata for the table-derived field.
  5. Attempt to find that ID in the list of chunks returned by step 1.
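
The lookup in the last step can be sketched as below. This is only an illustration on plain dicts, not the library's actual response objects: it assumes each parsed chunk exposes an "id" key and that each extracted field in extraction_metadata carries a "references" list of IDs. The helper names and the sample data are hypothetical.

```python
from typing import Any, Iterable


def collect_references(metadata: Any) -> list[str]:
    """Recursively gather every ID listed under a "references" key.

    Assumption: extraction_metadata is a nested dict/list structure in which
    each extracted field has a "references" list of ID strings.
    """
    found: list[str] = []
    if isinstance(metadata, dict):
        for key, value in metadata.items():
            if key == "references" and isinstance(value, list):
                found.extend(str(v) for v in value)
            else:
                found.extend(collect_references(value))
    elif isinstance(metadata, list):
        for item in metadata:
            found.extend(collect_references(item))
    return found


def unresolved_references(chunks: Iterable[dict], metadata: Any) -> list[str]:
    """Return the referenced IDs that match no chunk ID (assumed key: "id")."""
    chunk_ids = {chunk["id"] for chunk in chunks}
    return [ref for ref in collect_references(metadata) if ref not in chunk_ids]


# Made-up sample data that mimics the reported behaviour.
chunks = [
    {"id": "chunk-aaa", "type": "text"},
    {"id": "chunk-bbb", "type": "table"},
]
metadata = {
    "invoice_number": {"value": "INV-1", "references": ["chunk-aaa"]},
    "line_items": [{"total": {"value": "42.00", "references": ["0-1"]}}],
}

print(unresolved_references(chunks, metadata))  # ['0-1']
```

With real output, a non-empty result for a table-derived field is the symptom described in this issue.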

Expected Behavior: The ID returned in extraction_metadata should match the chunk_id (or id) present in the parsed chunks list, allowing users to look up bounding boxes and page numbers.

Actual Behavior: The ID returned in extraction_metadata for the table field is a distinct ID that cannot be found in the chunks list, which breaks the grounding link. It is actually the table ID that appears in the markdown output of parse(). That ID is unreliable because, unlike the chunk ID, it is not unique. This is probably an issue in how ADE works behind the scenes in the extract() method: it picks up the table ID rather than the chunk ID just above it in the markdown.
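
Until this is fixed, a possible stopgap is to rebuild the link from the parse() markdown. The sketch below is a guess at the markup, not a documented format: it assumes each chunk is introduced by an anchor of the form `<a id='...'></a>` and that each table carries its own `id` attribute on the `<table>` tag. The function name and sample string are hypothetical. Because the table ID is not unique, the mapping returns a list of candidate chunk IDs per table ID instead of a single value.

```python
import re
from collections import defaultdict

# Assumed markup (not confirmed by the parse() documentation):
#   <a id='CHUNK_ID'></a>  introduces a chunk
#   <table id="TABLE_ID">  opens a table inside that chunk
ANCHOR = re.compile(r"<a\s+id=['\"]([^'\"]+)['\"]\s*>\s*</a>")
TABLE = re.compile(r"<table\s+id=['\"]([^'\"]+)['\"]")


def table_to_chunk_ids(markdown: str) -> dict[str, list[str]]:
    """Map each table ID to the chunk anchor(s) that most recently precede it."""
    events = [(m.start(), "anchor", m.group(1)) for m in ANCHOR.finditer(markdown)]
    events += [(m.start(), "table", m.group(1)) for m in TABLE.finditer(markdown)]

    mapping: dict[str, list[str]] = defaultdict(list)
    current_chunk = None
    for _, kind, ident in sorted(events):  # walk the document in order
        if kind == "anchor":
            current_chunk = ident
        elif current_chunk is not None:
            mapping[ident].append(current_chunk)
    return dict(mapping)


# Made-up markdown in the assumed shape.
sample = (
    "<a id='chunk-aaa'></a>\n\nSome text\n\n"
    "<a id='chunk-bbb'></a>\n\n"
    "<table id=\"0-1\"><tr><td>42.00</td></tr></table>\n"
)

print(table_to_chunk_ids(sample))  # {'0-1': ['chunk-bbb']}
```

If a table ID maps to more than one chunk ID, the reference is ambiguous and cannot be grounded reliably, which is why returning the chunk ID directly from extract() would be the better fix.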
