Description
I am encountering an issue when using the extract() method on documents containing tables.
When validating the extraction output against the source chunks (for visual grounding/bounding boxes), the extraction_metadata usually provides a chunk_id that corresponds to an ID in the chunks list from the parse output.
However, if the extracted field comes from a Table, the extraction_metadata references an ID (likely an internal table_id) that does not exist in the parsed chunks list. This makes it impossible to link the extracted table content back to its location in the original document.
To Reproduce:
- Parse a document containing a table using ade.parse().
- Define a schema that extracts data specifically from that table.
- Run ade.extract().
- Inspect the extraction_metadata for the table-derived field.
- Attempt to find that ID in the list of chunks returned by step 1.
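The lookup in the last step can be sketched as follows. This is a minimal illustration rather than the library's actual API: it assumes the parse output has already been converted to a list of plain dicts with an `id` key, and that extraction_metadata is a dict mapping each field name to a dict with a `chunk_id` key. The helper name, the dict shapes, and the sample values are all hypothetical.

```python
def find_unresolved_references(chunks, extraction_metadata):
    """Return {field_name: referenced_id} for IDs missing from the chunks list.

    Assumed shapes (hypothetical, adapt to the real response objects):
      chunks:              [{"id": "<chunk id>", ...}, ...]
      extraction_metadata: {"<field>": {"chunk_id": "<referenced id>"}, ...}
    """
    known_ids = {chunk["id"] for chunk in chunks}
    unresolved = {}
    for field_name, meta in extraction_metadata.items():
        referenced_id = meta.get("chunk_id")
        if referenced_id is not None and referenced_id not in known_ids:
            unresolved[field_name] = referenced_id
    return unresolved


# Made-up sample data mirroring the reported behavior: the text field resolves,
# the table-derived field points at an ID that is not in the chunks list.
chunks = [
    {"id": "chunk-aaa", "type": "text"},
    {"id": "chunk-bbb", "type": "table"},
]
extraction_metadata = {
    "invoice_number": {"chunk_id": "chunk-aaa"},
    "line_item_total": {"chunk_id": "0-1"},
}

unresolved = find_unresolved_references(chunks, extraction_metadata)
print(unresolved)
```

Any field listed in the result cannot be linked back to a bounding box or page number through the chunks list.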
Expected Behavior: The ID returned in extraction_metadata should match the chunk_id (or id) present in the parsed chunks list, allowing users to look up bounding boxes and page numbers.
Actual Behavior: The ID returned in extraction_metadata for the table field cannot be found in the chunks list, which breaks the grounding link. It is actually the table ID that appears in the markdown output of parse(), and that ID is unreliable because, unlike the chunk ID, it is not unique. This is probably an issue in how ADE works behind the scenes in extract(): it picks up the table ID instead of the chunk ID just above it in the markdown.
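Until this is fixed, one possible stopgap is to recover the chunk ID from the markdown by pairing each table ID with the chunk ID that precedes it. The sketch below is based on assumptions, not on documented behavior: it supposes that chunk IDs appear in the markdown as `<a id='...'></a>` anchors and table IDs as `<table id="...">` attributes. Both patterns, the function name, and the sample markdown are hypothetical and should be checked against real parse() output.

```python
import re

# Assumed markdown conventions (hypothetical, verify against real parse() output).
CHUNK_ANCHOR = re.compile(r"<a id=['\"]([^'\"]+)['\"]></a>")
TABLE_TAG = re.compile(r"<table id=['\"]([^'\"]+)['\"]")


def map_table_ids_to_chunk_ids(markdown):
    """Map each table ID to the chunk IDs of the anchors that precede it.

    The values are lists because table IDs are not unique, so one table ID
    may correspond to several chunks.
    """
    events = []
    for match in CHUNK_ANCHOR.finditer(markdown):
        events.append((match.start(), "chunk", match.group(1)))
    for match in TABLE_TAG.finditer(markdown):
        events.append((match.start(), "table", match.group(1)))
    events.sort()  # process anchors and tables in document order

    mapping = {}
    current_chunk_id = None
    for _, kind, identifier in events:
        if kind == "chunk":
            current_chunk_id = identifier
        elif current_chunk_id is not None:
            mapping.setdefault(identifier, []).append(current_chunk_id)
    return mapping


# Made-up markdown in which two different table chunks reuse the table ID "0-1".
sample_markdown = """
<a id='chunk-aaa'></a>
Some introductory text.

<a id='chunk-bbb'></a>
<table id="0-1"><tr><td>A</td></tr></table>

<a id='chunk-ccc'></a>
<table id="0-1"><tr><td>B</td></tr></table>
"""

table_to_chunks = map_table_ids_to_chunk_ids(sample_markdown)
print(table_to_chunks)
```

When a table ID maps to more than one chunk ID, the reference is ambiguous, which is why returning the chunk ID directly from extract() would be the more reliable fix.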