Description
When using GROBID to parse PDFs into JSON format, each paragraph is assigned a head_section based on the detected headings. However, when a paragraph sits under a subheading that is itself nested under a main heading, the JSON output includes only the subheading in head_section. The parent (main) heading is missing.
PDF structure:
Main heading: Methods
Subheading: Study design
Paragraph: "This study was conducted over a period of 6 months."
Current JSON output assigns only the subheading:
head_section: Study design
Text: "This study was conducted over a period of 6 months."
Below are example output files that demonstrate the issue:
GROBID TEI XML:
mjb3wlzxcb2mc-migowebupload-1766042162782.grobid.tei.xml
Generated JSON output:
mjb3wlzxcb2mc-migowebupload-1766042162782.json
In these files, paragraphs under nested headings only reference the lowest-level heading in head_section, while the parent heading is absent.
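To make the expected behaviour concrete, here is a minimal sketch of how the full heading path could be tracked while walking the document in order. It is not the converter's actual code: the `(kind, level, text)` input shape, the `attach_heading_paths` function, and the `head_path` field are all hypothetical, and it assumes a nesting level is available for each heading (for example, derived from heading numbering).

```python
from typing import Dict, List, Tuple

# Hypothetical input shape: (kind, level, text).
# kind is "head" or "p"; level is only meaningful for headings.
Item = Tuple[str, int, str]


def attach_heading_paths(items: List[Item]) -> List[Dict]:
    """Attach the full chain of enclosing headings to each paragraph."""
    stack: List[Tuple[int, str]] = []  # currently open headings, outermost first
    paragraphs: List[Dict] = []
    for kind, level, text in items:
        if kind == "head":
            # A new heading closes any open heading at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, text))
        else:
            titles = [title for _, title in stack]
            paragraphs.append({
                # Same value as today: the lowest-level heading.
                "head_section": titles[-1] if titles else None,
                # Hypothetical extra field: all enclosing headings, parent first.
                "head_path": titles,
                "text": text,
            })
    return paragraphs


items = [
    ("head", 1, "Methods"),
    ("head", 2, "Study design"),
    ("p", 0, "This study was conducted over a period of 6 months."),
]
result = attach_heading_paths(items)
```

For the example above, `result[0]["head_path"]` would hold both "Methods" and "Study design", while `head_section` keeps its current value, so existing consumers of the JSON would not be affected.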