JSON output for paragraphs does not include parent headings #100

@Sanakhamassi

When using GROBID to parse PDFs into JSON format, paragraphs are assigned a head_section based on detected headings. However, if a paragraph is under a main heading and a subheading, the JSON output only includes the subheading in head_section. The parent/main heading is missing.

PDF structure:
Main heading: Methods
Subheading: Study design
Paragraph: "This study was conducted over a period of 6 months."

Current JSON output assigns only the subheading:
head_section: Study design
Text: "This study was conducted over a period of 6 months."

Below are example output files that demonstrate the issue:

GROBID TEI XML:
mjb3wlzxcb2mc-migowebupload-1766042162782.grobid.tei.xml
Generated JSON output:
mjb3wlzxcb2mc-migowebupload-1766042162782.json

In these files, paragraphs under nested headings only reference the lowest-level heading in head_section, while the parent heading is absent.
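To illustrate the behavior being requested, here is a minimal sketch of how the parent heading could be recovered from the TEI. It is not the converter's actual code. It assumes the TEI body contains flat `<div>` elements whose `<head>` carries a section number in an `n` attribute (for example `2` and `2.1`), and that nesting depth can be inferred from that number. The sample fragment, the function name `paragraphs_with_heading_path`, and the `head_path` field are hypothetical and used only for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical TEI fragment (assumption): flat <div> elements whose <head>
# carries the section number in an "n" attribute.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head n="2">Methods</head></div>
    <div><head n="2.1">Study design</head>
      <p>This study was conducted over a period of 6 months.</p>
    </div>
  </body></text>
</TEI>
"""


def paragraphs_with_heading_path(tei_xml):
    """Return one record per paragraph, including all ancestor headings."""
    root = ET.fromstring(tei_xml)
    stack = []    # (depth, title) pairs for the currently open headings
    results = []
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        if head is not None:
            number = (head.get("n") or "").strip(".")
            # "2" -> depth 1, "2.1" -> depth 2; unnumbered heads are
            # treated as top-level in this sketch.
            depth = number.count(".") + 1 if number else 1
            # Close any headings at the same or a deeper level.
            while stack and stack[-1][0] >= depth:
                stack.pop()
            stack.append((depth, "".join(head.itertext()).strip()))
        path = [title for _, title in stack]
        for p in div.iterfind("tei:p", TEI_NS):
            results.append({
                "head_section": path[-1] if path else None,  # current field
                "head_path": list(path),                     # hypothetical field
                "text": "".join(p.itertext()).strip(),
            })
    return results


if __name__ == "__main__":
    for record in paragraphs_with_heading_path(SAMPLE_TEI):
        print(record)
```

With a field like `head_path`, the example paragraph would keep `head_section` as "Study design" while also carrying "Methods" as its parent. Headings without a section number would need a different way to infer nesting, which this sketch does not attempt.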
