@lfoppiano lfoppiano commented Sep 13, 2025

The new --json parameter allows the client to output JSON files in addition to the TEI XML files.
Markdown output is still to come.

Example:
science.abj2096.grobid.tei.xml
science.abj2096.json

/cc @ericjeangirard

@lfoppiano lfoppiano mentioned this pull request Sep 13, 2025
@lfoppiano lfoppiano requested a review from Copilot October 12, 2025 20:16

Copilot AI left a comment

Pull Request Overview

This PR adds JSON output functionality to the GROBID client, allowing users to convert TEI XML output to structured JSON format similar to CORD-19. The implementation includes a new --json command-line flag and corresponding json_output parameter for the Python API.

Key changes:

  • Added --json command-line flag and json_output parameter throughout the processing pipeline
  • Implemented TEI2LossyJSONConverter class for converting TEI XML to structured JSON
  • Updated documentation with JSON output examples and format specification
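To make the idea of the TEI-to-JSON conversion concrete, here is a minimal sketch using only the standard library. It is not the PR's TEI2LossyJSONConverter: the function name `tei_to_simple_json`, the sample document, and the output keys (`title`, `body_text`, `section`, `text`) are assumptions chosen for illustration only.

```python
import json
import xml.etree.ElementTree as ET

# The TEI namespace; elements in GROBID output live under it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, heavily simplified TEI sample (an assumption, not real GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A sample title</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div><head>Introduction</head><p>First paragraph.</p><p>Second paragraph.</p></div>
    </body>
  </text>
</TEI>"""


def tei_to_simple_json(tei_xml: str) -> dict:
    """Flatten title and body paragraphs into a small dict (lossy by design)."""
    root = ET.fromstring(tei_xml)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    body_text = []
    for div in root.findall(".//tei:body/tei:div", TEI_NS):
        head_el = div.find("tei:head", TEI_NS)
        section = head_el.text if head_el is not None else None
        for p in div.findall("tei:p", TEI_NS):
            # itertext() also collects text nested inside inline elements.
            body_text.append({"section": section, "text": "".join(p.itertext()).strip()})
    return {
        "title": title_el.text if title_el is not None else None,
        "body_text": body_text,
    }


doc = tei_to_simple_json(SAMPLE_TEI)
print(json.dumps(doc, indent=2))
```

The sketch drops everything it does not explicitly extract, which is the sense in which such a conversion is lossy.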

Reviewed Changes

Copilot reviewed 3 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| grobid_client/grobid_client.py | Added json_output parameter throughout processing pipeline and JSON conversion logic |
| grobid_client/format/TEI2LossyJSON.py | New converter class for TEI XML to JSON transformation with streaming and batch processing capabilities |
| Readme.md | Added comprehensive documentation for JSON output feature with examples and format specification |

lfoppiano and others added 4 commits October 12, 2025 22:18
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@lfoppiano lfoppiano requested a review from Copilot October 13, 2025 13:16

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 3 out of 5 changed files in this pull request and generated 4 comments.


from bs4 import BeautifulSoup
import dateparser
# Reuse existing top-level helpers from this module by importing here
from grobid_client.format.TEI2LossyJSON import box_to_dict, get_random_id, get_formatted_passage, get_refs_with_offsets, xml_table_to_json

Copilot AI Oct 13, 2025

This import is circular and could cause issues at import time. Consider moving the helper functions into a separate utilities module, or importing from the current module in a different way.

Comment on lines 336 to 337
if ref_text in text:
    start_offset = text.find(ref_text)

Copilot AI Oct 13, 2025

Using text.find(ref_text) may return the wrong offset if the same reference text appears multiple times in the passage. This could lead to incorrect offset calculations for references.
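One possible way to address this, sketched below, is to carry a search cursor forward so that each lookup starts where the previous match ended. The helper name `find_ref_offsets` is hypothetical, and the sketch assumes the reference strings arrive in the same order in which they occur in the passage; it is not the code merged in this PR.

```python
from typing import List, Tuple


def find_ref_offsets(text: str, ref_texts: List[str]) -> List[Tuple[str, int, int]]:
    """Locate each reference string in order, never matching before the previous hit."""
    offsets = []
    cursor = 0  # the next search resumes here, so repeats map to successive occurrences
    for ref_text in ref_texts:
        start = text.find(ref_text, cursor)
        if start == -1:
            # Not found after the cursor: skip rather than guess an offset.
            continue
        end = start + len(ref_text)
        offsets.append((ref_text, start, end))
        cursor = end
    return offsets


passage = "As shown in [1], and later confirmed in [1], the effect holds."
result = find_ref_offsets(passage, ["[1]", "[1]"])
```

With a plain `text.find(ref_text)`, both references in `passage` would receive the offset of the first `[1]`; with the cursor, the second reference resolves to the second occurrence.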

@lfoppiano lfoppiano requested a review from Copilot October 13, 2025 13:51

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 3 out of 5 changed files in this pull request and generated 2 comments.


from typing import Tuple
import copy

from .format.TEI2LossyJSON import TEI2LossyJSONConverter

Copilot AI Oct 13, 2025

Consider adding import validation or error handling for the TEI2LossyJSON module to gracefully handle cases where dependencies might be missing.

Suggested change
from .format.TEI2LossyJSON import TEI2LossyJSONConverter
try:
    from .format.TEI2LossyJSON import TEI2LossyJSONConverter
except ImportError as e:
    TEI2LossyJSONConverter = None
    import warnings
    warnings.warn(
        "Optional dependency 'TEI2LossyJSONConverter' could not be imported. "
        "Some features may not be available. Original error: {}".format(e),
        ImportWarning
    )
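Because the suggestion leaves TEI2LossyJSONConverter set to None when the import fails, any caller would also need a guard before using it. The sketch below shows one way that could look; `load_optional`, `require`, and the module name in the demo call are hypothetical and exist only for illustration. Note that Python ignores ImportWarning under its default warning filters, so the warning alone may go unnoticed.

```python
import importlib
import warnings


def load_optional(module_name: str, attr: str):
    """Return module.attr, or None (after emitting a warning) if the import fails."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as e:
        warnings.warn(f"Optional component '{attr}' is unavailable: {e}", ImportWarning)
        return None
    return getattr(module, attr, None)


def require(component, feature: str):
    """Fail fast with a readable message when an optional component is missing."""
    if component is None:
        raise RuntimeError(f"{feature} was requested but its converter could not be imported.")
    return component


# Hypothetical module name, used only to exercise the fallback path in this sketch.
Converter = load_optional("hypothetical_missing_module_for_demo", "TEI2LossyJSONConverter")
```

Calling `require(Converter, "JSON output")` at the point where JSON output is requested turns a later `'NoneType' object is not callable` error into an explicit message.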

@lfoppiano lfoppiano merged commit 7cd66c0 into master Oct 31, 2025
6 checks passed
@lfoppiano lfoppiano deleted the feature/json_output branch October 31, 2025 14:21