JSON output #91
Pull Request Overview
This PR adds JSON output functionality to the GROBID client, allowing users to convert TEI XML output to structured JSON format similar to CORD-19. The implementation includes a new --json command-line flag and corresponding json_output parameter for the Python API.
Key changes:
- Added `--json` command-line flag and `json_output` parameter throughout the processing pipeline
- Implemented `TEI2LossyJSONConverter` class for converting TEI XML to structured JSON
- Updated documentation with JSON output examples and format specification
Reviewed Changes
Copilot reviewed 3 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| grobid_client/grobid_client.py | Added json_output parameter throughout processing pipeline and JSON conversion logic |
| grobid_client/format/TEI2LossyJSON.py | New converter class for TEI XML to JSON transformation with streaming and batch processing capabilities |
| Readme.md | Added comprehensive documentation for JSON output feature with examples and format specification |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot reviewed 3 out of 5 changed files in this pull request and generated 4 comments.
    from bs4 import BeautifulSoup
    import dateparser
    # Reuse existing top-level helpers from this module by importing here
    from grobid_client.format.TEI2LossyJSON import box_to_dict, get_random_id, get_formatted_passage, get_refs_with_offsets, xml_table_to_json
Copilot
AI
Oct 13, 2025
This circular import could cause issues. Consider refactoring helper functions into a separate utilities module or importing the current module differently.
    if ref_text in text:
        start_offset = text.find(ref_text)
Copilot
AI
Oct 13, 2025
Using text.find(ref_text) may return the wrong offset if the same reference text appears multiple times in the passage. This could lead to incorrect offset calculations for references.
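One way to address the repeated-text problem is to search from a moving cursor rather than from the start of the passage each time. The sketch below is not the PR's code: `find_ref_offsets` is a hypothetical helper, and it assumes the reference strings are supplied in the order they appear in the passage.

```python
from typing import List, Tuple


def find_ref_offsets(text: str, ref_texts: List[str]) -> List[Tuple[str, int, int]]:
    """Locate each reference string, never reusing an earlier match.

    Hypothetical helper: assumes ref_texts are in document order.
    """
    offsets = []
    cursor = 0  # position just past the previous match
    for ref_text in ref_texts:
        # str.find accepts a start index, so earlier occurrences are skipped
        start = text.find(ref_text, cursor)
        if start == -1:
            # Reference text not found after the cursor: skip it
            continue
        end = start + len(ref_text)
        offsets.append((ref_text, start, end))
        cursor = end
    return offsets


passage = "See [1] and later [1] again, plus [2]."
result = find_ref_offsets(passage, ["[1]", "[1]", "[2]"])
```

With a plain `text.find(ref_text)`, both `[1]` references would receive the offset of the first occurrence. With the cursor, the second one resolves to the later position.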
Copilot reviewed 3 out of 5 changed files in this pull request and generated 2 comments.
    from typing import Tuple
    import copy

    from .format.TEI2LossyJSON import TEI2LossyJSONConverter
Copilot
AI
Oct 13, 2025
Consider adding import validation or error handling for the TEI2LossyJSON module to gracefully handle cases where dependencies might be missing.
Suggested change:

    from .format.TEI2LossyJSON import TEI2LossyJSONConverter

becomes:

    try:
        from .format.TEI2LossyJSON import TEI2LossyJSONConverter
    except ImportError as e:
        TEI2LossyJSONConverter = None
        import warnings
        warnings.warn(
            "Optional dependency 'TEI2LossyJSONConverter' could not be imported. "
            "Some features may not be available. Original error: {}".format(e),
            ImportWarning
        )
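A guarded import like this leaves the name bound to `None`, so any later call to it would fail with an unhelpful `TypeError` unless the call site checks first. The following is a minimal, self-contained sketch of pairing the guard with an explicit check at the point of use. `optional_import` and `require` are hypothetical helpers, not part of the GROBID client, and the module name in the usage is made up so that the import fails.

```python
import importlib
import warnings


def optional_import(module_name: str, attr: str):
    """Return module.attr, or None (with an ImportWarning) if the import fails."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        warnings.warn(
            "Optional component {!r} could not be imported: {}".format(attr, exc),
            ImportWarning,
        )
        return None
    return getattr(module, attr, None)


def require(component, feature: str):
    """Fail with a clear message when an optional component is missing."""
    if component is None:
        raise RuntimeError(
            "{} was requested but its converter could not be imported".format(feature)
        )
    return component


def load_converter():
    # Hypothetical module name, chosen so that the import fails
    return optional_import("hypothetical_missing_converter_module", "Converter")
```

A caller would then write something like `require(load_converter(), "JSON output")` just before converting, so a missing dependency surfaces as a readable error only when the JSON feature is actually used.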
…uced because already existing
The new `--json` parameter makes the client output both TEI XML and JSON files. Markdown format to come.
Example:
science.abj2096.grobid.tei.xml
science.abj2096.json
/cc @ericjeangirard