Fix/remove angle brackets in prompt schema #122
Conversation
…templates

Angle bracket placeholders like `<integer between 1 and 5>` in Output Schema sections could be misinterpreted as XML tags by LLMs, causing malformed output. Replaced with plain-text descriptions to avoid ambiguity with the surrounding XML tags.

Co-authored-by: Cursor <cursoragent@cursor.com>
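A hypothetical illustration of the kind of change this commit makes; the actual template strings are not shown in this thread, so both examples below are assumptions:

```python
# Before: the placeholder itself looks like an XML tag, so an LLM nested
# inside XML-tagged prompt sections may echo it back or treat it as markup.
schema_before = "<score><integer between 1 and 5></score>"

# After: a plain-text description inside the real tag removes the ambiguity.
schema_after = "<score>an integer between 1 and 5</score>"
```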
Text graders previously assumed English (space-delimited) input, causing all word-level metrics (BLEU, ROUGE, F1, Jaccard, etc.) to fail on CJK text. This commit adds comprehensive multilingual support:

- tokenization.py: add jieba-based smart_tokenize with 3 cut modes (accurate/search/all), built-in Chinese stop-word filtering, and CJK punctuation handling
- normalization.py: fix normalize_text_advanced deleting all CJK chars (`[^a-zA-Z0-9]` -> `[\W\s]`), extend punctuation sets to include CJK
- compute.py: replace all text.split() with smart_tokenize; CJK text automatically uses word_tokenize with stop-word/punctuation removal for BLEU, GLEU, METEOR, ROUGE, F1, cosine, Jaccard metrics
- string_match_compute.py: fix word_overlap to use smart_tokenize
- __init__.py: export new contains_cjk and smart_tokenize utilities

Co-authored-by: Cursor <cursoragent@cursor.com>
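A minimal sketch of the two exported utilities as this commit describes them; the actual signatures, CJK ranges, and stop-word filtering in tokenization.py may differ (stop-word removal is omitted here for brevity):

```python
import re

import jieba

# Rough CJK detection: Han ideographs, kana, and Hangul (illustrative ranges).
_CJK_RE = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")


def contains_cjk(text: str) -> bool:
    """Return True if the text contains at least one CJK character."""
    return bool(_CJK_RE.search(text))


def smart_tokenize(text: str, mode: str = "accurate") -> list[str]:
    """Tokenize with jieba using one of three cut modes."""
    if mode == "search":
        tokens = jieba.lcut_for_search(text)
    elif mode == "all":
        tokens = jieba.lcut(text, cut_all=True)
    else:  # "accurate" (default)
        tokens = jieba.lcut(text)
    # jieba yields whitespace runs as tokens; drop them.
    return [t for t in tokens if t.strip()]
```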
Summary of Changes

Hello @XiaoBoAI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request refines the clarity of prompt schemas by removing unnecessary angle brackets from JSON output examples, making them less ambiguous for language models. More significantly, it introduces extensive CJK language support into the core text-processing utilities, enabling more accurate evaluation metrics and text handling for multilingual content. These enhancements ensure that text-analysis functions behave correctly across both English and CJK text.
Code Review
This pull request primarily focuses on two areas: fixing prompt schemas by removing angle brackets and introducing comprehensive CJK language support into the text utility functions. The prompt schema fixes are straightforward and improve clarity. The addition of CJK support is a significant and valuable enhancement, and the implementation appears robust. My review identifies an opportunity to improve maintainability by refactoring some duplicated constants related to punctuation. The specific comments below detail this suggested refactoring.
In tokenization.py:

```python
)

# Combined punctuation (ASCII + CJK)
_ALL_PUNCTUATION = set(string.punctuation) | CJK_PUNCTUATION
```
To allow this constant to be reused in other modules like normalization.py and avoid code duplication, it's better to make it public. Please rename _ALL_PUNCTUATION to ALL_PUNCTUATION.
Suggested change:

```diff
-_ALL_PUNCTUATION = set(string.punctuation) | CJK_PUNCTUATION
+ALL_PUNCTUATION = set(string.punctuation) | CJK_PUNCTUATION
```
```python
t_stripped = t.strip()
if not t_stripped:
    continue
if remove_punctuation and (t_stripped in _ALL_PUNCTUATION or all(ch in _ALL_PUNCTUATION for ch in t_stripped)):
```
Following the suggestion to make _ALL_PUNCTUATION public, please update its usage here to ALL_PUNCTUATION.
Suggested change:

```diff
-if remove_punctuation and (t_stripped in _ALL_PUNCTUATION or all(ch in _ALL_PUNCTUATION for ch in t_stripped)):
+if remove_punctuation and (t_stripped in ALL_PUNCTUATION or all(ch in ALL_PUNCTUATION for ch in t_stripped)):
```
| "CJK_PUNCTUATION", | ||
| ] |
In normalization.py:

```python
# CJK fullwidth and common punctuation
_CJK_PUNCTUATION = set(
    "\u3001\u3002"              # 、。
    "\uff01\uff1f"              # ！？
    "\uff0c\uff1b\uff1a"        # ，；：
    "\u201c\u201d"              # “”
    "\u2018\u2019"              # ‘’
    "\u3010\u3011"              # 【】
    "\uff08\uff09"              # （）
    "\u300a\u300b"              # 《》
    "\u2026\u2014\uff5e\u00b7"  # …—～·
    "\u300c\u300d"              # 「」
    "\u300e\u300f"              # 『』
    "\u3008\u3009"              # 〈〉
    "\u3014\u3015"              # 〔〕
)
_ALL_PUNCTUATION = set(string.punctuation) | _CJK_PUNCTUATION
```
This set of punctuation constants is also defined in tokenization.py. To avoid code duplication and ensure consistency, please remove this definition and import ALL_PUNCTUATION from openjudge.graders.text._utils.tokenization instead. You can alias it to _ALL_PUNCTUATION to match the existing usage in this file. (Please move the import statement to the top of the file with the other imports.)

```python
from openjudge.graders.text._utils.tokenization import ALL_PUNCTUATION

_ALL_PUNCTUATION = ALL_PUNCTUATION
```

Simplify the tokenization architecture by using jieba as the single tokenization backend for all text (English, Chinese, mixed). This removes the need for CJK detection branching; jieba natively handles English words, numbers, and CJK text uniformly. Key changes:

- tokenization.py: remove contains_cjk/text.split() dual paths; all functions now go through jieba directly
- compute.py: remove all is_cjk branching; unified word_tokenize calls for BLEU, GLEU, METEOR, ROUGE, F1, cosine, Jaccard
- ROUGE scorer uses jieba tokenizer + optional Porter stemmer for English stemming support
- string_match_compute.py: simplified word_overlap via word_tokenize
- __init__.py: cleaned up exports

Co-authored-by: Cursor <cursoragent@cursor.com>
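A minimal sketch of the unified backend this commit describes, assuming jieba's standard API; the name word_tokenize comes from the commit message, but the surrounding code is illustrative rather than the actual module:

```python
import jieba


def word_tokenize(text: str) -> list[str]:
    """Single jieba-backed code path for English, Chinese, and mixed text."""
    # jieba passes ASCII word/number runs through as single tokens and
    # segments CJK runs, so no contains_cjk branching is required.
    return [t for t in jieba.lcut(text) if t.strip()]


# Example: mixed-language input tokenizes without any language detection.
print(word_tokenize("hello 世界"))  # -> ['hello', '世界']
```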
OpenJudge Version

[The version of OpenJudge you are working on, e.g. `import openjudge; print(openjudge.__version__)`]

Description

[Please describe the background, purpose, changes made, and how to test this PR]

Checklist

Please check the following items before code is ready to be reviewed.

- [ ] `pre-commit run --all-files` command