Fix/remove angle brackets in prompt schema #122
Conversation
…templates

Angle bracket placeholders like `<integer between 1 and 5>` in Output Schema sections could be misinterpreted as XML tags by LLMs, causing malformed output. Replaced with plain-text descriptions to avoid ambiguity with the surrounding XML tags.

Co-authored-by: Cursor <cursoragent@cursor.com>
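A hypothetical illustration of the kind of change this commit makes; the actual template strings are not shown in this thread, so both examples below are assumptions:

```python
# Before: the placeholder itself looks like an XML tag, so an LLM nested
# inside XML-tagged prompt sections may echo it back or treat it as markup.
schema_before = "<score><integer between 1 and 5></score>"

# After: a plain-text description inside the real tag removes the ambiguity.
schema_after = "<score>an integer between 1 and 5</score>"
```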
Text graders previously assumed English (space-delimited) input, causing all word-level metrics (BLEU, ROUGE, F1, Jaccard, etc.) to fail on CJK text. This commit adds comprehensive multilingual support:

- tokenization.py: add jieba-based smart_tokenize with 3 cut modes (accurate/search/all), built-in Chinese stop-word filtering, and CJK punctuation handling
- normalization.py: fix normalize_text_advanced deleting all CJK chars (`[^a-zA-Z0-9]` -> `[\W\s]`), extend punctuation sets to include CJK
- compute.py: replace all text.split() with smart_tokenize; CJK text automatically uses word_tokenize with stop-word/punctuation removal for BLEU, GLEU, METEOR, ROUGE, F1, cosine, Jaccard metrics
- string_match_compute.py: fix word_overlap to use smart_tokenize
- __init__.py: export new contains_cjk and smart_tokenize utilities

Co-authored-by: Cursor <cursoragent@cursor.com>
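A minimal sketch of the two exported utilities as this commit describes them; the actual signatures, CJK ranges, and stop-word filtering in tokenization.py may differ (stop-word removal is omitted here for brevity):

```python
import re

import jieba

# Rough CJK detection: Han ideographs, kana, and Hangul (illustrative ranges).
_CJK_RE = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")


def contains_cjk(text: str) -> bool:
    """Return True if the text contains at least one CJK character."""
    return bool(_CJK_RE.search(text))


def smart_tokenize(text: str, mode: str = "accurate") -> list[str]:
    """Tokenize with jieba using one of three cut modes."""
    if mode == "search":
        tokens = jieba.lcut_for_search(text)
    elif mode == "all":
        tokens = jieba.lcut(text, cut_all=True)
    else:  # "accurate" (default)
        tokens = jieba.lcut(text)
    # jieba yields whitespace runs as tokens; drop them.
    return [t for t in tokens if t.strip()]
```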
Summary of Changes

Hello @XiaoBoAI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request refines the clarity of prompt schemas by removing unnecessary angle brackets from JSON output examples, making them less ambiguous for language models. More significantly, it introduces extensive CJK language support into the core text-processing utilities, enabling more accurate evaluation metrics and text handling for multilingual content. These enhancements ensure that text-analysis functions behave correctly across both English and CJK text.
Code Review
This pull request primarily focuses on two areas: fixing prompt schemas by removing angle brackets and introducing comprehensive CJK language support into the text utility functions. The prompt schema fixes are straightforward and improve clarity. The addition of CJK support is a significant and valuable enhancement, and the implementation appears robust. My review identifies an opportunity to improve maintainability by refactoring some duplicated constants related to punctuation. The specific comments below detail this suggested refactoring.
In tokenization.py:

```python
)

# Combined punctuation (ASCII + CJK)
_ALL_PUNCTUATION = set(string.punctuation) | CJK_PUNCTUATION
```
To allow this constant to be reused in other modules like normalization.py and avoid code duplication, it's better to make it public. Please rename _ALL_PUNCTUATION to ALL_PUNCTUATION.
Suggested change:

```diff
-_ALL_PUNCTUATION = set(string.punctuation) | CJK_PUNCTUATION
+ALL_PUNCTUATION = set(string.punctuation) | CJK_PUNCTUATION
```
```python
t_stripped = t.strip()
if not t_stripped:
    continue
if remove_punctuation and (t_stripped in _ALL_PUNCTUATION or all(ch in _ALL_PUNCTUATION for ch in t_stripped)):
```
Following the suggestion to make _ALL_PUNCTUATION public, please update its usage here to ALL_PUNCTUATION.
Suggested change:

```diff
-if remove_punctuation and (t_stripped in _ALL_PUNCTUATION or all(ch in _ALL_PUNCTUATION for ch in t_stripped)):
+if remove_punctuation and (t_stripped in ALL_PUNCTUATION or all(ch in ALL_PUNCTUATION for ch in t_stripped)):
```
| "CJK_PUNCTUATION", | ||
| ] |
In normalization.py:

```python
# CJK fullwidth and common punctuation
_CJK_PUNCTUATION = set(
    "\u3001\u3002"              # 、。
    "\uff01\uff1f"              # ！？
    "\uff0c\uff1b\uff1a"        # ，；：
    "\u201c\u201d"              # “”
    "\u2018\u2019"              # ‘’
    "\u3010\u3011"              # 【】
    "\uff08\uff09"              # （）
    "\u300a\u300b"              # 《》
    "\u2026\u2014\uff5e\u00b7"  # …—～·
    "\u300c\u300d"              # 「」
    "\u300e\u300f"              # 『』
    "\u3008\u3009"              # 〈〉
    "\u3014\u3015"              # 〔〕
)
_ALL_PUNCTUATION = set(string.punctuation) | _CJK_PUNCTUATION
```
This set of punctuation constants is also defined in tokenization.py. To avoid code duplication and ensure consistency, please remove this definition and import ALL_PUNCTUATION from openjudge.graders.text._utils.tokenization instead. You can alias it to _ALL_PUNCTUATION to match the existing usage in this file. (Please move the import statement to the top of the file with the other imports.)

```python
from openjudge.graders.text._utils.tokenization import ALL_PUNCTUATION

_ALL_PUNCTUATION = ALL_PUNCTUATION
```

Simplify the tokenization architecture by using jieba as the single tokenization backend for all text (English, Chinese, mixed). This removes the need for CJK detection branching; jieba natively handles English words, numbers, and CJK text uniformly. Key changes:

- tokenization.py: remove contains_cjk/text.split() dual paths; all functions now go through jieba directly
- compute.py: remove all is_cjk branching; unified word_tokenize calls for BLEU, GLEU, METEOR, ROUGE, F1, cosine, Jaccard
- ROUGE scorer uses jieba tokenizer + optional Porter stemmer for English stemming support
- string_match_compute.py: simplified word_overlap via word_tokenize
- __init__.py: cleaned up exports

Co-authored-by: Cursor <cursoragent@cursor.com>
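A minimal sketch of the unified backend this commit describes, assuming jieba's standard API; the name word_tokenize comes from the commit message, but the surrounding code is illustrative rather than the actual module:

```python
import jieba


def word_tokenize(text: str) -> list[str]:
    """Single jieba-backed code path for English, Chinese, and mixed text."""
    # jieba passes ASCII word/number runs through as single tokens and
    # segments CJK runs, so no contains_cjk branching is required.
    return [t for t in jieba.lcut(text) if t.strip()]


# Example: mixed-language input tokenizes without any language detection.
print(word_tokenize("hello 世界"))  # -> ['hello', '世界']
```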
OpenJudge Version

[The version of OpenJudge you are working on, e.g. `import openjudge; print(openjudge.__version__)`]

Description

[Please describe the background, purpose, changes made, and how to test this PR]

Checklist

Please check the following items before code is ready to be reviewed.

- [ ] `pre-commit run --all-files` command