
Fix/remove angle brackets in prompt schema #122

Merged
helloml0326 merged 3 commits into main from fix/remove-angle-brackets-in-prompt-schema
Feb 12, 2026

Conversation

@XiaoBoAI
Collaborator

OpenJudge Version

[The version of OpenJudge you are working on, e.g. `import openjudge; print(openjudge.__version__)`]

Description

[Please describe the background, purpose, changes made, and how to test this PR]

Checklist

Please check the following items before the code is ready for review.

  • Code has been formatted with the `pre-commit run --all-files` command
  • All tests are passing
  • Docstrings are in Google style
  • Related documentation has been updated (e.g. links, examples, etc.)
  • Code is ready for review

XiaoBoAI and others added 2 commits February 12, 2026 15:59
…templates

Angle bracket placeholders like `<integer between 1 and 5>` in Output Schema
sections could be misinterpreted as XML tags by LLMs, causing malformed output.
Replaced with plain text descriptions to avoid ambiguity with surrounding XML tags.

Co-authored-by: Cursor <cursoragent@cursor.com>
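
For illustration, the change amounts to something like the following (a hypothetical sketch; the field names are assumptions, not the PR's exact schema):

```python
# Hypothetical before/after of one Output Schema line in a grader prompt.
before = '{"score": <integer between 1 and 5>, "reason": "<brief justification>"}'
# Amid the prompt's surrounding XML-style tags, "<integer between 1 and 5>" can
# be read by an LLM as another tag; a plain-text placeholder removes that cue.
after = '{"score": integer between 1 and 5, "reason": "brief justification"}'
```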
Text graders previously assumed English (space-delimited) input, causing
all word-level metrics (BLEU, ROUGE, F1, Jaccard, etc.) to fail on CJK
text. This commit adds comprehensive multilingual support:

- tokenization.py: add jieba-based smart_tokenize with 3 cut modes
  (accurate/search/all), built-in Chinese stop-word filtering, and
  CJK punctuation handling
- normalization.py: fix normalize_text_advanced deleting all CJK chars
  ([^a-zA-Z0-9] -> [\W\s]), extend punctuation sets to include CJK
- compute.py: replace all text.split() with smart_tokenize; CJK text
  automatically uses word_tokenize with stop-word/punctuation removal
  for BLEU, GLEU, METEOR, ROUGE, F1, cosine, Jaccard metrics
- string_match_compute.py: fix word_overlap to use smart_tokenize
- __init__.py: export new contains_cjk and smart_tokenize utilities

Co-authored-by: Cursor <cursoragent@cursor.com>
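
A minimal sketch of what this commit describes (an assumed shape; the real code lives in openjudge/graders/text/_utils/tokenization.py, and the stop-word set and CJK ranges below are simplified illustrations):

```python
import re

import jieba  # pip install jieba

# Covers only the main CJK blocks; the real table is broader (an assumption here).
_CJK_RE = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")
_CN_STOP_WORDS = {"的", "了", "是", "在", "和"}  # tiny illustrative subset

def contains_cjk(text: str) -> bool:
    """Return True if the text contains any CJK character."""
    return bool(_CJK_RE.search(text))

def smart_tokenize(text: str, cut_mode: str = "accurate") -> list[str]:
    """Language-aware tokenization: jieba for CJK input, whitespace otherwise."""
    if not contains_cjk(text):
        return text.split()
    if cut_mode == "search":
        tokens = jieba.lcut_for_search(text)  # finer-grained, recall-oriented
    elif cut_mode == "all":
        tokens = jieba.lcut(text, cut_all=True)  # every possible segmentation
    else:
        tokens = jieba.lcut(text)  # accurate mode (default)
    # Drop whitespace-only tokens and common Chinese stop words.
    return [t for t in tokens if t.strip() and t not in _CN_STOP_WORDS]
```

With this shape, each word-level metric can call smart_tokenize in place of text.split() and get sensible tokens for both English and CJK input.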
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @XiaoBoAI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the clarity of prompt schemas by removing unnecessary angle brackets from JSON output examples, making the expected output less ambiguous for language models. More significantly, it introduces extensive CJK language support into the core text processing utilities, enabling accurate, language-aware evaluation metrics and text handling for multilingual content. These enhancements ensure that text analysis functions behave correctly across both English and CJK text.

Highlights

  • Prompt Schema Updates: Removed angle brackets from placeholder values within JSON output schema examples across various multi-turn and multimodal graders. This clarifies the expected output format for language models.
  • Comprehensive CJK Language Support: Introduced robust support for Chinese, Japanese, and Korean (CJK) text processing across text utility functions. This includes language-aware tokenization, normalization, and enhanced metric computations (BLEU, GLEU, METEOR, ROUGE, F1, Jaccard, TF-IDF, word overlap).
  • Advanced Tokenization and Normalization: Implemented jieba integration for CJK word segmentation, CJK-specific punctuation and stop-word handling, and Unicode-aware regex for text normalization, significantly improving multilingual text analysis capabilities.
Changelog
  • openjudge/graders/multi_turn/anaphora_resolution_grader.py
    • Removed angle brackets from JSON output schema examples in both Chinese and English prompts.
  • openjudge/graders/multi_turn/context_memory_grader.py
    • Removed angle brackets from JSON output schema examples in both Chinese and English prompts.
  • openjudge/graders/multi_turn/instruction_clarification_grader.py
    • Removed angle brackets from JSON output schema examples in both Chinese and English prompts.
  • openjudge/graders/multi_turn/proactive_interaction_grader.py
    • Removed angle brackets from JSON output schema examples in both Chinese and English prompts.
  • openjudge/graders/multi_turn/response_repetition_grader.py
    • Removed angle brackets from JSON output schema examples in both Chinese and English prompts.
  • openjudge/graders/multi_turn/self_correction_grader.py
    • Removed angle brackets from JSON output schema examples in both Chinese and English prompts.
  • openjudge/graders/multi_turn/topic_switch_grader.py
    • Removed angle brackets from JSON output schema examples in both Chinese and English prompts.
  • openjudge/graders/multimodal/image_coherence.py
    • Removed angle brackets from JSON output schema examples in both English and Chinese prompts.
  • openjudge/graders/multimodal/image_helpfulness.py
    • Removed angle brackets from JSON output schema examples in both English and Chinese prompts.
  • openjudge/graders/multimodal/text_to_image.py
    • Removed angle brackets from JSON output schema examples in both English and Chinese prompts for various score types.
  • openjudge/graders/text/_utils/__init__.py
    • Imported and exposed contains_cjk and smart_tokenize functions.
  • openjudge/graders/text/_utils/compute.py
    • Integrated CJK-aware tokenization into BLEU, GLEU, METEOR, ROUGE, F1, Jaccard, TF-IDF, and word overlap computations.
    • Added _cjk_tfidf_tokenizer for TF-IDF processing of CJK text.
    • Updated _token_sort_ratio and _cosine_simple to use smart tokenization.
    • Added documentation to reflect CJK support in various compute functions.
  • openjudge/graders/text/_utils/normalization.py
    • Defined _CJK_PUNCTUATION and _ALL_PUNCTUATION sets for comprehensive punctuation handling.
    • Enhanced normalize_text and remove_punctuation to correctly handle CJK punctuation.
    • Improved normalize_text_advanced to use NFKC Unicode normalization and re.UNICODE for special character removal, preserving CJK characters (see the sketch after this changelog).
  • openjudge/graders/text/_utils/string_match_compute.py
    • Integrated CJK-aware tokenization into compute_word_overlap and _ngram_tokenize functions.
    • Added documentation to highlight CJK support.
  • openjudge/graders/text/_utils/tokenization.py
    • Introduced contains_cjk for detecting CJK characters.
    • Implemented jieba integration for CJK word segmentation with various cut modes.
    • Defined CJK-specific punctuation and a default Chinese stop-word list.
    • Created smart_tokenize as a language-aware tokenization entry point.
    • Updated simple_tokenize, word_tokenize, ngram_tokenize, sentence_tokenize, tokenize_preserving_case, and whitespace_tokenize to be language-aware and support CJK text.
    • Exported new CJK-related functions and constants in __all__.
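
As referenced in the normalization.py entry above, here is a minimal sketch of the normalize_text_advanced fix (an assumed shape; the real function carries more options):

```python
import re
import unicodedata

def normalize_text_advanced(text: str) -> str:
    """Illustrative sketch: normalize text without deleting CJK characters."""
    # NFKC folds compatibility forms, e.g. fullwidth "Ａ" -> "A".
    text = unicodedata.normalize("NFKC", text)
    # The old pattern [^a-zA-Z0-9] stripped every CJK character. \W under
    # re.UNICODE treats CJK letters as word characters, so they survive.
    text = re.sub(r"[\W\s]+", " ", text, flags=re.UNICODE)
    return text.strip().lower()
```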

@gemini-code-assist bot left a comment

Code Review

This pull request primarily focuses on two areas: fixing prompt schemas by removing angle brackets and introducing comprehensive CJK language support into the text utility functions. The prompt schema fixes are straightforward and improve clarity. The addition of CJK support is a significant and valuable enhancement, and the implementation appears robust. My review identifies an opportunity to improve maintainability by refactoring some duplicated constants related to punctuation. The specific comments below detail this suggested refactoring.

```python
)

# Combined punctuation (ASCII + CJK)
_ALL_PUNCTUATION = set(string.punctuation) | CJK_PUNCTUATION
```

medium

To allow this constant to be reused in other modules like normalization.py and avoid code duplication, it's better to make it public. Please rename _ALL_PUNCTUATION to ALL_PUNCTUATION.

Suggested change

```diff
-_ALL_PUNCTUATION = set(string.punctuation) | CJK_PUNCTUATION
+ALL_PUNCTUATION = set(string.punctuation) | CJK_PUNCTUATION
```

```python
t_stripped = t.strip()
if not t_stripped:
    continue
if remove_punctuation and (t_stripped in _ALL_PUNCTUATION or all(ch in _ALL_PUNCTUATION for ch in t_stripped)):
```

medium

Following the suggestion to make _ALL_PUNCTUATION public, please update its usage here to ALL_PUNCTUATION.

Suggested change

```diff
-if remove_punctuation and (t_stripped in _ALL_PUNCTUATION or all(ch in _ALL_PUNCTUATION for ch in t_stripped)):
+if remove_punctuation and (t_stripped in ALL_PUNCTUATION or all(ch in ALL_PUNCTUATION for ch in t_stripped)):
```

Comment on lines +560 to 561

```python
    "CJK_PUNCTUATION",
]
```

medium

To make ALL_PUNCTUATION importable from other modules, please add it to __all__.

Suggested change

```diff
     "CJK_PUNCTUATION",
+    "ALL_PUNCTUATION",
 ]
```

Comment on lines +14 to +30

```python
# CJK fullwidth and common punctuation
_CJK_PUNCTUATION = set(
    "\u3001\u3002"  # 、。
    "\uff01\uff1f"  # !?
    "\uff0c\uff1b\uff1a"  # ,;:
    "\u201c\u201d"  # ""
    "\u2018\u2019"  # ''
    "\u3010\u3011"  # 【】
    "\uff08\uff09"  # ()
    "\u300a\u300b"  # 《》
    "\u2026\u2014\uff5e\u00b7"  # …—~·
    "\u300c\u300d"  # 「」
    "\u300e\u300f"  # 『』
    "\u3008\u3009"  # 〈〉
    "\u3014\u3015"  # 〔〕
)
_ALL_PUNCTUATION = set(string.punctuation) | _CJK_PUNCTUATION
```

medium

This set of punctuation constants is also defined in tokenization.py. To avoid code duplication and ensure consistency, please remove this definition and import ALL_PUNCTUATION from openjudge.graders.text._utils.tokenization instead. You can alias it to _ALL_PUNCTUATION to match the existing usage in this file.

(Please move the import statement to the top of the file with other imports.)

```python
from openjudge.graders.text._utils.tokenization import ALL_PUNCTUATION

_ALL_PUNCTUATION = ALL_PUNCTUATION
```

Simplify the tokenization architecture by using jieba as the single
tokenization backend for all text (English, Chinese, mixed). This
removes the need for CJK detection branching — jieba natively handles
English words, numbers, and CJK text uniformly.

Key changes:
- tokenization.py: remove contains_cjk/text.split() dual paths;
  all functions now go through jieba directly
- compute.py: remove all is_cjk branching; unified word_tokenize
  calls for BLEU, GLEU, METEOR, ROUGE, F1, cosine, Jaccard
- ROUGE scorer uses jieba tokenizer + optional Porter stemmer for
  English stemming support
- string_match_compute.py: simplified word_overlap via word_tokenize
- __init__.py: cleaned up exports

Co-authored-by: Cursor <cursoragent@cursor.com>
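
A quick illustration of the unified backend (the printed output is indicative; exact segmentation depends on jieba's dictionary and version):

```python
import jieba

# One code path for English, Chinese, and mixed input: jieba passes ASCII
# words and numbers through and segments CJK runs against its dictionary.
tokens = [t for t in jieba.lcut("OpenJudge 支持中文分词 and English words") if t.strip()]
print(tokens)
# Indicative output: ['OpenJudge', '支持', '中文', '分词', 'and', 'English', 'words']
```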
helloml0326 merged commit 2cd5ef3 into main on Feb 12, 2026
2 checks passed