@frankslin
Owner

  • Add description_zh field with Traditional Chinese descriptions to all 14 config files
  • Add examples field with typical conversion examples for each config
  • Add config-examples.test.js to validate conversion examples
  • Examples demonstrate differences between variants (e.g., s2tw vs s2twp)

Example conversions:

  • s2t: "鼠标" → "鼠標" (standard Traditional)
  • s2tw: "鼠标" → "滑鼠" (Taiwan standard)
  • s2twp: "数据库" → "資料庫" (Taiwan with phrases)
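
As a rough sketch of how a test like config-examples.test.js might check these examples, the snippet below walks each config's `examples` array and compares the converted text with the expected text. The `input` field name appears in the test code quoted in this PR; the `expected` field name, the `checkExamples` helper, and the lookup-table `stubConvert` are assumptions made for illustration and stand in for a real converter.

```js
// Stand-in for a real converter: a lookup table keyed by config name.
// (Assumption: a real test would call the WASM converter instead.)
const stubTables = {
  s2t: { "鼠标": "鼠標" },
  s2twp: { "数据库": "資料庫" },
};

function stubConvert(configName, text) {
  return stubTables[configName]?.[text] ?? text;
}

// Hypothetical config objects; `expected` is an assumed field name.
const configs = [
  { name: "s2t", examples: [{ input: "鼠标", expected: "鼠標" }] },
  { name: "s2twp", examples: [{ input: "数据库", expected: "資料庫" }] },
];

// Returns the list of mismatches; an empty list means every example passed.
function checkExamples(configList, convert) {
  const failures = [];
  for (const config of configList) {
    for (const example of config.examples ?? []) {
      const actual = convert(config.name, example.input);
      if (actual !== example.expected) {
        failures.push({ config: config.name, input: example.input, actual });
      }
    }
  }
  return failures;
}

console.log(checkExamples(configs, stubConvert).length === 0 ? "all examples match" : "mismatch");
```

Collecting mismatches rather than throwing on the first one is a design choice that makes it easier to report every stale example in one run.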

Also: clarify Traditional Chinese as OpenCC Standard in 2t and t2 configs

Update all 2t and t2 config files, except jp2t and t2jp, to specify 'OpenCC標準繁體' (OpenCC Standard Traditional Chinese) instead of just '繁體' (Traditional Chinese). This clarifies that it represents an intermediate standard format, not a final user-facing variant.

Modified configs:

  • s2t.json: 簡體到繁體 → 簡體到OpenCC標準繁體
  • t2s.json: 繁體到簡體 → OpenCC標準繁體到簡體
  • hk2t.json: 香港繁體到繁體 → 香港繁體到OpenCC標準繁體
  • t2hk.json: 繁體到香港繁體 → OpenCC標準繁體到香港繁體
  • t2tw.json: 繁體到台灣正體 → OpenCC標準繁體到台灣正體
  • tw2t.json: 台灣正體到繁體 → 台灣正體到OpenCC標準繁體

Note: The 't' in jp2t/t2jp refers to Japanese Kyūjitai (舊字體), not Chinese OpenCC Standard Traditional, so these should not be labeled as 'OpenCC標準繁體'.

Copilot AI left a comment

Pull request overview

This PR enhances the OpenCC configuration files by adding Traditional Chinese descriptions and conversion examples to all 14 config files. It also clarifies terminology by specifying "OpenCC Standard Traditional Chinese" for intermediate 2t and t2 conversions (except Japanese variants).

Key Changes

  • Adds description_zh field with Traditional Chinese descriptions to all config files
  • Adds examples field with input/output conversion examples demonstrating variant differences
  • Adds test file config-examples.test.js to validate conversion examples
  • Updates config names to clarify "OpenCC Standard Traditional Chinese" terminology

Reviewed changes

Copilot reviewed 18 out of 35 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| wasm-lib/test/config-examples.test.js | New test file to validate conversion examples from config files |
| wasm-lib/test/check-conversions.js | Helper script for manually checking conversions (development tool) |
| wasm-lib/scripts/find-variant-examples.js | Helper script for finding character variants (development tool) |
| wasm-lib/package.json | Updates test script to include new config-examples test |
| wasm-lib/data/config/*.json (14 files) | Adds description_zh and examples fields to all config files |
| wasm-lib/dist/data/config/*.json (14 files) | Mirrored changes to dist config files |
| wasm-lib/dist/esm/index.js | Updates module path resolution logic |
| wasm-lib/dist/esm/opencc-wasm.js | Minified WASM glue code with updated file reference |
| wasm-lib/dist/cjs/index.cjs | Updates CommonJS module path resolution |

Comment on lines +16 to +20
function getConverter(config) {
  if (!converterCache.has(config)) {
    converterCache.set(config, OpenCC.Converter({ config }));
  }
  return converterCache.get(config);

Copilot AI Jan 1, 2026

The converter is being cached and reused across multiple async tests. Since OpenCC.Converter() returns a promise (as indicated by await convert(example.input) on line 37), there's a potential race condition where the converter might not be properly initialized before being used in a test.

The cache should store the result of awaiting the converter promise, not the promise itself. Consider changing line 18 to:
converterCache.set(config, await OpenCC.Converter({ config }));

And make the function async:
async function getConverter(config) { ... }
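
A minimal sketch of the async variant suggested in this comment is shown below. It does not use the real library: `createConverter` is a hypothetical stand-in for an asynchronous converter factory, so the snippet only illustrates the caching pattern, not how `OpenCC.Converter` actually behaves.

```js
const converterCache = new Map();

// Hypothetical stand-in for an asynchronous converter factory.
// It resolves to an async function that tags the text with the config name.
async function createConverter(config) {
  return async (text) => `[${config}] ${text}`;
}

// Cache the resolved converter, not the pending promise.
async function getConverter(config) {
  if (!converterCache.has(config)) {
    converterCache.set(config, await createConverter(config));
  }
  return converterCache.get(config);
}

async function main() {
  const convert = await getConverter("s2t");
  const again = await getConverter("s2t");
  console.log(convert === again); // the second call reuses the cached converter
  console.log(await convert("鼠标"));
}

main();
```

With this shape, sequential callers share one converter per config. Two callers that both miss the cache before the first `await` settles would each invoke the factory; storing the promise in the cache instead is one way to avoid that.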

// Test each example
config.examples.forEach((example, idx) => {
  test(`[${configFile}] ${config.description_zh || config.name} - example #${idx + 1}: "${example.input}"`, async () => {
    const convert = getConverter(configFile);

Copilot AI Jan 1, 2026

The test is created at the top level (outside of a describe block) by iterating over config files synchronously, but the test callback is async. The getConverter() call should be awaited since it needs to be async to properly await the converter initialization.

Change line 36 to:
const convert = await getConverter(configFile);

frankslin and others added 9 commits January 3, 2026 07:27
* Add WASM demo scaffold and project notes
* Add OpenCC WASM demo with converter UI and test runner
  - Add notes on using the compiled WASM output from front-end JS
* Polish WASM demo UI and paths, run tests, and streamline converter export
* Add wasm-based OpenCC package and update demo to consume it
* Add wasm-based OpenCC package, static demo bundle, and benchmarking page
* Add copyright notice and LICENSE
…eparation

This commit enhances the opencc-wasm library with TypeScript support and
implements a cleaner build architecture with semantic separation between
intermediate build artifacts and publishable distribution.

TypeScript Support:
- Add comprehensive type definitions (index.d.ts) with full JSDoc documentation
- Define interfaces: ConverterOptions, ConverterFunction, OpenCCNamespace, etc.
- Provide complete type safety for better IDE support and developer experience

Build Architecture Redesign (semantic separation):
- build/ - Intermediate WASM artifacts (gitignored, for tests/development)
  * build/opencc-wasm.esm.js - ESM WASM glue
  * build/opencc-wasm.cjs - CJS WASM glue
  * build/opencc-wasm.wasm - WASM binary
- dist/ - Publishable distribution (committed, for npm)
  * dist/esm/ - ESM package entry
  * dist/cjs/ - CJS package entry
  * dist/data/ - OpenCC config and dictionary files

Invariants and Semantics:
- Tests import source (index.js) → loads from build/
- Published package exports dist/ only
- build/ = internal intermediate artifacts
- dist/ = publishable artifacts
- Clear separation ensures tests validate actual build output

Enhanced .gitignore:
- Add build/ to gitignore (intermediate artifacts)
- Add node_modules/, logs, OS-specific files (.DS_Store, Thumbs.db)
- Exclude editor configurations (.vscode/, .idea/)
- Add cache and temporary file exclusions

Two-Stage Build Process:
Stage 1 (build.sh):
  - Compiles C++ to WASM using Emscripten
  - Outputs to build/ directory

Stage 2 (build-api.js):
  - Copies WASM artifacts from build/ to dist/
  - Transforms source paths for production
  - Generates API wrappers for ESM and CJS
  - Copies data files
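
The copy step of Stage 2 could look roughly like the sketch below. The source file names come from the build/ listing above; the destination file names, the location of the `.wasm` binary inside dist/, and the `copyArtifacts` helper itself are assumptions for illustration, and the path transformation and wrapper generation steps are omitted.

```js
const fs = require("fs");
const path = require("path");

// [source in build/, destination in dist/] pairs.
// Destination names are assumptions, not taken from build-api.js.
const ARTIFACTS = [
  ["opencc-wasm.esm.js", path.join("esm", "opencc-wasm.js")],
  ["opencc-wasm.cjs", path.join("cjs", "opencc-wasm.cjs")],
  ["opencc-wasm.wasm", "opencc-wasm.wasm"],
];

// Copies each artifact, creating destination directories as needed.
// Returns the list of files written.
function copyArtifacts(buildDir, distDir) {
  const copied = [];
  for (const [from, to] of ARTIFACTS) {
    const target = path.join(distDir, to);
    fs.mkdirSync(path.dirname(target), { recursive: true });
    fs.copyFileSync(path.join(buildDir, from), target);
    copied.push(target);
  }
  return copied;
}
```

`fs.copyFileSync` throws if a source file is missing, so running this step before Stage 1 has produced build/ fails loudly instead of publishing stale files.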

Package Configuration (package.json):
- Add "types" field pointing to index.d.ts
- Update "main" and "module" to point to API wrappers in dist/
- Add comprehensive "exports" map:
  * "." - Main API (ESM/CJS wrappers)
  * "./wasm" - Direct access to WASM glue for advanced users
  * "./dist/*" - Wildcard for flexible file access
- Include LICENSE and NOTICE in published files

Documentation:
- Add comprehensive README section explaining build architecture
- Document project structure with invariants
- Explain semantic separation between build/ and dist/

Benefits:
- Better TypeScript integration and IDE autocomplete
- Cleaner, more maintainable directory structure
- Tests validate actual build output, not stale dist files
- Clear semantic separation between internal and publishable artifacts
- Professional project setup following modern npm best practices
- Long-term maintainability through clear invariants
## Summary
- add a `//data/config:config_dict_validation_test` to test dictionaries and configs against a `testcases.json` file
- switch all CLI/Python/Node tests to consume `testcases.json` as the single source of truth; drop `.in/.ans` dependencies and adjust Bazel/CMake wiring
- streamline dictionary build outputs (no standalone `TWPhrases{IT,Name,Other}.ocd2`) and align DictionaryTest with the actual generated dict set
- add maintenance helpers (refresh_assets.sh cleanup and fix, rapidjson dep/path for CLI test) and keep wasm assets in sync via `testcases.json`

## Testing
- bazel test //data/dictionary:dictionary_test
- bazel test //test:command_line_converter_test
- bazel test //python/tests:test_opencc
- node/test.js (sync/async/promise) using updated testcases.json
----

* feature: add a new ConfigDictValidationTest.cpp to be executed in bazel
* Changeover to JSON-based testcases and clean dictionary outputs
  - Switch all tests (C++ CLI, Python, Node) to consume `testcases.json` and drop `.in`/`.ans` dependencies; keep filegroup for the JSON.
  - Prune TWPhrases sub-dictionary artifacts and align DictionaryTest to current generated dict set.
  - Add rapidjson dep/path for CLI test, refresh_assets script fixes, and keep Bazel Python toolchain note.
* Normalize CommandLineConvertTest for CRLF comparisons on Windows
* Address review feedback for tests and Bazel-only validation
  - Rename and guard streams in CommandLineConvertTest; ensure input file opens and normalize CRLF.
  - Fix node test promise handling to propagate errors correctly.
  - Mark ConfigDictValidationTest as Bazel-only to skip CMake builds.
…cases.json (#10)

- add refresh_assets.sh to rebuild/copy only config-referenced .ocd2 files and testcases.json
- convert wasm-lib tests to consume the new `{cases:[...]}` JSON format
- update bundled .ocd2 dictionaries and testcases.json fixtures
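
To illustrate what consuming a consolidated `{cases:[...]}` file might involve, here is a small sketch in which one input is checked against several configs. Only the top-level `cases` array is taken from the commit message; the `id`, `input`, and `expected` field names, the `runCases` helper, and the lookup-table converter are assumptions.

```js
// Stand-in converter backed by a lookup table (assumption for illustration).
const stubTables = {
  s2t: { "数据库": "數據庫" },
  s2twp: { "数据库": "資料庫" },
};
const stubConvert = (config, text) => stubTables[config]?.[text] ?? text;

// Hypothetical fixture: one input with an expected output per config.
const testcases = {
  cases: [
    { id: "case_001", input: "数据库", expected: { s2t: "數據庫", s2twp: "資料庫" } },
  ],
};

// Fans each case out across the configs it lists and records pass/fail.
function runCases(data, convert) {
  const results = [];
  for (const testCase of data.cases) {
    for (const [config, expected] of Object.entries(testCase.expected)) {
      const actual = convert(config, testCase.input);
      results.push({ id: testCase.id, config, pass: actual === expected });
    }
  }
  return results;
}

console.log(runCases(testcases, stubConvert));
```

Keying the expected outputs by config name keeps a single input next to all of its variant outputs, which is the main benefit of one shared file over per-config `.in`/`.ans` pairs.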

----

* wasm-lib: refresh assets script and switch tests to consolidated testcases.json
  - add refresh_assets.sh to rebuild/copy only config-referenced .ocd2 files and testcases.json
  - convert wasm-lib tests to consume the new `{cases:[...]}` JSON format
  - update bundled .ocd2 dictionaries and testcases.json fixtures
* Rebuild the wasm-lib and update the documentation
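Add a complete contributing guide covering:
- How to add dictionary entries (stressing that fields are separated by a Tab character)
- How to use the sorting tool to keep dictionaries correctly sorted
- How to install Bazel and run the tests
- How to write test cases (test-driven development workflow)
- Special considerations for Simplified-to-Traditional conversion (multiple configs must be tested)

Written in Taiwan Traditional Chinese.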
@frankslin frankslin force-pushed the claude/add-opencc-config-docs-3GMxx branch from 8291610 to 1ff1888 Compare January 3, 2026 13:37
claude and others added 2 commits January 3, 2026 05:56
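1. Add a document analyzing the algorithm and its theoretical limitations
   - Describe the forward maximum matching segmentation algorithm in detail
   - Analyze the conversion-chain mechanism and the dictionary system
   - Discuss theoretical limitations (one-to-many ambiguity, lack of contextual understanding, maintenance burden)
   - Compare with modern approaches (statistical models, neural networks)

2. Update AGENTS.md
   - Add a "Further Reading" section
   - Link to the technical documentation and the contributing guide

3. Add Claude Code configuration
   - .claude/hooks/session_start.sh - shows project information at session start
   - .claude/skills/opencc-dict-edit.md - dictionary editing skill
   - .claude/skills/opencc-algorithm-explain.md - algorithm explanation skill

These configurations help AI agents better understand the OpenCC project architecture and development workflow.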
@frankslin frankslin force-pushed the claude/add-opencc-config-docs-3GMxx branch 2 times, most recently from a6d37f8 to 7d1f7ed Compare January 3, 2026 22:57
claude and others added 5 commits January 3, 2026 20:43
🚨 BREAKING CHANGE: New distribution layout

The .wasm files have been moved to be co-located with their corresponding
glue code files, fixing loading issues and enabling proper CDN usage.

New layout:
  dist/
    esm/
      opencc-wasm.js
      opencc-wasm.wasm      ← Now here (same directory)
    cjs/
      opencc-wasm.cjs
      opencc-wasm.wasm      ← Now here (same directory)
    opencc-wasm.wasm        ← Kept for legacy compatibility

Features:
- ✅ CDN support: Can now import directly from jsDelivr/unpkg
- ✅ Fixed WASM loading in various bundlers and environments
- ✅ Comprehensive test suite with CDN usage tests
- ✅ Complete documentation (CDN_USAGE.md, TESTING.md, CHANGELOG.md)

Test suite:
- npm test         → Run all tests (core + CDN)
- npm run test:core → Run 56 core functionality tests
- npm run test:cdn  → Run CDN usage tests

All 56 core tests + CDN tests pass successfully.

Usage example:
```js
import OpenCC from "https://cdn.jsdelivr.net/npm/opencc-wasm@0.3.0/dist/esm/index.js";
const converter = OpenCC.Converter({ from: "cn", to: "t" });
const result = await converter("简体中文");
```

Co-authored-by: Claude <claude@anthropic.com>
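- Add a "Project Overview" section at the top stating that this project is a fork of BYVoid/OpenCC
- Describe the two main goals: a WASM implementation and dictionary expansion
- Add a "Background" subsection covering the maintenance status of existing third-party implementations and this project's positioning
- Keep the original README content intact below a divider for reference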
This commit adds significant improvements to opencc-wasm:

**API Enhancements:**
- Add `config` parameter to Converter() as intuitive alternative to `from`/`to`
- Support direct OpenCC config file names (e.g., `{ config: "s2twp" }`)
- Expand CONFIG_MAP to support all conversion types and aliases
- Maintain backward compatibility with `from`/`to` parameters
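
One way the two calling styles could be reconciled is sketched below: a `config` name is used directly, and otherwise a `from`/`to` pair is looked up in a table. The table contents, the `.json` suffix handling, and the `resolveConfigName` helper are assumptions; the PR does not show the real CONFIG_MAP.

```js
// Hypothetical subset of a from/to lookup table.
// The real CONFIG_MAP contents are not shown in this PR.
const CONFIG_MAP = {
  cn: { t: "s2t", twp: "s2twp" },
  t: { cn: "t2s" },
};

// Resolves Converter() options to a config file name.
// `config` wins when present; otherwise fall back to from/to.
function resolveConfigName(options) {
  if (options.config) {
    return options.config.endsWith(".json") ? options.config : `${options.config}.json`;
  }
  const name = CONFIG_MAP[options.from]?.[options.to];
  if (!name) {
    throw new Error(`Unsupported conversion: ${options.from} -> ${options.to}`);
  }
  return `${name}.json`;
}

console.log(resolveConfigName({ config: "s2twp" }));
console.log(resolveConfigName({ from: "cn", to: "t" }));
```

Resolving both styles to the same file name early means the rest of the loader needs only one code path, which is one plausible way to keep `from`/`to` backward compatible.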

**Documentation Improvements:**
- Consolidate all API documentation into comprehensive README.md
- Add Traditional Chinese README (README.zh-TW.md) with Taiwan localization
- Emphasize "zero configuration" and "3-line start" features
- Include practical examples for React, Vue, Node.js, and Web Workers
- Add best practices and FAQ sections
- Create interactive demo (test/demo-out-of-box.html)

**User Experience:**
- Clarify auto-loading of configs and dictionaries from CDN
- Show both API methods side-by-side for user choice
- Provide TypeScript usage examples

All 56 core tests + new config parameter tests passing.
- Add '方程式' to TWPhrasesOther.txt to prevent '程式' -> '程序' misconversion
- Add regression test case in testcases.json

Ref: BYVoid#714
frankslin and others added 14 commits January 4, 2026 10:32
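Fix the "方程式" conversion error and establish a standard fix workflow

This update resolves "方程式" being wrongly converted to "方程序" in Taiwan Traditional to Simplified (tw2sp) mode (BYVoid#714), and uses the fix to establish a complete repair workflow:

- Core dictionary fix: add an explicit mapping for "方程式" to TWPhrasesOther.txt so that it is not mis-segmented or caught by general rules.
- Testing and verification: add regression tests to the test sets (testcases.json) of both the core project and the WASM version, so the fix is verified and cannot silently regress.
- WASM sync: update the WASM version's binary dictionary files so the web version behaves the same as the core.
- Workflow documentation: add a Claude skill file (opencc-fix-translation-workflow) that records the standard procedure from diagnosing the error, editing the dictionary, and testing to syncing the WASM library, for future maintenance.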
Fix character duplication bug where "演算法" becomes "演演算法" when
using s2twp (Simplified to Traditional Taiwan with Phrases) conversion.

## Root Cause

The bug occurs in the TWPhrases conversion step due to character-level
longest prefix matching in Conversion::Convert():

Input: "演算法"
1. Match "演算法" → Not found in TWPhrases
2. Match "演" → Not found, keep "演", advance 1 char
3. Match "算法" → Found! Convert to "演算法"
Result: "演" + "演算法" = "演演算法" ✗

The TWPhrasesIT.txt contained only "算法→演算法" without an identity
mapping for "演算法", causing partial matches.
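
The failure can be reproduced with a small greedy longest-prefix matcher. This is a simplified sketch of the behavior described above, not the code of `Conversion::Convert()`; the `convertWithDict` helper and the in-memory `Map` dictionaries are assumptions for illustration.

```js
// Greedy longest-prefix conversion: at each position, try the longest
// dictionary key first; if nothing matches, copy one character through.
function convertWithDict(text, dict) {
  const chars = Array.from(text); // iterate by code point
  const maxLen = Math.max(0, ...Array.from(dict.keys(), (k) => Array.from(k).length));
  let out = "";
  let i = 0;
  while (i < chars.length) {
    let matched = false;
    for (let len = Math.min(maxLen, chars.length - i); len >= 1; len--) {
      const candidate = chars.slice(i, i + len).join("");
      if (dict.has(candidate)) {
        out += dict.get(candidate);
        i += len;
        matched = true;
        break;
      }
    }
    if (!matched) {
      out += chars[i];
      i += 1;
    }
  }
  return out;
}

// Without an identity entry, "演" passes through and "算法" is expanded.
const withoutIdentity = new Map([["算法", "演算法"]]);
// With the identity entry, "演算法" is consumed as a whole.
const withIdentity = new Map([["演算法", "演算法"], ["算法", "演算法"]]);

console.log(convertWithDict("演算法", withoutIdentity)); // 演演算法
console.log(convertWithDict("演算法", withIdentity)); // 演算法
```

The identity entry changes nothing in the output text itself; its only job is to make the longer key win at position 0 so the shorter key never gets a chance to match inside it.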

## Solution

Two-part fix to prevent duplication while maintaining correct reverse
conversion (tw2sp):

### 1. Dictionary Updates

- Add "演算法→演算法" to TWPhrasesIT.txt (line 261, before "算法→演算法")
  * Ensures "演算法" matches as a whole in TWPhrases conversion
  * Prevents splitting into "演" + "算法"

- Add "演算法→演算法" to STPhrases.txt (line 33598)
  * Defensive measure for segmentation stage
  * Improves conversion performance

### 2. Reverse Dictionary Generation Fix

Modified data/scripts/common.py reverse_items() to prioritize
non-identity mappings:

Before: 演算法 → 演算法 算法  (picks first: 演算法 ✗)
After:  演算法 → 算法 演算法    (picks first: 算法 ✓)

This ensures tw2sp correctly converts "演算法" → "算法" while s2twp
handles "演算法" → "演算法" without duplication.
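
The reordering can be sketched as follows. The actual change lives in the Python script data/scripts/common.py; this JavaScript rendering, the `reverseItems` name, and the array-of-pairs input shape are assumptions used only to show the idea of placing non-identity mappings first.

```js
// Builds a reverse dictionary (value -> list of keys) and orders each list
// so that non-identity keys come before the identity key.
function reverseItems(entries) {
  const reversed = new Map();
  for (const [key, values] of entries) {
    for (const value of values) {
      if (!reversed.has(value)) reversed.set(value, []);
      reversed.get(value).push(key);
    }
  }
  for (const [value, keys] of reversed) {
    // Stable sort: identity mappings (key === value) sink to the end.
    keys.sort((a, b) => Number(a === value) - Number(b === value));
  }
  return reversed;
}

// Forward entries after the dictionary update (identity entry listed first).
const forward = [
  ["演算法", ["演算法"]],
  ["算法", ["演算法"]],
];

const reversed = reverseItems(forward);
console.log(reversed.get("演算法")); // the non-identity key 算法 now comes first
```

Because the converter picks the first candidate, ordering alone is enough to restore the tw2sp result without removing the identity entry that s2twp needs.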

## Test Results

✅ s2twp "演算法" → "演算法" (no duplication)
✅ s2twp "算法" → "演算法" (normal conversion)
✅ s2twp "排序算法很重要" → "排序演算法很重要" (phrase conversion)
✅ tw2sp "演算法" → "算法" (reverse conversion correct)
✅ case_030: s2twp "...算法..." → "...演算法..." (S→T correct)
✅ case_046: tw2sp "...演算法..." → "...算法..." (T→S correct)

## Changes

Dictionary files:
- data/dictionary/TWPhrasesIT.txt: Add identity mapping
- data/dictionary/STPhrases.txt: Add identity mapping

Scripts:
- data/scripts/common.py: Prioritize non-identity in reverse_items()

Tests:
- test/testcases/testcases.json: Add regression tests for issue BYVoid#950
- wasm-lib/test/testcases.json: Sync test cases

Binary dictionaries (rebuilt):
- wasm-lib/data/dict/STPhrases.ocd2
- wasm-lib/data/dict/TWPhrases.ocd2
- wasm-lib/data/dict/TWPhrasesRev.ocd2
- wasm-lib/dist/data/dict/*.ocd2

Other:
- .gitignore: Add build-temp/
- ISSUE_950_ANALYSIS.md: Complete technical analysis and lessons learned

Fixes: BYVoid#950
…0-6cFxr

Fix: resolve issue BYVoid#950 character duplication in s2twp conversion
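Add a Traditional/Simplified conversion mode based on the Table of General Standard Chinese Characters (《通用規範漢字表》, 2013), supporting conversion from various Traditional Chinese standards to the Traditional forms standardized by the Chinese government.

1. **t2cngov.json** - Traditional to government standard (full conversion)
   - Normalize Traditional variant forms: 溼 → 濕
   - Convert Simplified to standard Traditional: 湿 → 濕
   - Partial Traditional-to-Simplified conversion: 淨 → 净

2. **t2cngov_keep_simp.json** - Traditional to government standard (keep Simplified)
   - Preserve Simplified characters used intentionally in the source text
   - Convert only Traditional variant characters

Third-party dictionary source:
- Author: TerryTian-tech
- License: Apache License 2.0
- Reference standard: Table of General Standard Chinese Characters (《通用規範漢字表》, 2013)

Dictionary files:
- TGCharacters.txt (37KB → 45KB ocd2) - about 4000 character mappings
- TGCharacters_keep_simp.txt (13KB → 21KB ocd2) - variant that preserves Simplified characters
- TGPhrases.txt (1.1MB → 911KB ocd2) - about 7000 phrase mappings

- data/CMakeLists.txt: build the cngov dictionaries (flat naming, hierarchical install)
- test/CMakeLists.txt: integrate the test cases

- data/dictionary/cngov/BUILD.bazel: build rules for the cngov dictionaries
- data/config/BUILD.bazel: add cngov_validation_test
- test/testcases/BUILD.bazel: add the cngov_testcases filegroup

- test/CommandLineConvertTest.cpp: add the ConvertCNGovFromJson test function
- test/testcases/cngov_testcases.json: 5 dedicated test cases

- data/config/CNGovValidationTest.cpp: standalone Bazel test
- Test commands:
  * bazel test //data/config:cngov_validation_test
  * bazel test //data/...

- wasm-lib/data/dict/cngov/*.ocd2: compiled dictionaries
- wasm-lib/test/cngov_testcases.json: test cases
- wasm-lib/test/cngov.test.js: Node.js test code
- wasm-lib/scripts/refresh_assets.sh: updated to support subdirectories and cngov

- README.md: add usage notes for CN Government Standard Mode
- wasm-lib/README.md & README.zh.md: add a t2cngov entry to the config table
- data/dictionary/cngov/README.txt: dictionary source and copyright notice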

```bash
echo "盫" | opencc -c t2cngov.json              # → 盦
echo "简体混杂繁體" | opencc -c t2cngov.json    # → 簡體混雜繁體
echo "潮溼的露臺" | opencc -c t2cngov.json      # → 潮濕的露臺
echo "一乾二淨" | opencc -c t2cngov.json        # → 一乾二净
```

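- Subdirectory isolation: third-party dictionaries live in data/dictionary/cngov/
- Independent tests: avoid merge conflicts with the upstream testcases.json
- Dual build systems: support both CMake and Bazel
- Complete metadata: the JSON configs include author, license, and contributor information
- Dictionary compression: the ocd2 format reduces size by 70-80%

Based on the research of TerryTian-tech; integrated in compliance with the Apache License 2.0.
Contributors: TerryTian-tech, Yi Jianpeng, Hu Xinmei, Duan Yatong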
From BYVoid#992

This mapping represents Japanese glyph normalization only. It MUST NOT affect any default Chinese Simplified/Traditional conversion paths.

Co-authored-by: SteveLz <stevel2520@gmail.com>
Ensures that the build is always run before publishing to npm,
preventing the publication of stale build artifacts.
- Add description_zh field with Traditional Chinese descriptions to all 14 config files
- Add examples field with typical conversion examples for each config
- Add config-examples.test.js to validate conversion examples
- Examples demonstrate differences between variants (e.g., s2tw vs s2twp)

Key updates:
- s2t, t2s, hk2t, t2hk, t2tw, tw2t: Use 'OpenCC標準繁體' to clarify this is an intermediate standard format
- jp2t, t2jp: Use Japanese-specific terms (財団法人, 桜花駅, 渋谷区, 図書館) to demonstrate Shinjitai ↔ Kyūjitai conversion
- package.json: Update test script to run config-examples.test.js
- dist/data/config/: Generated from data/config/ via npm run build

Helper scripts (test infrastructure):
- scripts/check-conversions.js: Verify conversion examples
- scripts/find-variant-examples.js: Find variant examples

All 101 tests passing.
@frankslin frankslin force-pushed the claude/add-opencc-config-docs-3GMxx branch from 7d1f7ed to fda9011 Compare January 6, 2026 08:10
@frankslin frankslin force-pushed the master branch 10 times, most recently from 1f6363e to 9c6855e Compare January 14, 2026 07:11
3 participants