TiSpell: A Semi-Masked Methodology for Tibetan Spelling Correction

📄 Paper: TiSpell: A Semi-Masked Methodology for Tibetan Spelling Correction covering Multi-Level Error with Data Augmentation

🧑‍💻 Authors: Yutong Liu, Xiao Feng, Ziyue Zhang, Yongbin Yu*, Cheng Huang, Fan Gao, Xiangxiang Wang*, Ban Ma-bao, Manping Fan, Thupten Tsering, Gadeng Luosang, Renzeng Duojie, Nyima Tashi

📦 Repository: Lyric0620/TiSpell, the official implementation of TiSpell.

🧠 Overview

TiSpell is a Tibetan spelling correction algorithm designed for multi-level orthographic errors. Its semi-masked methodology jointly models character-level, syllable-level, and word-level errors in an end-to-end correction architecture built on pre-trained language models. Integrated data augmentation strategies are used to improve robustness and accuracy on real-world spelling correction tasks.

✨ Features

✅ Handles character-, syllable-, and word-level spelling errors
✅ Combines semi-masked modeling with structural reconstruction
✅ Includes multiple data augmentation techniques (perturbation, phonetic substitution, etc.)
✅ Fully compatible with Hugging Face Transformers for easy integration and customization
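
To make the idea of multi-level noise concrete, here is a minimal sketch of how character-level and syllable-level errors could be injected into clean Tibetan text to build training pairs. It is an illustration only and not the augmentation code shipped in this repository: the function names, the error rates, and the choice of edit operations are assumptions, and phonetic substitution is not shown. The only fact relied on is that Tibetan syllables are delimited by the tsheg mark (U+0F0B).

```python
import random

TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)


def corrupt_character(syllable: str, rng: random.Random) -> str:
    """Apply one random code-point edit: delete, swap neighbours, or duplicate."""
    if len(syllable) < 2:
        return syllable
    i = rng.randrange(len(syllable) - 1)
    op = rng.choice(["delete", "swap", "duplicate"])
    if op == "delete":
        return syllable[:i] + syllable[i + 1:]
    if op == "swap":
        return syllable[:i] + syllable[i + 1] + syllable[i] + syllable[i + 2:]
    return syllable[:i] + syllable[i] + syllable[i:]  # duplicate code point i


def corrupt_sentence(text: str, char_rate: float = 0.1,
                     syl_rate: float = 0.05, seed: int = 0) -> str:
    """Return a noisy copy of `text` (hypothetical helper, rates are placeholders)."""
    rng = random.Random(seed)
    noisy = []
    for syl in text.split(TSHEG):
        if not syl:
            noisy.append(syl)  # keep empty pieces so delimiters are preserved
            continue
        r = rng.random()
        if r < syl_rate:
            continue  # syllable-level error: drop the whole syllable
        if r < syl_rate + char_rate:
            syl = corrupt_character(syl, rng)  # character-level error
        noisy.append(syl)
    return TSHEG.join(noisy)


if __name__ == "__main__":
    clean = "བཀྲ་ཤིས་བདེ་ལེགས།"
    print(corrupt_sentence(clean, char_rate=0.5, syl_rate=0.2, seed=1))
```

Note that the character-level edits operate on Unicode code points, so a stacked Tibetan letter counts as several characters here. Each (noisy, clean) pair could then serve as one training example for a correction model.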

🗂️ Project Structure

TiSpell/
├── dataloader/ # Data loading utilities
├── dataset/ # Preprocessed and raw datasets
├── images/ # Visualizations
├── model/ # Model architecture
├── pretrained_models/ # Checkpoints and pre-trained weights
├── scripts/ # Training and evaluation scripts
├── LICENSE
├── README.md
├── compute_parameter.py # Parameter counting utility
├── data_analysis.py # Exploratory data analysis
├── infer.py # Inference script
├── metrics.py # Evaluation metrics
├── option.py # Argument parsing
├── plot.py # Visualization utilities
├── train.py # Training script
└── requirements.txt # Python dependencies

🚀 Quick Start

🔧 1. Install Dependencies

pip install -r requirements.txt

📁 2. Prepare Dataset

Download the Tibetan News Classification dataset from Hugging Face and place it under the dataset/ directory. Make sure it is organized as follows, with one folder per news category containing numbered .txt files:

TiSpell/
└── dataset/
    └── tibetan_news_classification/
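        ├── 政务类  # government affairs
        ├── 教育类  # education
        ├── 文化类  # culture
        ├── 旅游类  # tourism
        ├── 时政类  # current politics
        ├── 民生类  # people's livelihood
        ├── 法律类  # law
        ├── 科技类  # science and technology
        ├── 经济类  # economy
        └── 艺术类  # art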
            ├── 0.txt
            ├── 1.txt
            ├── 2.txt
            └── ...
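
As a quick sanity check of the layout above, the following sketch walks the category folders and collects the text files. It is not the loader in dataloader/; the function name `load_news_corpus` and the assumption that the files are UTF-8 encoded are ours.

```python
from pathlib import Path


def load_news_corpus(root: str) -> list[tuple[str, str]]:
    """Return (category, text) pairs read from <root>/<category>/*.txt."""
    samples = []
    for category_dir in sorted(Path(root).iterdir()):
        if not category_dir.is_dir():
            continue  # ignore stray files next to the category folders
        for txt_file in sorted(category_dir.glob("*.txt")):
            # Assumption: files are UTF-8 encoded
            text = txt_file.read_text(encoding="utf-8").strip()
            if text:  # skip empty documents
                samples.append((category_dir.name, text))
    return samples
```

Calling `load_news_corpus("dataset/tibetan_news_classification")` from the repository root should return one entry per non-empty document; an empty result usually means the folders are nested one level too deep or too shallow.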

🏋️‍♂️ 3. Train the Model

python train.py

⚙️ Configuration

You can customize training and evaluation parameters in option.py, including:

Learning rate / Batch size
Training epochs
Weight decay
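
For orientation, an argument parser exposing these parameters might look like the sketch below. The flag names and default values are placeholders chosen for illustration and may differ from what option.py actually defines, so check that file before relying on them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser; flags and defaults are hypothetical."""
    parser = argparse.ArgumentParser(
        description="TiSpell training options (illustrative sketch)"
    )
    parser.add_argument("--lr", type=float, default=5e-5,
                        help="learning rate (placeholder default)")
    parser.add_argument("--batch_size", type=int, default=32,
                        help="samples per training batch (placeholder default)")
    parser.add_argument("--epochs", type=int, default=10,
                        help="number of training epochs (placeholder default)")
    parser.add_argument("--weight_decay", type=float, default=0.01,
                        help="weight decay coefficient (placeholder default)")
    return parser
```

With a parser like this, `build_parser().parse_args(["--lr", "1e-4", "--epochs", "3"])` overrides two values and keeps the remaining defaults.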

📌 Citation

If you find TiSpell helpful in your research, please cite our work:

@misc{liu2025tispellsemimaskedmethodologytibetan,
      title={TiSpell: A Semi-Masked Methodology for Tibetan Spelling Correction covering Multi-Level Error with Data Augmentation}, 
      author={Yutong Liu and Feng Xiao and Ziyue Zhang and Yongbin Yu and Cheng Huang and Fan Gao and Xiangxiang Wang and Ma-bao Ban and Manping Fan and Thupten Tsering and Cheng Huang and Gadeng Luosang and Renzeng Duojie and Nyima Tashi},
      year={2025},
      eprint={2505.08037},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.08037}, 
}

📝 License

This project is licensed under the MIT License. See the LICENSE file for more details.
