A Python library for normalizing Kashmiri text written in the Perso-Arabic script. It standardizes text by canonicalizing character variants, enforcing consistent punctuation spacing, and converting digits. It is intended for Natural Language Processing (NLP) pipelines and Machine Learning data preprocessing.
- Character Canonicalization: Maps multiple Unicode variants of Kashmiri characters to a single standard form using extensive character maps.
- Code Standardization: Handles common inconsistencies in Kashmiri typing.
- Punctuation & Spacing: Automatically removes spaces before punctuation marks and ensures a single space follows them.
- Digit Normalization: Converts Kashmiri (Perso-Arabic) digits to standard English (Latin) digits for consistency.
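
The character mapping, punctuation spacing, and digit conversion described above can be illustrated with a minimal sketch. This is not the library's implementation: `CHAR_MAP`, `DIGIT_MAP`, `PUNCT`, and `normalize_sketch` are hypothetical names, and the two character mappings are illustrative assumptions (the library's real maps live in constants.py and are more extensive).

```python
import re

# Hypothetical variant map (assumption, not taken from the library):
# Arabic code points that are often typed in place of the Kashmiri forms.
CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}

# Arabic-Indic (U+0660..U+0669) and Extended Arabic-Indic (U+06F0..U+06F9)
# digits mapped to ASCII digits.
DIGIT_MAP = {
    **{chr(0x0660 + i): str(i) for i in range(10)},
    **{chr(0x06F0 + i): str(i) for i in range(10)},
}

# Perso-Arabic comma, semicolon, question mark, and full stop.
PUNCT = "\u060C\u061B\u061F\u06D4"


def normalize_sketch(text):
    """Apply character mapping, digit conversion, and punctuation spacing."""
    # One translation table handles both character variants and digits.
    text = text.translate(str.maketrans({**CHAR_MAP, **DIGIT_MAP}))
    # Remove whitespace before punctuation marks.
    text = re.sub(rf"\s+([{PUNCT}])", r"\1", text)
    # Ensure exactly one space after each punctuation mark.
    text = re.sub(rf"([{PUNCT}])\s*", r"\1 ", text)
    # Drop the trailing space left after sentence-final punctuation.
    return text.strip()
```

Merging both maps into a single `str.translate` table keeps the character-level work in one pass; the regular expressions are only needed for the spacing rules, which depend on neighbouring characters.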
Ensure you have Python 3.8 or higher installed.
You can install the package directly from GitHub:
pip install git+https://github.com/abdulmuizz0903/KashmiriNormalizer.git

The normalize method is intended for cleaning text data. It performs canonicalization, digit conversion, and punctuation fixing.
from KashmiriNormalizer import KashmiriNormalizer
# Initialize the normalizer
kn = KashmiriNormalizer()
text = "مےٚ چُھ لۄکچارٕ پٮ۪ٹھٕ یہٕ عادت" # Example text
# Normalize the text
cleaned_text = kn.normalize(text)
print(cleaned_text)

The library automatically converts Kashmiri digits to English digits during normalization.
digit_text = "١٢٣٤٥"
print(kn.normalize(digit_text))
# Output will have standardized English digits

The library includes a specialized TTSNormalizer class tailored for Text-to-Speech tasks. This class extends the base normalization set with:
- Preserves Diacritics: Does not remove diacritics, which are crucial for correct pronunciation in Kashmiri.
- Digit Expansion: Converts digits (both Kashmiri and English) into their Kashmiri word forms (e.g., "1" -> "اکھ").
  - Note: Requires populating the WORD_TO_DIGIT_MAP in constants.py.
- Plat Ye Handling: Converts ؠ to ۍ at the end of words to align with standard writing rules.
- Character Filtering: Removes any characters not present in the allowed Kashmiri character set (ALL_CHARACTERS), ensuring clean input for TTS models.
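
The TTS-specific steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the TTSNormalizer source: `DIGIT_WORDS`, `is_allowed`, `expand_digits`, and `tts_sketch` are hypothetical names, the only digit word included is the "1" example given above, and the allowed-character check is a coarse stand-in for ALL_CHARACTERS.

```python
import re

# Assumption: only the "1" entry comes from the README example. The library
# expects WORD_TO_DIGIT_MAP in constants.py to be populated separately.
DIGIT_WORDS = {"1": "اکھ"}

KASHMIRI_YEH = "\u0620"   # ARABIC LETTER KASHMIRI YEH (ؠ)
YEH_WITH_TAIL = "\u06CD"  # ARABIC LETTER YEH WITH TAIL (ۍ)


def is_allowed(ch):
    """Stand-in for ALL_CHARACTERS: space, digits, and the Arabic block."""
    return ch == " " or ch.isdigit() or "\u0600" <= ch <= "\u06FF"


def expand_digits(text):
    """Replace whole numbers with their word form when one is known."""
    def repl(match):
        # \d matches ASCII and Arabic-script digits; int() accepts both,
        # so the lookup key is always the ASCII form of the number.
        key = str(int(match.group()))
        # Numbers without a known word form are left unchanged.
        return DIGIT_WORDS.get(key, match.group())
    return re.sub(r"\d+", repl, text)


def tts_sketch(text):
    # Expand digits first so they are not affected by later filtering.
    text = expand_digits(text)
    # Word end is approximated as "followed by whitespace or end of string".
    text = re.sub(KASHMIRI_YEH + r"(?=\s|$)", YEH_WITH_TAIL, text)
    # Diacritics fall inside the Arabic block, so they pass the filter.
    return "".join(ch for ch in text if is_allowed(ch))
```

The order matters in this sketch: filtering runs last, so characters produced by the earlier steps are kept and anything outside the allowed set is dropped only once.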
from KashmiriNormalizer import TTSNormalizer
# Initialize the TTS normalizer
tts_norm = TTSNormalizer()
text = "مےٚ چُھ 1 لۄکچارٕ پٮ۪ٹھٕ یہٕ عادت۔"
# Normalize for TTS
tts_text = tts_norm.normalize(text)
print(tts_text)
# Output will have diacritics preserved, digits expanded to words, and non-Kashmiri chars removed.

- regex: Used for advanced Unicode string handling.
The project structure is as follows:
KashmiriNormalizer/
├── src/
│   └── KashmiriNormalizer/
│       ├── __init__.py
│       ├── constants.py    # Character maps and regex constants
│       └── normalizer.py   # Main Normalizer class
└── pyproject.toml          # Build configuration and dependencies