This package provides implementations of several algorithms for segmenting the words in a wordlist into morphemes.
If you use this software package, please cite it accordingly:
Rubehn, A. and J.-M. List (2025). MorSeg: A Python package for morpheme segmentation in multi- and monolingual wordlists [Software Library, Version 0.1]. Chair for Multilingual Computational Linguistics, University of Passau.
This package can be conveniently installed using pip:
```
pip install morseg
```
Assuming your data is stored in a TSV file following the LingPy specifications (see /tests/test_data/german.tsv for an example), you can load it as follows:
```python
from morseg.utils.wrappers import WordlistWrapper

wl = WordlistWrapper.from_file(YOUR_FILE)
```

This creates a wordlist wrapper object: a representation of a wordlist with three annotation levels, namely the segmentations predicted by a model, the gold-standard segmentations, and the unsegmented form. Note that training any of the models requires the data to be stored in this class.
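For orientation, a minimal, tab-separated input file could look like the sketch below. The column layout follows common LingPy wordlist conventions; the concrete columns this package expects (including any column for gold-standard segmentations) are best checked against the bundled example file mentioned above:

```
ID	DOCULECT	CONCEPT	FORM	TOKENS
1	German	hand	hant	h a n t
2	German	woman	fʁaʊ	f ʁ a ʊ
```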
The Tokenizer class offers a unified interface for all models implemented in this library. For example, if you want to train an LSV (Letter Successor Variety) model, you can do so like this:
```python
from morseg.algorithms.tokenizer import LSVTokenizer

model = LSVTokenizer()
model.train(wl)
```

The current release covers implementations of the following models:
- `LSVTokenizer`: Letter Successor Variety (Harris, 1955) with the following adaptations:
  - Letter Successor Entropy (Hafer and Weiss, 1974)
  - Letter Max-Drop Variety (Hammarström, 2009)
  - Normalized Letter Successor Variety (Çöltekin, 2010)
- `LPVTokenizer`: Letter Predecessor Variety (analogous to LSV, but processing the words backwards)
- `LSPVTokenizer`: a combination of Letter Successor Variety and Letter Predecessor Variety
- `Morfessor`: the Morfessor Baseline model (Creutz and Lagus, 2002)
- `SquareEntropyTokenizer`: Square Entropy (Méndez-Cruz et al., 2016)
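Since all tokenizers implement the same training interface, you can also train several models side by side. A minimal sketch, assuming `LPVTokenizer` and `LSPVTokenizer` can be imported from the same module as `LSVTokenizer` (the import paths are an assumption here):

```python
from morseg.algorithms.tokenizer import (  # import paths assumed
    LSVTokenizer,
    LPVTokenizer,
    LSPVTokenizer,
)

# train one model per algorithm on the same wordlist wrapper
models = [LSVTokenizer(), LPVTokenizer(), LSPVTokenizer()]
for m in models:
    m.train(wl)
```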
Furthermore, some popular models for subword tokenization are implemented:
- `PairEncoding`: Byte-Pair Encoding (Sennrich et al., 2016)
- `WordPiece` (Schuster and Nakajima, 2012)
- `UnigramSentencePiece` (Kudo, 2018)
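These should be trainable through the same unified interface; again a hedged sketch, assuming the same module path and a no-argument constructor (real use may require hyperparameters such as a vocabulary size):

```python
from morseg.algorithms.tokenizer import PairEncoding  # import path assumed

# train a Byte-Pair Encoding model on the wordlist wrapper
bpe = PairEncoding()
bpe.train(wl)
```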
You can obtain the predicted segmentations from your training data by calling:
```python
for segmented_word in model.get_segmentations():
    print(segmented_word)  # or process each segmented word as needed
```

You can also try segmenting unseen words (depending on the model, this might work more or less well):
word = ["w", "o", "r", "d"]
segmented_word = model(word)
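As a purely illustrative sketch, the word is passed as a list of tokens, just like above; the segmentation hinted at in the comment is hypothetical and depends entirely on the trained model and its training data:

```python
# segment an unseen word, passed as a list of tokens
word = ["h", "a", "n", "d", "s", "c", "h", "u", "h"]
segmented = model(word)
print(segmented)  # a hypothetical model might place a boundary: hand + schuh
```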