
Morpheme Segmentation in Multi- and Monolingual Wordlists

This package provides implementations of several algorithms for segmenting the words in a wordlist into morphemes.

If you use this software package, please cite it accordingly:

Rubehn, A. and J.-M. List (2025). MorSeg: A Python package for morpheme segmentation in multi- and monolingual wordlists [Software Library, Version 0.1]. Chair for Multilingual Computational Linguistics, University of Passau.

Installation

This package can be conveniently installed using pip:

pip install morseg

Basic Usage

Loading data

Assuming your data is presented in a TSV file following the LingPy specifications (see /tests/test_data/german.tsv for an example), you can simply load your data with:

from morseg.utils.wrappers import WordlistWrapper

wl = WordlistWrapper.from_file(YOUR_FILE)

This creates a wordlist wrapper object, a representation of a wordlist carrying three annotation levels: the segmentations predicted by a model, the gold-standard segmentations, and the unsegmented forms. All models require their training data to be stored in this class.
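For orientation, LingPy-style wordlists are tab-separated files with a header row, a numeric ID column, and one word form per row, with sound segments separated by spaces. The following fragment is purely illustrative (consult /tests/test_data/german.tsv for the exact columns the package expects):

```tsv
ID	DOCULECT	CONCEPT	TOKENS
1	German	hand	h a n t
2	German	mountain	b ɛ r k
```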

Training a model

The Tokenizer class offers a unified interface to all models implemented in this library. For example, to train an LSV (Letter Successor Variety) model:

from morseg.algorithms.tokenizer import LSVTokenizer

model = LSVTokenizer()
model.train(wl)
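For intuition, the LSV idea goes back to Harris's notion of successor variety: at each position in a word, count how many distinct letters can follow the corresponding prefix anywhere in the corpus; a sharp peak in that count suggests a morpheme boundary. A minimal self-contained sketch of this idea (an illustration only, not morseg's actual implementation) is:

```python
def successor_varieties(word, corpus):
    """For each boundary position in `word`, count the distinct letters
    that can follow the corresponding prefix anywhere in the corpus."""
    varieties = []
    for i in range(1, len(word)):
        prefix = word[:i]
        successors = {w[i] for w in corpus if len(w) > i and w[:i] == prefix}
        varieties.append(len(successors))
    return varieties

corpus = ["walks", "walked", "walking", "wants", "wanted"]
# The peak after "walk" (three possible successors: s, e, i) hints at a
# morpheme boundary between the stem and the suffix.
print(successor_varieties("walked", corpus))  # [1, 2, 1, 3, 1]
```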

The current release covers implementations of several morpheme segmentation models; in addition, some popular models for subword tokenization are included.

Obtain segmentations

You can obtain the predicted segmentations from your training data by calling:

for segmented_word in model.get_segmentations():
    print(segmented_word)  # or process the segmented word further

You can also segment unseen words, although how well this works depends on the model:

word = ["w", "o", "r", "d"]
segmented_word = model(word)
