Converts between the KorAP-XML ZIP format and formats such as CoNLL-U, Krill, word2vec, and NOW, and annotates KorAP-XML ZIPs with various taggers and parsers.
Drop-in replacement for korapxml2conllu, conllu2korapxml (see KorAP-XML-CoNLL-U) and korapxml2krill (see KorAP-XML-Krill), but still experimental. This applies in particular to the command-line options, which might not yet be fully tested in all combinations.
This tool is designed to be run on a Unix-like system (e.g., Linux, macOS).
- Java 21 or higher is required.
In addition to Java 21, the following are required for the wrapper scripts:
- A Unix-like environment with a `bash` shell.
- The `stat` command-line utility.
- For automatic memory detection, the system needs to support cgroups, specifically `/sys/fs/cgroup/memory`.
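For reference, cgroup-based memory detection boils down to reading the limit file when it exists. The following is an illustrative sketch only (not the wrapper's actual code), with a hypothetical 4 GiB fallback:

```shell
# Illustrative only: read the cgroup v1 memory limit if the file is
# readable, otherwise fall back to a default value.
detect_memory_limit() {
  cgroup_file="${1:-/sys/fs/cgroup/memory/memory.limit_in_bytes}"
  if [ -r "$cgroup_file" ]; then
    cat "$cgroup_file"
  else
    echo 4294967296   # fallback: assume 4 GiB
  fi
}
```

On systems without `/sys/fs/cgroup/memory`, the wrapper scripts cannot detect the limit automatically, which is why cgroup support is listed as a requirement.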
./gradlew build

After building, a fat JAR will be available at `./app/build/libs/korapxmltool.jar`. In addition, the executable `korapxmltool`, as well as the symbolic-link shortcuts `korapxml2conllu` and `korapxml2krill`, will be available at `./build/bin/`.
Key options for korapxmltool (>= v3.1):
- `-t FORMAT`, `--to FORMAT`: Output format (`zip`, `conllu`, `w2v`, `now`, `krill`)
- `-j N`, `--jobs N`, `--threads N`: Number of threads/jobs to use
- `-T TAGGER[:MODEL]`, `--tag-with TAGGER[:MODEL]`: POS tagger and optional model
- `-P PARSER[:MODEL]`, `--parse-with PARSER[:MODEL]`: Parser and optional model
- `-f`, `--force`: Overwrite existing output files
- `-q`, `--quiet`: Suppress progress output
- `-D DIR`, `--output-dir DIR`: Output directory
- `-L DIR`, `--log-dir DIR`: Log directory
- `--lemma`: Use lemmas instead of surface forms (when available)
- `--lemma-only`: Skip loading base tokens, output only lemmas
Conversion to CoNLL-U format
$ ./build/bin/korapxmltool app/src/test/resources/wdf19.zip | head -10
# foundry = base
# filename = WDF19/A0000/13072/base/tokens.xml
# text_id = WDF19_A0000.13072
# start_offsets = 0 0 14 17 25 30 35 42 44 52 60 73
# end_offsets = 74 12 16 24 29 34 41 43 51 59 72 74
1 Australasien _ _ _ _ _ _ _ _
2 on _ _ _ _ _ _ _ _
3 devrait _ _ _ _ _ _ _ _
4 peut _ _ _ _ _ _ _ _
5 être _ _ _ _ _ _ _ _
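The `start_offsets`/`end_offsets` header lines are character offsets into the primary text (0-based, end-exclusive); the leading pair appears to span the whole sentence, followed by one pair per token. The following sketch shows how such offsets map back to surface forms; the text variable is a hypothetical reconstruction of the first few tokens, not taken from the corpus file:

```shell
# Hypothetical reconstruction of the first few tokens; offsets are 0-based
# and end-exclusive, matching the header lines above.
text="Australasien  on devrait peut être"
extract() { # extract START END  ->  surface form at [START, END)
  echo "$text" | cut -c"$(($1 + 1))-$2"   # cut -c is 1-indexed, inclusive
}
extract 0 12    # -> Australasien
extract 14 16   # -> on
extract 17 24   # -> devrait
```

Note that these are character offsets, so byte-oriented tools need care with multi-byte characters such as `être`.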
$ ./build/bin/korapxmltool -t w2v app/src/test/resources/wdf19.zip
Arts visuels Pourquoi toujours vouloir séparer BD et Manga ?
Ffx 18:20 fév 25 , 2003 ( CET ) soit on ne sépara pas , soit alors on distingue aussi , le comics , le manwa , le manga ..
la bd belge et touts les auteurs européens ..
on commence aussi a parlé de la bd africaine et donc ...
wikipedia ce prete parfaitement à ce genre de decryptage .
$ ./build/bin/korapxmltool -m '<textSigle>([^<]+)' -m '<creatDate>([^<]+)' -t w2v app/src/test/resources/wdf19.zip
WDF19/A0000.10894 2014.08.28 Arts visuels Pourquoi toujours vouloir séparer BD et Manga ?
WDF19/A0000.10894 2014.08.28 Ffx 18:20 fév 25 , 2003 ( CET ) soit on ne sépara pas , soit alors on distingue aussi , le comics , le manwa , le manga ..
WDF19/A0000.10894 2014.08.28 la bd belge et touts les auteurs européens ..
WDF19/A0000.10894 2014.08.28 on commence aussi a parlé de la bd africaine et donc ...
WDF19/A0000.10894 2014.08.28 wikipedia ce prete parfaitement à ce genre de decryptage .
The NOW export puts one text per line, with `<p>` as the sentence delimiter.
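Given that layout, such a line can be split back into one sentence per line with standard tools. A sketch, assuming the delimiter appears as a literal `<p>` token surrounded by single spaces:

```shell
# Assumption: sentences within a line are joined by a literal " <p> ".
split_sentences() {
  awk '{ gsub(/ <p> /, "\n"); print }'
}
printf 'Premier point . <p> Deuxième point .\n' | split_sentences
```

This prints the two sentences on separate lines, which is convenient when feeding NOW-style output into sentence-based downstream tools.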
./build/bin/korapxmltool -t now /vol/corpora/DeReKo/current/KorAP/zip/*24.zip | pv > dach24.txt

If lemma annotations (morpho layer) are present alongside the base tokens, you can output lemmas instead of surface tokens with `--lemma`.
# Word2Vec style output with lemmas where available
./build/bin/korapxmltool --lemma -t w2v app/src/test/resources/goe.tree_tagger.zip | head -3
# NOW corpus style output with lemmas
./build/bin/korapxmltool --lemma -t now app/src/test/resources/goe.tree_tagger.zip | head -1

If the lemma for a token is missing (`_`), the surface form is used as a fallback.
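The fallback rule can be sketched as a small helper (illustrative only, not the tool's code; the function name is made up):

```shell
# Sketch of the documented rule: use the lemma unless it is the CoNLL-U
# placeholder "_", in which case fall back to the surface form.
lemma_or_form() { # lemma_or_form LEMMA SURFACE_FORM
  if [ "$1" = "_" ]; then echo "$2"; else echo "$1"; fi
}
lemma_or_form gehen ging   # -> gehen
lemma_or_form _ ging       # -> ging
```

This guarantees that `--lemma` output never contains bare `_` placeholders where the annotation layer has gaps.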
- `--lemma-only`: For `-t w2v` and `-t now`, skip loading `data.xml` and output only lemmas from `morpho.xml`. This reduces memory use and increases throughput.
- `--sequential`: Process entries inside each zip sequentially (zips can still run in parallel). Recommended for `w2v`/`now` to keep locality and lower memory use.
- `--exclude-zip-glob GLOB` (repeatable): Skip zip basenames that match the glob (e.g., `--exclude-zip-glob 'w?d24.tree_tagger.zip'`).
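The glob is interpreted with shell-style pattern matching against zip basenames, which can be reproduced with a `case` statement. A sketch (the helper name is illustrative):

```shell
# Shell-style pattern matching as used for --exclude-zip-glob: the glob is
# matched against the zip's basename, not its full path.
matches_glob() { # matches_glob BASENAME GLOB
  case "$1" in
    $2) return 0 ;;
    *)  return 1 ;;
  esac
}
matches_glob 'wud24.tree_tagger.zip' 'w?d24.tree_tagger.zip' && echo excluded   # -> excluded
```

Here `?` matches exactly one character, so `w?d24.tree_tagger.zip` covers e.g. `wud24.tree_tagger.zip` and `wpd24.tree_tagger.zip` but not `goe.zip`.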
Example for large NOW export with progress and exclusions:
KORAPXMLTOOL_XMX=64g KORAPXMLTOOL_MODELS_PATH=/data/models KORAPXMLTOOL_JAVA_OPTS="-XX:+UseG1GC -Djdk.util.zip.disableMemoryMapping=true -Djdk.util.zip.reuseInflater=true" \
./build/bin/korapxmltool -l info -j 100 \
--lemma-only --sequential -t now \
--exclude-zip-glob 'w?d24.tree_tagger.zip' \
/vol/corpora/DeReKo/current/KorAP/zip/*24.tree_tagger.zip | pv > dach2024.lemma.txt
At INFO level the tool logs:
- the zip processing order with file sizes (largest first in `--lemma-only` mode);
- for each zip, a start message including its size, and a completion line with cumulative progress, ETA, and average MB/s.
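The throughput and ETA figures in those lines come down to simple arithmetic over bytes processed and elapsed time. An illustrative sketch (not the tool's actual code):

```shell
# Illustrative arithmetic for a progress line: average MB/s so far and
# the estimated remaining time, given cumulative byte counts.
progress_stats() { # progress_stats DONE_BYTES TOTAL_BYTES ELAPSED_SECONDS
  LC_ALL=C awk -v d="$1" -v t="$2" -v e="$3" 'BEGIN {
    mbps = d / e / 1048576        # average throughput so far, in MB/s
    eta  = (t - d) / (d / e)      # remaining bytes divided by bytes/second
    printf "%.1f MB/s, ETA %ds\n", mbps, eta
  }'
}
progress_stats 1073741824 4294967296 100   # 1 GiB of 4 GiB done in 100 s
```

With 1 GiB of 4 GiB processed in 100 s this prints `10.2 MB/s, ETA 300s`; processing the largest zips first makes such estimates stabilize earlier in the run.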
Generate a tar archive containing gzipped Krill/KoralQuery JSON files across all provided foundries.
./build/bin/korapxmltool -t krill -D out/krill \
app/src/test/resources/wud24_sample.zip \
app/src/test/resources/wud24_sample.spacy.zip \
app/src/test/resources/wud24_sample.marmot-malt.zip

This writes `out/krill/wud24_sample.krill.tar` plus a log file. Add more annotated KorAP-XML zips (e.g., TreeTagger, CoreNLP) to merge their layers into the same Krill export; use `--non-word-tokens` if punctuation should stay in the token stream.
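The export layout (gzip-compressed JSON files inside a plain tar archive) can be inspected or reproduced with standard tools. A minimal sketch with a made-up document, useful e.g. for post-processing the export:

```shell
# Create a tiny archive with the same layout, then list and extract it.
workdir=$(mktemp -d)
printf '{"textSigle":"WUD24/A0000.00001"}\n' > "$workdir/doc1.json"
gzip "$workdir/doc1.json"                        # doc1.json -> doc1.json.gz
tar -C "$workdir" -cf "$workdir/sample.krill.tar" doc1.json.gz
tar -tf "$workdir/sample.krill.tar"              # -> doc1.json.gz
tar -xOf "$workdir/sample.krill.tar" doc1.json.gz | gunzip -c
```

The same `tar -tf` / `tar -xOf … | gunzip -c` pattern works on a real `*.krill.tar` to list members or stream individual KoralQuery JSON documents.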
You need to download the pre-trained MarMoT models from the MarMoT models repository.
You can specify the full path to the model, or set the KORAPXMLTOOL_MODELS_PATH environment variable to specify a default search directory:
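The lookup order can be sketched as a small shell function (hypothetical, mirroring the documented behavior rather than the tool's actual code):

```shell
# Hypothetical sketch of the documented lookup order: use the model path
# as given if it exists, otherwise try it relative to
# $KORAPXMLTOOL_MODELS_PATH; fail if neither exists.
resolve_model() { # resolve_model MODEL
  if [ -f "$1" ]; then
    echo "$1"
  elif [ -n "$KORAPXMLTOOL_MODELS_PATH" ] && [ -f "$KORAPXMLTOOL_MODELS_PATH/$1" ]; then
    echo "$KORAPXMLTOOL_MODELS_PATH/$1"
  else
    echo "model not found: $1" >&2
    return 1
  fi
}
```

For example, with `KORAPXMLTOOL_MODELS_PATH=/data/models`, `resolve_model de.marmot` would fall through to `/data/models/de.marmot` when `de.marmot` is not present locally.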
# With full path
./build/bin/korapxmltool -t zip -T marmot:models/de.marmot app/src/test/resources/goe.zip
# With KORAPXMLTOOL_MODELS_PATH (searches in /data/models/ if model not found locally)
export KORAPXMLTOOL_MODELS_PATH=/data/models
./build/bin/korapxmltool -t zip -T marmot:de.marmot app/src/test/resources/goe.zip
# Without setting KORAPXMLTOOL_MODELS_PATH (searches current directory only)
./build/bin/korapxmltool -t zip -T marmot:models/de.marmot app/src/test/resources/goe.zip

You need to download the pre-trained OpenNLP models from the OpenNLP model download page or older models from the legacy OpenNLP models archive.
./build/bin/korapxmltool -t zip -T opennlp:/usr/local/kl/korap/Ingestion/lib/models/opennlp/de-pos-maxent.bin /tmp/zca24.zip

(Requires Docker)
./build/bin/korapxmltool -T treetagger:german -t zip app/src/test/resources/wdf19.zip

See TreeTagger Docker Image with CoNLL-U Support.
(Requires Docker)
./build/bin/korapxmltool -j 1 -T spacy ./app/src/test/resources/goe.zip | less
./build/bin/korapxmltool -P spacy -t zip ./app/src/test/resources/goe.zip

Download the Stanford CoreNLP v3.X POS tagger and constituency parser models (e.g., german-fast.tagger and germanSR.ser.gz) into libs/.
./build/bin/korapxmltool -t zip -D out \
-T corenlp:libs/german-fast.tagger \
-P corenlp:libs/germanSR.ser.gz \
app/src/test/resources/wud24_sample.zip

The resulting `out/wud24_sample.corenlp.zip` contains `corenlp/morpho.xml` and `corenlp/constituency.xml` alongside the base tokens.
You need to download the pre-trained MaltParser models from the MaltParser model repository. Note that parsers require POS-tagged input.
./build/bin/korapxmltool -t zip -j2 -P malt:german.mco goe.tree_tagger.zip
./build/bin/korapxmltool -t zip -T marmot:models/de.marmot -P malt:german.mco goe.zip

Author:
Copyright (c) 2024-2025, Leibniz Institute for the German Language, Mannheim, Germany
This package is developed as part of the KorAP Corpus Analysis Platform at the Leibniz Institute for the German Language (IDS).
It is published under the GNU General Public License, Version 3, 29 June 2007.
Contributions are very welcome!
Your contributions should ideally be submitted via our Gerrit server to facilitate reviewing (see Gerrit Code Review - A Quick Introduction if you are not familiar with Gerrit). However, we are also happy to accept comments and pull requests via GitHub.