maalfrid_toolkit is a Python package for crawling and extracting natural language data from documents found on the web (HTML, PDF, DOC). It is primarily used in the Målfrid project, a collaboration between the National Library of Norway and the Language Council of Norway that measures the use of the two official written forms of Norwegian, Bokmål and Nynorsk, on Norwegian public sector websites. While the toolkit places particular emphasis on the Nordic countries, it supports extraction and language detection for more than 60 languages. maalfrid_toolkit is also used to produce the yearly Målfrid dataset (freely available documents from Norwegian state institutions).
It builds upon:
- wget and (custom) browsertrix for crawling
- JusText for HTML boilerplate removal
- Notram PDF text extraction from NB AI-lab
- DOC extraction using docx2txt and antiword
- Gielladetect/pytextcat and GlotLID V3 for language detection
- Simhash for near-duplicate detection
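Near-duplicate detection with simhash works by hashing each token, summing the bits into a weight vector, and comparing fingerprints by Hamming distance. A minimal self-contained sketch of the technique (not the toolkit's or the Simhash library's actual code):

```python
import hashlib
import re

def simhash(text: str, bits: int = 64) -> int:
    """Compute a simhash fingerprint from word-level features."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        # Hash each token to a 64-bit integer.
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Majority vote per bit position yields the fingerprint.
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

d1 = simhash("The National Library of Norway collects documents")
d2 = simhash("The National Library of Norway collects many documents")
d3 = simhash("A completely unrelated sentence about something else")
# Near-duplicates land close in Hamming distance; unrelated texts far apart.
print(hamming_distance(d1, d2), hamming_distance(d1, d3))
```

Fingerprints of near-duplicate pages differ in only a few bit positions, so duplicates can be found with a distance threshold instead of full text comparison.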
Install with pip:

```
pip install maalfrid_toolkit
```

With GlotLID / fastText (optional, see below for caveats):
```
pip install maalfrid_toolkit[glotlid]
```

From a source checkout, install with pdm:

```
pdm install
```

Extract text from an HTML page, a PDF, or a DOC file:

```
python -m maalfrid_toolkit.pipeline --url https://www.nb.no/utstilling/opplyst-glimt-fra-en-kulturhistorie/ --to_jsonl
python -m maalfrid_toolkit.pipeline --url https://www.nb.no/sbfil/dok/nst_taledat_dk.pdf --to_jsonl
python -m maalfrid_toolkit.pipeline --url https://www.nb.no/content/uploads/2018/11/Søknadsskjema-Bokhylla-2.doc --to_jsonl
```

Process a WARC file, or crawl a sitemap:

```
python -m maalfrid_toolkit.pipeline --warc_file example_com-00000.warc.gz --calculate_simhash --to_jsonl > warc.jsonl
python -m maalfrid_toolkit.pipeline --url https://example.com/sitemap.xml --crawl_sitemap --to_jsonl > example.jsonl
```

Options:

- mode: Choose between 'precision' (default) and 'recall'. Recall yields more language content, but likely at the cost of more noise.
- use_lenient_html_parser: Use a lenient HTML parser to fix broken HTML (more expensive).
- extract_metadata: Extract metadata from the document and try to infer the document's publication date.
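The `--crawl_sitemap` mode above consumes a standard sitemap.xml. Conceptually, pulling the URLs out of a sitemap looks like this (a stdlib sketch, not the toolkit's own implementation):

```python
import xml.etree.ElementTree as ET

# Sitemap elements live in this XML namespace.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text: str) -> list:
    """Extract all <loc> URLs from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(NS + "loc")]

example = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page1</loc></url>
  <url><loc>https://example.com/page2</loc></url>
</urlset>"""

print(sitemap_urls(example))
# ['https://example.com/page1', 'https://example.com/page2']
```

Each extracted URL is then fetched and sent through the same extraction pipeline as a directly supplied `--url`.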
If you want to store and process the data further in a database, set up a Postgres database and enter your credentials in a .env file in the package root directory (see env-example). Be sure to populate the database with the schema and indices found in db/ before running the commands in maalfrid_toolkit.db.
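The .env file is plain KEY=VALUE lines. A minimal reader in stdlib Python (the DB_HOST/DB_PORT keys below are illustrative assumptions, not necessarily the names used in env-example):

```python
import tempfile
from pathlib import Path

def read_env(path: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

# Demo with illustrative keys (assumed names, for the sketch only):
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("# Postgres credentials\nDB_HOST=localhost\nDB_PORT=5432\n")
print(read_env(f.name))  # {'DB_HOST': 'localhost', 'DB_PORT': '5432'}
```

In practice a library such as python-dotenv does the same job, including quoting and variable expansion.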
Some system dependencies are required (Debian/Ubuntu):

```
sudo apt-get install build-essential python3-dev
sudo apt-get install antiword
```

To use Browsertrix for crawling JavaScript-heavy pages and extracting text from HTML, you currently have to clone a custom Browsertrix from:
https://github.com/Sprakbanken/browsertrix-crawler/tree/add-dom-resource
Then build with Docker:
```
docker build -t maalfrid-browsertrix .
```

License: GPL