maalfrid_toolkit is a Python package for crawling and extracting natural language data from documents found on the web (HTML, PDF, DOC). It is primarily used in the Målfrid project, a collaboration between the National Library of Norway and the Language Council of Norway that measures the use of the two official written forms of Norwegian, Bokmål and Nynorsk, on Norwegian public sector websites. While the toolkit places particular emphasis on the Nordic countries, it supports extraction and language detection for more than 60 languages. maalfrid_toolkit is also used to produce the yearly Målfrid dataset (freely available documents from Norwegian state institutions).
It builds upon:
- wget and (custom) browsertrix for crawling
- JusText for HTML boilerplate removal
- Notram PDF text extraction from NB AI-lab
- DOC extraction using docx2txt and antiword
- Gielladetect/pytextcat and GlotLID V3 for language detection
- Simhash for near-duplicate detection
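Near-duplicate detection with simhash works by hashing each token, summing the bits into a weight vector, and comparing fingerprints by Hamming distance. A minimal self-contained sketch of the technique (not the toolkit's or the Simhash library's actual code):

```python
import hashlib
import re

def simhash(text: str, bits: int = 64) -> int:
    """Compute a simhash fingerprint from word-level features."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        # Hash each token to a 64-bit integer.
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Majority vote per bit position yields the fingerprint.
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

d1 = simhash("The National Library of Norway collects documents")
d2 = simhash("The National Library of Norway collects many documents")
d3 = simhash("A completely unrelated sentence about something else")
# Near-duplicates land close in Hamming distance; unrelated texts far apart.
print(hamming_distance(d1, d2), hamming_distance(d1, d3))
```

Fingerprints of near-duplicate pages differ in only a few bit positions, so duplicates can be found with a distance threshold instead of full text comparison.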
Install with pip:

```
pip install maalfrid_toolkit
```

With GlotLID / fastText (optional, see below for caveats):
```
pip install maalfrid_toolkit[glotlid]
```

From a source checkout, install with pdm:

```
pdm install
```

Extract text from an HTML page, a PDF, or a DOC file:

```
python -m maalfrid_toolkit.pipeline --url https://www.nb.no/utstilling/opplyst-glimt-fra-en-kulturhistorie/ --to_jsonl
python -m maalfrid_toolkit.pipeline --url https://www.nb.no/sbfil/dok/nst_taledat_dk.pdf --to_jsonl
python -m maalfrid_toolkit.pipeline --url https://www.nb.no/content/uploads/2018/11/Søknadsskjema-Bokhylla-2.doc --to_jsonl
```

Process a WARC file, or crawl a sitemap:

```
python -m maalfrid_toolkit.pipeline --warc_file example_com-00000.warc.gz --calculate_simhash --to_jsonl > warc.jsonl
python -m maalfrid_toolkit.pipeline --url https://example.com/sitemap.xml --crawl_sitemap --to_jsonl > example.jsonl
```

Options:

- mode: Choose between 'precision' (default) and 'recall'. Recall yields more language content, but likely at the cost of more noise.
- use_lenient_html_parser: Use a lenient HTML parser to fix broken HTML (more expensive).
- extract_metadata: Extract metadata from the document and try to infer the document's publication date.
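The `--crawl_sitemap` mode above consumes a standard sitemap.xml. Conceptually, pulling the URLs out of a sitemap looks like this (a stdlib sketch, not the toolkit's own implementation):

```python
import xml.etree.ElementTree as ET

# Sitemap elements live in this XML namespace.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text: str) -> list:
    """Extract all <loc> URLs from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(NS + "loc")]

example = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page1</loc></url>
  <url><loc>https://example.com/page2</loc></url>
</urlset>"""

print(sitemap_urls(example))
# ['https://example.com/page1', 'https://example.com/page2']
```

Each extracted URL is then fetched and sent through the same extraction pipeline as a directly supplied `--url`.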
If you want to store and process the data further in a database, set up a Postgres database and enter your credentials in a .env file in the package root directory (see env-example). Be sure to populate the database with the schema and indices found in db/ before running the commands in maalfrid_toolkit.db.
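The .env file is plain KEY=VALUE lines. A minimal reader in stdlib Python (the DB_HOST/DB_PORT keys below are illustrative assumptions, not necessarily the names used in env-example):

```python
import tempfile
from pathlib import Path

def read_env(path: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

# Demo with illustrative keys (assumed names, for the sketch only):
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("# Postgres credentials\nDB_HOST=localhost\nDB_PORT=5432\n")
print(read_env(f.name))  # {'DB_HOST': 'localhost', 'DB_PORT': '5432'}
```

In practice a library such as python-dotenv does the same job, including quoting and variable expansion.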
Some system dependencies are required (Debian/Ubuntu):

```
sudo apt-get install build-essential python3-dev
sudo apt-get install antiword
```

To use Browsertrix for crawling JavaScript-heavy pages and extracting text from HTML, you currently have to clone a custom Browsertrix from:
https://github.com/Sprakbanken/browsertrix-crawler/tree/add-dom-resource
Then build with Docker:
```
docker build -t maalfrid-browsertrix .
```

License: GPL