diff --git a/site/en/models.md b/site/en/models.md
index 98e6f4078..de43c542b 100644
--- a/site/en/models.md
+++ b/site/en/models.md
@@ -12,118 +12,254 @@
 its own internal format(s) for models. Some support central storage of models
 at a specific location (tesseract, ocropy, kraken) while others require the
 full path to a model (calamari).
 
-Likewise, model distribution is not currently centralised within OCR-D though we
+Since [v2.22.0](https://github.com/OCR-D/core/releases/v2.22.0), OCR-D/core
+comes with a framework for managing processor resources uniformly. This means
+that processors can delegate to OCR-D/core to resolve specific file resources by name,
+looking in well-defined places in the filesystem. This also includes downloading and caching
+file parameters passed as a URL. Furthermore, OCR-D/core comes with a bundled database
+of known resources, such as models, dictionaries, configurations and other
+processor-specific data files. This means that OCR-D users should be able to
+concentrate on fine-tuning their OCR workflows and not bother with implementation
+details like "where do I get models from and where do I put them".
+In particular, users can now reference file parameters by name.
+
+All of the above-mentioned functionality can be accessed using the `ocrd
+resmgr` command line tool.
+
+## What models are available?
+
+To get a list of the resources that OCR-D/core [is aware
+of](https://github.com/OCR-D/core/blob/master/ocrd/ocrd/resource_list.yml):
+
+```
+ocrd resmgr list-available
+```
+
+The output will look similar to this:
+
+```
+
+ocrd-calamari-recognize
+- qurator-gt4hist-0.3 (https://qurator-data.de/calamari-models/GT4HistOCR/2019-07-22T15_49+0200/model.tar.xz)
+  Calamari model trained with GT4HistOCR
+- qurator-gt4hist-1.0 (https://qurator-data.de/calamari-models/GT4HistOCR/2019-12-11T11_10+0100/model.tar.xz)
+  Calamari model trained with GT4HistOCR
+
+ocrd-cis-ocropy-recognize
+- LatinHist.pyrnn.gz (https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz)
+  ocropy historical latin model by github.com/chreul
+```
+
+As you can see, resources are grouped by the processors which make use of them.
+
+The word after the list symbol, e.g. `qurator-gt4hist-0.3` or
+`LatinHist.pyrnn.gz`, defines the _name_ of the resource, which is a shorthand you can
+use in parameters without having to specify the full URL (in brackets after the
+name).
+
+The second line of each entry contains a short description of the resource.
+
+## Installing known resources
+
+You can install resources with the `ocrd resmgr download` command. It expects
+the name of the processor as the first argument and either the name or URL of a
+resource as the second argument.
+
+Although model distribution is not currently centralised within OCR-D, we
 are working towards a central model repository.
-In the meantime, this guide will show you, for each OCR engine:
+For example, to install the `LatinHist.pyrnn.gz` resource for `ocrd-cis-ocropy-recognize`:
 
- * which types of models are supported
- * where to store models locally
- * which currently available models we recommend
- * how to invoke the resp.
OCR-D wrapper for the engine with a specific model +``` +ocrd resmgr download ocrd-cis-ocropy-recognize LatinHist.pyrnn.gz +# or +ocrd resmgr download ocrd-cis-ocropy-recognize https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz +``` -## Tesseract / ocrd_tesserocr +This will look up the resource in the [bundled resource and user databases](#user-database), download, +unarchive (where applicable) and store it in the [proper location](#where-is-the-data). -Tesseract models are single files with a `.traineddata` extension. -Tesseract expects models to be in a directory `tessdata` within what Tesseract -calls `TESSDATA_PREFIX`. When installing Tesseract from Ubuntu packages, that -location is `/usr/share/tesseract-ocr/4.00/tessdata`. When building from source -using [ocrd_all](htttps://github.com/OCR-D/ocrd_all), the models are searched -at `/path/to/ocrd_all/venv/share/tessdata`. If you want to override the -locations, you can set the `TESSDATA_PREFIX` environment variable, e.g. if you -want the models location to be `$HOME/tessdata`, you can by adding to your -`$HOME/.bashrc`: `export TESSDATA_PREFIX=$HOME`. 
- -We recommend you download the following models, either by downloading and -saving to the right location or by running `make install-models-tesseract` when -using `ocrd_all`: - - * [equ](https://github.com/tesseract-ocr/tessdata_fast/raw/master/equ.traineddata) - * [osd](https://github.com/tesseract-ocr/tessdata_fast/raw/master/osd.traineddata) - * [eng](https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata) - * [deu](https://github.com/tesseract-ocr/tessdata_fast/raw/master/deu.traineddata) - * [frk](https://github.com/tesseract-ocr/tessdata_fast/raw/master/frk.traineddata) - * [script/Latin](https://github.com/tesseract-ocr/tessdata_fast/raw/master/script/Latin.traineddata) - * [script/Fraktur](https://github.com/tesseract-ocr/tessdata_fast/raw/master/script/Fraktur.traineddata) - * [@stweil's GT4HistOCR model](https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/tessdata_fast/Fraktur_50000000.334_450937.traineddata) - -If you installed Tesseract with Ubuntu's `apt` package manager, you may want to install -standard models like `deu` or `script/Fraktur` with `apt`: +**NOTE:** The special name `*` can be used instead of a resource name/url to +download *all* known resources for this processor. To download all tesseract models: ```sh -sudo apt install tesseract-ocr-deu tesseract-ocr-script-frak +ocrd resmgr download ocrd-tesserocr-recognize '*' ``` -**NOTE:** When installing with `apt`, he `script/*` models are installed -without the `script/` prefix, so `script/Latin` becomes just `Latin`, -`script/Fraktur` becomes `Fraktur` etc. +**NOTE:** Equally, the special processor `*` can be used instead of a processor and a resource +to download *all* known resources for *all* installed processors: -OCR-D's Tesseract wrapper, -[ocrd_tesserocr](https://github.com/OCR-D/ocrd_tesserocr) and more -specifically, the `ocrd-tesserocr-recognize` processor, expects the name of the -model(s) to be provided as the `model` parameter. 
Multiple models can be
-combined by concatenating with `+` (which generally improves accuracy but always slows processing):
+    ocrd resmgr download '*'
+
+(In either case, `*` must be in quotes or escaped to avoid wildcard expansion by the shell.)
+
+## Installing unknown resources
+
+If you need to install a resource which OCR-D doesn't know of, that can be achieved by passing its URL in combination with the `--any-url/-n` flag to `ocrd resmgr download`.
+
+For example, to install a model for `ocrd-tesserocr-recognize` that is located at `https://my-server/mymodel.traineddata`:
+
+```
+ocrd resmgr download -n ocrd-tesserocr-recognize https://my-server/mymodel.traineddata
+```
+
+This will download and store the resource in the [proper location](#where-is-the-data) and create a stub entry in the
+[user database](#user-database). You can then use it as the value for the `model` parameter:
+
+```
+ocrd-tesserocr-recognize -P model mymodel
+```
+
+## List installed resources
+
+The `ocrd resmgr list-installed` command has the same output format as `ocrd resmgr list-available`, but instead
+of the database, it scans the filesystem locations [where data is searched](#where-is-the-data) for existing
+resources and lists the URL and description where a database entry exists.
+
+## User database
+
+Whenever the OCR-D/core resource manager encounters an unknown resource in the filesystem or when you install
+a resource with `ocrd resmgr download`, it will create a new stub entry in the user database, which is found at
+`$HOME/.config/ocrd/resources.yml` and created if it doesn't exist.
+
+This allows you to use the OCR-D/core resource manager mechanics, including
+lookup of known resources by name or URL, without relying (only) on the
+database maintained by the OCR-D/core developers.
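As a quick check, you can inspect the user database directly (a minimal sketch; the path is as documented above, but the file only exists once a stub entry has been created):

```shell
# Show the user database if it already exists (it is created on demand)
db="$HOME/.config/ocrd/resources.yml"
if [ -f "$db" ]; then
    cat "$db"
else
    echo "no user database yet: $db"
fi
```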
+
+**NOTE:** If you produced or found resources that are interesting for the wider
+OCR(-D) community, please tell us in the [OCR-D gitter
+chat](https://gitter.im/OCR-D/Lobby) so we can add them to the database.
+
+## Where is the data
+
+The lookup algorithm is [defined in our specifications](https://ocr-d.de/en/spec/ocrd_tool#file-parameters).
+
+In order of preference, a resource `<name>` for a processor `ocrd-foo` is searched at:
+
+* `$PWD/ocrd-resources/ocrd-foo/<name>`
+* `$XDG_DATA_HOME/ocrd-resources/ocrd-foo/<name>`
+* `/usr/local/share/ocrd-resources/ocrd-foo/<name>`
+
+(where `XDG_DATA_HOME` defaults to `$HOME/.local/share` if unset).
+
+We recommend using the `$XDG_DATA_HOME` location, which is also the default, but
+you can override the location to store data with the `--location` option, which can
+be `cwd`, `data` or `system`, corresponding to the three locations above, respectively.
 
 ```sh
-# Use the deu and frk models
-ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -p '{"model": "deu+frk"}'
-# Use the script/Fraktur model
-ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -p '{"model": "script/Fraktur"}'
+# will download to $PWD/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth
+ocrd resmgr download --location cwd ocrd-anybaseocr-dewarp latest_net_G.pth
+# will download to /usr/local/share/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth
+ocrd resmgr download --location system ocrd-anybaseocr-dewarp latest_net_G.pth
 ```
 
-## Ocropy / ocrd_cis
+## Changing the default resource directory
 
-An Ocropy model is simply the neural network serialized with Python's pickle
-mechanism and is generally distributed in a gzipped form, with a `.pyrnn.gz`
-extension.
+The `$XDG_DATA_HOME` default location is reasonable because
+models are usually large files which should persist across different deployments,
+both native and containerized, both single-module and [ocrd_all](https://github.com/OCR-D/ocrd_all).
+Moreover, that variable can easily be overridden during installation.
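For example (using a hypothetical mount point `/mnt/ocrd-models`; any persistent directory would do), exporting the variable redirects the default `data` location:

```shell
# Hypothetical example: keep all OCR-D resources on a dedicated volume
export XDG_DATA_HOME=/mnt/ocrd-models
# the default resource location for e.g. ocrd-cis-ocropy-recognize then becomes:
echo "$XDG_DATA_HOME/ocrd-resources/ocrd-cis-ocropy-recognize"
```

Subsequent `ocrd resmgr download` calls and processor runs in the same environment will then store and find resources under that directory.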
+
+However, there are use cases where `system` or even `cwd` should be
+used as the location to store resources, hence the `--location` option.
 
-Ocropy has a rather convoluted algorithm to look up models, so we recommend you
-explicitly set the `OCROPUS_DATA` variable to point to the directory with
-ocropy's models. E.g. if you intend to store your models in `$HOME/ocropus-models`, add the following
-to your `$HOME/.bashrc`: `export OCROPUS_DATA=$HOME/ocropus-models`.
-We recommend you download the following models, either by downloading and
-saving to the right location or by running `make install-models-ocropus` when
-using `ocrd_all`:
- * [en-default.pyrnn.gz](https://github.com/zuphilip/ocropy-models/raw/master/en-default.pyrnn.gz)
- * [fraktur.pyrnn.gz](https://github.com/zuphilip/ocropy-models/raw/master/fraktur.pyrnn.gz)
- * [@jze's fraktur.pyrnn.gz](https://github.com/jze/ocropus-model_fraktur/raw/master/fraktur.pyrnn.gz) (save as `fraktur-jze.pyrnn.gz`)
- * [@chreul's LatinHist.pyrnn.gz](https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz)
+## Notes on specific processors
+
+## Ocropy / ocrd_cis
 
-To use a specific model with OCR-D's ocropus wrapper in [ocrd_cis](https://github.com/cisocrgroup/ocrd_cis) and more specifically, the `ocrd-cis-ocropy-recognize` processor, use the `model` parameter:
+An Ocropy model is simply the neural network serialized with Python's pickle
+mechanism and is generally distributed in gzipped form with a `.pyrnn.gz`
+extension. It can be used as such; there is no need to unarchive it.
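The name-based lookup described above (under "Where is the data") can be sketched in plain shell for such a model. This is a simplified illustration of the documented search order, not the actual implementation:

```shell
# Simplified sketch of the documented lookup order for a resource name
resolve_resource() {
    processor="$1"; name="$2"
    for dir in "$PWD/ocrd-resources/$processor" \
               "${XDG_DATA_HOME:-$HOME/.local/share}/ocrd-resources/$processor" \
               "/usr/local/share/ocrd-resources/$processor"; do
        if [ -e "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    return 1
}

# e.g.: resolve_resource ocrd-cis-ocropy-recognize fraktur-jze.pyrnn.gz
```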
+ +To use a specific model with OCR-D's ocropus wrapper in +[ocrd_cis](https://github.com/cisocrgroup/ocrd_cis) and more specifically, the +`ocrd-cis-ocropy-recognize` processor, use the `model` parameter: ```sh -ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-OCRO -p '{"model": "fraktur-jze.pyrnn.gz"}' +# Model will be downloaded on-demand if it is not locally available yet +ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-OCRO -P model fraktur-jze.pyrnn.gz ``` ## Calamari / ocrd_calamari Calamari models are Tensorflow model directories. For distribution, this directory is usually packed to a tarball or ZIP file. Once downloaded, these -containers must be unpacked to a directory again. - -As calamari does not have a model discovery setup, you must always provide the -path with a wildcard listing all `*.ckpt.json` ("checkpoint") files. - -We recommend you download the following model, either by downloading and -unpacking manually or by using `make install-models-calamari` if using -`ocrd_all`: - - * [@mike-gerber's GT4HistOCR model](https://qurator-data.de/calamari-models/GT4HistOCR/2019-12-11T11_10+0100/model.tar.xz) +containers must be unpacked to a directory again. `ocrd resmgr` handles this +for you, so you just need the name of the resource in the database. The Calamari-OCR project also maintains a [repository of models](https://github.com/Calamari-OCR/calamari_models). To use a specific model with OCR-D's calamari wrapper [ocrd_calamari](https://github.com/OCR-D/ocrd_calamari) and more specifically, -the `ocrd-calamari-recognize` processor, use the `checkpoint` parameter: +the `ocrd-calamari-recognize` processor, use the `checkpoint_dir` parameter: + +```sh +# To use the "default" model, i.e. 
the one trained on GT4HistOCR by QURATOR
+ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA
+# To use your own trained model
+ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA -P checkpoint_dir /path/to/modeldir
+# or, to be able to control which checkpoints to use:
+ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA -P checkpoint '/path/to/modeldir/*.ckpt.json'
+```
+
+## Tesseract / ocrd_tesserocr
+
+Tesseract models are single files with a `.traineddata` extension.
+
+Since tesseract only supports model lookup in a single directory, models should
+only be stored in a single location. If the default location (`virtualenv`) is
+not the place you want to use for tesseract models, consider [changing the default location
+in the OCR-D config file](#changing-the-default-resource-directory).
+
+**NOTE:** For reasons of efficiency and to avoid duplicate models, all `ocrd-tesserocr-*` processors
+reuse the resource directory for `ocrd-tesserocr-recognize`.
+
+If the `TESSDATA_PREFIX` environment variable is set when any of the tesseract processors
+are called, it will be used instead of the default location to look for resources.
+
+OCR-D's Tesseract wrapper,
+[ocrd_tesserocr](https://github.com/OCR-D/ocrd_tesserocr) and more
+specifically, the `ocrd-tesserocr-recognize` processor, expects the name of the
+model(s) to be provided as the `model` parameter.
Multiple models can be
+combined by concatenating with `+` (which generally improves accuracy but always slows processing):
+
+```sh
+# Use the deu and frk models
+ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model 'deu+frk'
+# Use the Fraktur model
+ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model Fraktur
+```
+
+# Models and Docker
+
+We recommend keeping all downloaded resources in a persistent host directory,
+separate from the `ocrd/*` Docker container and data directory, and mounting that
+resource directory into a specific path in the container alongside the data directory.
+The host resource directory can be empty initially. Each time you run the Docker container,
+your processors will access the host directory to resolve resources, and you can download
+additional models into that location using `ocrd resmgr`.
+
+The following will assume (without loss of generality) that your host-side data
+path is under `./data`, and the host-side resource path is under `./models`:
+
+- To download models to `./models` in the host FS and `/usr/local/share/ocrd-resources` in Docker:
+
+      docker run --user $(id -u) \
+          --volume $PWD/models:/usr/local/share/ocrd-resources \
+          ocrd/all \
+          ocrd resmgr download ocrd-tesserocr-recognize eng.traineddata\; \
+          ocrd resmgr download ocrd-calamari-recognize default\; \
+          ...
+
+- To run processors, as usual do:
+
+      docker run --user $(id -u) --workdir /data \
+          --volume $PWD/data:/data \
+          --volume $PWD/models:/usr/local/share/ocrd-resources \
+          ocrd/all ocrd-tesserocr-recognize -I IN -O OUT -P model eng
+
+This principle applies to all `ocrd/*` Docker images, e.g. you can replace `ocrd/all` above with `ocrd/tesserocr` as well.
+
 # Model training
 
 With the pretrained models mentioned above, good results can be obtained for
 many originals. Nevertheless, the