From 6782c255a5d7e59a2d8a0334b1eafdf4dcc95f3b Mon Sep 17 00:00:00 2001 From: Konstantin Baierer Date: Mon, 18 Jan 2021 18:52:45 +0100 Subject: [PATCH 01/20] models: explain ocrd resmgr --- site/en/models.md | 247 ++++++++++++++++++++++++++++++++-------------- 1 file changed, 174 insertions(+), 73 deletions(-) diff --git a/site/en/models.md b/site/en/models.md index 98e6f4078..372b425fd 100644 --- a/site/en/models.md +++ b/site/en/models.md @@ -12,116 +12,217 @@ its own internal format(s) for models. Some support central storage of models at a specific location (tesseract, ocropy, kraken) while others require the full path to a model (calamari). +Since [v2.22.0](https://github.com/OCR-D/core/releases/v2.22.0), OCR-D/core +comes with a framework for managing processor resources uniformly. This means +that OCR-D/core will take care of lookin in well-defined places in the +filesystem for resources for specific processors. It also knows how to cache +file parameters passed as a URL. OCR-D/core also comes with a bundled database +of known resources, such as OCR models, configurations and other +processor-specific data. This means that OCR-D users should be able to +concentrate on fine-tuning their OCR workflows and not bother with implementation +details like "where do I get models from and where do I put them". + +All of the above mentioned functionality can be accessed using the `ocrd +resmgr` command line tool. + +## What models are available? 
+ +To get a list of the resources that the OCR-D/core [is aware +of](https://github.com/OCR-D/core/blob/master/ocrd/ocrd/resource_list.yml): + +``` +ocrd resmgr list-available +``` + +The output will look similar to this: + +``` + +ocrd-calamari-recognize +- qurator-gt4hist-0.3 (https://qurator-data.de/calamari-models/GT4HistOCR/2019-07-22T15_49+0200/model.tar.xz) + Calamari model trained with GT4HistOCR +- qurator-gt4hist-1.0 (https://qurator-data.de/calamari-models/GT4HistOCR/2019-12-11T11_10+0100/model.tar.xz) + Calamari model trained with GT4HistOCR + +ocrd-cis-ocropy-recognize +- LatinHist.pyrnn.gz (https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz) + ocropy historical latin model by github.com/chreul +``` + +As you can see, resources are grouped by the processor they are used by. + +The word after the list symbol, e.g. `qurator-gt4hist-0.3`, +`LatinHist.pyrnn.gz`, define the "name" of the resource, a shorthand you can +use in parameters without having to specify the full URL (in brackets after the +name). + +The second line of each entry contains a short description of the resource. + +## Installing known resources + +You can install resources with the `ocrd resmgr download` command. It expects +the name of the processor as the first argument and either the name or URL of a +resource as a second argument. + Likewise, model distribution is not currently centralised within OCR-D though we are working towards a central model repository. -In the meantime, this guide will show you, for each OCR engine: +For example, to install the `LatinHist.pyrnn.gz` resource for `ocrd-cis-ocropy-recognize`: - * which types of models are supported - * where to store models locally - * which currently available models we recommend - * how to invoke the resp. 
OCR-D wrapper for the engine with a specific model +``` +ocrd resmgr download ocrd-cis-ocropy-recognize LatinHist.pyrnn.gz +# or +ocrd resmgr download ocrd-cis-ocropy-recognize https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz +``` -## Tesseract / ocrd_tesserocr +This will look up the resource in the [bundled resource and user databases](#user-database), download, +unarchive (where applicable) and store it in the [proper location](#where-is-the-data). -Tesseract models are single files with a `.traineddata` extension. -Tesseract expects models to be in a directory `tessdata` within what Tesseract -calls `TESSDATA_PREFIX`. When installing Tesseract from Ubuntu packages, that -location is `/usr/share/tesseract-ocr/4.00/tessdata`. When building from source -using [ocrd_all](htttps://github.com/OCR-D/ocrd_all), the models are searched -at `/path/to/ocrd_all/venv/share/tessdata`. If you want to override the -locations, you can set the `TESSDATA_PREFIX` environment variable, e.g. if you -want the models location to be `$HOME/tessdata`, you can by adding to your -`$HOME/.bashrc`: `export TESSDATA_PREFIX=$HOME`. 
- -We recommend you download the following models, either by downloading and -saving to the right location or by running `make install-models-tesseract` when -using `ocrd_all`: - - * [equ](https://github.com/tesseract-ocr/tessdata_fast/raw/master/equ.traineddata) - * [osd](https://github.com/tesseract-ocr/tessdata_fast/raw/master/osd.traineddata) - * [eng](https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata) - * [deu](https://github.com/tesseract-ocr/tessdata_fast/raw/master/deu.traineddata) - * [frk](https://github.com/tesseract-ocr/tessdata_fast/raw/master/frk.traineddata) - * [script/Latin](https://github.com/tesseract-ocr/tessdata_fast/raw/master/script/Latin.traineddata) - * [script/Fraktur](https://github.com/tesseract-ocr/tessdata_fast/raw/master/script/Fraktur.traineddata) - * [@stweil's GT4HistOCR model](https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/tessdata_fast/Fraktur_50000000.334_450937.traineddata) - -If you installed Tesseract with Ubuntu's `apt` package manager, you may want to install -standard models like `deu` or `script/Fraktur` with `apt`: +**NOTE:** The special name `*` can be used instead of a resource name/url to +download *all* known resources for this processor. To download all tesseract models: ```sh -sudo apt install tesseract-ocr-deu tesseract-ocr-script-frak +ocrd resmgr download ocrd-tesserocr-recognize '*' ``` -**NOTE:** When installing with `apt`, he `script/*` models are installed -without the `script/` prefix, so `script/Latin` becomes just `Latin`, -`script/Fraktur` becomes `Fraktur` etc. +(Note that `*` must be in quotes or escaped because of shell wildcard expansion) -OCR-D's Tesseract wrapper, -[ocrd_tesserocr](https://github.com/OCR-D/ocrd_tesserocr) and more -specifically, the `ocrd-tesserocr-recognize` processor, expects the name of the -model(s) to be provided as the `model` parameter. 
Multiple models can be -combined by concatenating with `+` (which generally improves accuracy but always slows processing): +## Installing unknown resources + +If you need to install a resource that OCR-D doesn't know of, than can be achieved with the `--any-url/-n` flag to `ocrd resmgr download`: + +To install a model for `ocrd-tesserocr-recognize` that is located at `https://my-server/mymodel.traineddata`. + +``` +ocrd resmgr download -n ocrd-tesserocr-recognize https://my-server/mymodel.traineddata +``` + +This will download and store the resource in the [proper location](#where-is-the-data) and create a stub entry in the +[user database](#user-database). You can then use it as the parameter value for the `model` parameter: + +``` +ocrd-tesserocr-recognize -P model mymodel +``` + +## List installed resources + +The `ocrd resmgr list-installed` command has the same output format as `ocrd resmgr list-available` but instead +of the database, it scans the filesystem locations [where data is searched](#where-is-the-data) for existing +resources and lists URL and description if a database entry exists. + +## User database + +Whenever the OCR-D/core resource manager encounters an unknown resource in the filesystem or when you install +a resource with `ocrd resmgr download`, it will create a new stub entry in the user database, which is found at +`$HOME/.config/ocrd/resources.yml` and created if it doesn't exist. + +This allows you to use the OCR-D/core resource manager mechanics, including +lookup of known resources by name or URL, without relying (only) on the +database maintained by the OCR-D/core developers. + +**NOTE:** If you produced or found resources that are interesting for the wider +OCR(-D) community, please tell us in the [OCR-D gitter +chat](https://gitter.im/OCR-D/Lobby) so we can add it to the database. 
+ +## Where is the data + +The lookup algorithm is [defined in our specifications](https://ocr-d.de/en/spec/ocrd_tool#file-parameters) + +In order of preference, a resource `` for a processor `ocrd-foo` is searched at: + +* `$VIRTUAL_ENV/share/ocrd-resources/ocrd-foo/` +* `$HOME/.config/ocrd-resources/ocrd-foo/` +* `$HOME/.local/share/ocrd-resources/ocrd-foo/` +* `$HOME/.cache/ocrd-resources/ocrd-foo/` +* `$PWD/ocrd-resources/ocrd-foo/` + +We recommend using the `$VIRTUAL_ENV` location, which is also the default. But +you can override the location to store data with the `--location` option, which can +be `cwd`, `virtualenv`, `config`, `data` and `cache` resp. ```sh -# Use the deu and frk models -ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -p '{"model": "deu+frk"}' -# Use the script/Fraktur model -ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -p '{"model": "script/Fraktur"}' +# will download to $PWD/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth +ocrd resmgr download --location cwd ocrd-anybaseocr-dewarp latest_net_G.pth +# will download to $HOME.cache/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth +ocrd resmgr download --location cache ocrd-anybaseocr-dewarp latest_net_G.pth ``` -## Ocropy / ocrd_cis +## Changing the default resource directory -An Ocropy model is simply the neural network serialized with Python's pickle -mechanism and is generally distributed in a gzipped form, with a `.pyrnn.gz` -extension. +The `$VIRTUAL_ENV` default location is reasonable because we heavily advertise +using virtual environments and is compatible with +[ocrd_all](https://github.com/OCR-D/ocrd_all). + +However, there are use cases where the `config`/`data/`/`cache` or even the +`cwd` option should be the default (or only) location to store resources and +resolve file parameters. 
-Ocropy has a rather convoluted algorithm to look up models, so we recommend you -explicitly set the `OCROPUS_DATA` variable to point to the directory with -ocropy's models. E.g. if you intend to store your models in `$HOME/ocropus-models`, add the following -to your `$HOME/.bashrc`: `export OCROPUS_DATA=$HOME/ocropus-models`. +To change the default location, adapt the `$HOME/.config/ocrd/config.yml` file +(it is created if it doesn't exist whenever you execute `ocrd resmgr`) which +has a `resource_location` key that accepts the same range of values as the +`ocrd resmgr --location` command line flag. -We recommend you download the following models, either by downloading and -saving to the right location or by running `make install-models-ocropus` when -using `ocrd_all`: - * [en-default.pyrnn.gz](https://github.com/zuphilip/ocropy-models/raw/master/en-default.pyrnn.gz) - * [fraktur.pyrnn.gz](https://github.com/zuphilip/ocropy-models/raw/master/fraktur.pyrnn.gz) - * [@jze's fraktur.pyrnn.gz](https://github.com/jze/ocropus-model_fraktur/raw/master/fraktur.pyrnn.gz) (save as `fraktur-jze.pyrnn.gz`) - * [@chreul's LatinHist.pyrnn.gz](https://github.com/chreul/OCR_Testdata_EarlyPrintedBooks/raw/master/LatinHist-98000.pyrnn.gz) +## Notes on specific processors +## Ocropy / ocrd_cis + +An Ocropy model is simply the neural network serialized with Python's pickle +mechanism and is generally distributed in a gzipped form, with a `.pyrnn.gz` +extension and can be used as such, no need to unarchive. 
-To use a specific model with OCR-D's ocropus wrapper in [ocrd_cis](https://github.com/cisocrgroup/ocrd_cis) and more specifically, the `ocrd-cis-ocropy-recognize` processor, use the `model` parameter: +To use a specific model with OCR-D's ocropus wrapper in +[ocrd_cis](https://github.com/cisocrgroup/ocrd_cis) and more specifically, the +`ocrd-cis-ocropy-recognize` processor, use the `model` parameter: ```sh -ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-OCRO -p '{"model": "fraktur-jze.pyrnn.gz"}' +# Model will be downloaded on-demand if it is not locally available yet +ocrd-cis-ocropy-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-OCRO -P model fraktur-jze.pyrnn.gz ``` ## Calamari / ocrd_calamari Calamari models are Tensorflow model directories. For distribution, this directory is usually packed to a tarball or ZIP file. Once downloaded, these -containers must be unpacked to a directory again. - -As calamari does not have a model discovery setup, you must always provide the -path with a wildcard listing all `*.ckpt.json` ("checkpoint") files. - -We recommend you download the following model, either by downloading and -unpacking manually or by using `make install-models-calamari` if using -`ocrd_all`: - - * [@mike-gerber's GT4HistOCR model](https://qurator-data.de/calamari-models/GT4HistOCR/2019-12-11T11_10+0100/model.tar.xz) +containers must be unpacked to a directory again. `ocrd resmgr` handles this +for you, so you just need the name of the resource in the database. The Calamari-OCR project also maintains a [repository of models](https://github.com/Calamari-OCR/calamari_models). To use a specific model with OCR-D's calamari wrapper [ocrd_calamari](https://github.com/OCR-D/ocrd_calamari) and more specifically, -the `ocrd-calamari-recognize` processor, use the `checkpoint` parameter: +the `ocrd-calamari-recognize` processor, use the `checkpoint_dir` parameter: + +```sh +# To use the "default" model, i.e. 
the one trained on GT4HistOCR by QURATOR +ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA +# To use your own trained model +ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA -P checkpoint_dir /path/to/modeldir +# or, to be able to control which checkpoints to use: +ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA -P checkpoint '/path/to/modeldir/*.ckpt.json' +``` + +## Tesseract / ocrd_tesserocr + +Tesseract models are single files with a `.traineddata` extension. + +Since tesseract only supports model lookup in a single directory, models should +only be stored in a single location. If the default location (`virtualenv`) is +not the place you want to use for tesseract models, consider [changing the default location +in the OCR-D config file](#changing-the-default-resource-directory). + +OCR-D's Tesseract wrapper, +[ocrd_tesserocr](https://github.com/OCR-D/ocrd_tesserocr) and more +specifically, the `ocrd-tesserocr-recognize` processor, expects the name of the +model(s) to be provided as the `model` parameter. Multiple models can be +combined by concatenating with `+` (which generally improves accuracy but always slows processing): ```sh -ocrd-calamari-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-CALA -p '{"checkpoint": "/path/to/model/*.ckpt.json"}' +# Use the deu and frk models +ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model 'deut+frk' +# Use the Fraktur model +ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P Fraktur ``` # Model training From a3e64cdc8558275ab68df0ee75198e53b8b745a2 Mon Sep 17 00:00:00 2001 From: Konstantin Baierer Date: Tue, 19 Jan 2021 19:10:56 +0100 Subject: [PATCH 02/20] models: note on tesseract model storage --- site/en/models.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/site/en/models.md b/site/en/models.md index 372b425fd..c2c204e41 100644 --- a/site/en/models.md +++ b/site/en/models.md @@ -212,6 +212,12 @@ only be stored in a single location. 
If the default location (`virtualenv`) is not the place you want to use for tesseract models, consider [changing the default location in the OCR-D config file](#changing-the-default-resource-directory). +**NOTE:** For reasons of efficiency and to avoid duplicate models, all `ocrd-tesserocr-*` processors +reuse the resource directory for `ocrd-tesserocr-recognize`. + +If the `TESSDATA_PREFIX` environment variable is set when any of the tesseract processors +are called, it will be the location to look for resources instead of the default. + OCR-D's Tesseract wrapper, [ocrd_tesserocr](https://github.com/OCR-D/ocrd_tesserocr) and more specifically, the `ocrd-tesserocr-recognize` processor, expects the name of the From efcf5317e19c0a49592aa12c482695d731d7035f Mon Sep 17 00:00:00 2001 From: Konstantin Baierer Date: Wed, 20 Jan 2021 11:47:39 +0100 Subject: [PATCH 03/20] Update site/en/models.md Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com> --- site/en/models.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/site/en/models.md b/site/en/models.md index c2c204e41..c3266de9a 100644 --- a/site/en/models.md +++ b/site/en/models.md @@ -14,8 +14,8 @@ path to a model (calamari). Since [v2.22.0](https://github.com/OCR-D/core/releases/v2.22.0), OCR-D/core comes with a framework for managing processor resources uniformly. This means -that OCR-D/core will take care of lookin in well-defined places in the -filesystem for resources for specific processors. It also knows how to cache +that processors can delegate to OCR-D/core to resolve specific file resources by name, +looking in well-defined places in the filesystem. This also includes downloading and caching file parameters passed as a URL. OCR-D/core also comes with a bundled database of known resources, such as OCR models, configurations and other processor-specific data.
This means that OCR-D users should be able to From 1346b563c4221c0a49681dc826b9a42f45aa0912 Mon Sep 17 00:00:00 2001 From: Konstantin Baierer Date: Wed, 20 Jan 2021 11:47:56 +0100 Subject: [PATCH 04/20] Update site/en/models.md Co-authored-by: Elisabeth Engl <53007946+EEngl52@users.noreply.github.com> --- site/en/models.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site/en/models.md b/site/en/models.md index c3266de9a..25b130396 100644 --- a/site/en/models.md +++ b/site/en/models.md @@ -16,7 +16,7 @@ Since [v2.22.0](https://github.com/OCR-D/core/releases/v2.22.0), OCR-D/core comes with a framework for managing processor resources uniformly. This means that processors can delegate to OCR-D/core to resolve specific file resources by name, looking in well-defined places in the filesystem. This also includes downloading and caching -file parameters passed as a URL. OCR-D/core also comes with a bundled database +file parameters passed as a URL. Furthermore, OCR-D/core comes with a bundled database of known resources, such as OCR models, configurations and other processor-specific data. This means that OCR-D users should be able to concentrate on fine-tuning their OCR workflows and not bother with implementation From 9de53dbf16a2133081f47d13d6a51ccaf24b86c2 Mon Sep 17 00:00:00 2001 From: Konstantin Baierer Date: Wed, 20 Jan 2021 11:48:12 +0100 Subject: [PATCH 05/20] Update site/en/models.md Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com> --- site/en/models.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/site/en/models.md b/site/en/models.md index 25b130396..1d5c34516 100644 --- a/site/en/models.md +++ b/site/en/models.md @@ -17,8 +17,8 @@ comes with a framework for managing processor resources uniformly. This means that processors can delegate to OCR-D/core to resolve specific file resources by name, looking in well-defined places in the filesystem. 
This also includes downloading and caching file parameters passed as a URL. Furthermore, OCR-D/core comes with a bundled database -of known resources, such as OCR models, configurations and other -processor-specific data. This means that OCR-D users should be able to +of known resources, such as models, dictionaries, configurations and other +processor-specific data files. This means that OCR-D users should be able to concentrate on fine-tuning their OCR workflows and not bother with implementation details like "where do I get models from and where do I put them". From 9ceed646193781cf1ea86e6ee857c9cd1721ea82 Mon Sep 17 00:00:00 2001 From: Konstantin Baierer Date: Wed, 20 Jan 2021 11:48:25 +0100 Subject: [PATCH 06/20] Update site/en/models.md Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com> --- site/en/models.md | 1 + 1 file changed, 1 insertion(+) diff --git a/site/en/models.md b/site/en/models.md index 1d5c34516..b73340db0 100644 --- a/site/en/models.md +++ b/site/en/models.md @@ -21,6 +21,7 @@ of known resources, such as models, dictionaries, configurations and other processor-specific data files. This means that OCR-D users should be able to concentrate on fine-tuning their OCR workflows and not bother with implementation details like "where do I get models from and where do I put them". +In particular, users can reference file parameters by name now. All of the above mentioned functionality can be accessed using the `ocrd resmgr` command line tool. 
From c449662bab243e91dce16479fd04eaa2a8ab2ac8 Mon Sep 17 00:00:00 2001 From: Konstantin Baierer Date: Wed, 20 Jan 2021 11:48:42 +0100 Subject: [PATCH 07/20] Update site/en/models.md Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com> --- site/en/models.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site/en/models.md b/site/en/models.md index b73340db0..aa7f2fdc9 100644 --- a/site/en/models.md +++ b/site/en/models.md @@ -50,7 +50,7 @@ ocrd-cis-ocropy-recognize ocropy historical latin model by github.com/chreul ``` -As you can see, resources are grouped by the processor they are used by. +As you can see, resources are grouped by the processors which make use of them. The word after the list symbol, e.g. `qurator-gt4hist-0.3`, `LatinHist.pyrnn.gz`, define the "name" of the resource, a shorthand you can From 5b338537d84e6fdf5594d2d810cebd3e33282573 Mon Sep 17 00:00:00 2001 From: Konstantin Baierer Date: Wed, 20 Jan 2021 11:48:57 +0100 Subject: [PATCH 08/20] Update site/en/models.md Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com> --- site/en/models.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site/en/models.md b/site/en/models.md index aa7f2fdc9..fe53b7fd7 100644 --- a/site/en/models.md +++ b/site/en/models.md @@ -53,7 +53,7 @@ ocrd-cis-ocropy-recognize As you can see, resources are grouped by the processors which make use of them. The word after the list symbol, e.g. `qurator-gt4hist-0.3`, -`LatinHist.pyrnn.gz`, define the "name" of the resource, a shorthand you can +`LatinHist.pyrnn.gz`, defines the _name_ of the resource, which is a shorthand you can use in parameters without having to specify the full URL (in brackets after the name). 
From 201feb5a3626ec6b0a50f405b51e373383588cb3 Mon Sep 17 00:00:00 2001 From: Konstantin Baierer Date: Wed, 20 Jan 2021 11:49:14 +0100 Subject: [PATCH 09/20] Update site/en/models.md Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com> --- site/en/models.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site/en/models.md b/site/en/models.md index fe53b7fd7..d44443cff 100644 --- a/site/en/models.md +++ b/site/en/models.md @@ -65,7 +65,7 @@ You can install resources with the `ocrd resmgr download` command. It expects the name of the processor as the first argument and either the name or URL of a resource as a second argument. -Likewise, model distribution is not currently centralised within OCR-D though we +Although model distribution is not currently centralised within OCR-D, we are working towards a central model repository. For example, to install the `LatinHist.pyrnn.gz` resource for `ocrd-cis-ocropy-recognize`: From 56091bc1940b686dc0b865f58a0b8c699897b317 Mon Sep 17 00:00:00 2001 From: Konstantin Baierer Date: Wed, 20 Jan 2021 11:49:35 +0100 Subject: [PATCH 10/20] Update site/en/models.md Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com> --- site/en/models.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site/en/models.md b/site/en/models.md index d44443cff..e3c78433d 100644 --- a/site/en/models.md +++ b/site/en/models.md @@ -87,7 +87,7 @@ download *all* known resources for this processor. To download all tesseract mod ocrd resmgr download ocrd-tesserocr-recognize '*' ``` -(Note that `*` must be in quotes or escaped because of shell wildcard expansion) +(Note that `*` must be in quotes or escaped to avoid wildcard expansion in the shell.) 
## Installing unknown resources From 99c6453acc66c7cf70c591050e9c28ffcbeffaa3 Mon Sep 17 00:00:00 2001 From: Konstantin Baierer Date: Wed, 20 Jan 2021 11:49:57 +0100 Subject: [PATCH 11/20] Update site/en/models.md Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com> --- site/en/models.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site/en/models.md b/site/en/models.md index e3c78433d..f7b2e00c8 100644 --- a/site/en/models.md +++ b/site/en/models.md @@ -91,7 +91,7 @@ ocrd resmgr download ocrd-tesserocr-recognize '*' ## Installing unknown resources -If you need to install a resource that OCR-D doesn't know of, than can be achieved with the `--any-url/-n` flag to `ocrd resmgr download`: +If you need to install a resource which OCR-D doesn't know of, that can be achieved by passing its URL in combination with the `--any-url/-n` flag to `ocrd resmgr download`: To install a model for `ocrd-tesserocr-recognize` that is located at `https://my-server/mymodel.traineddata`. From b958afe871ae13306226c9b04d0321a04fe0fb58 Mon Sep 17 00:00:00 2001 From: Konstantin Baierer Date: Wed, 20 Jan 2021 11:50:13 +0100 Subject: [PATCH 12/20] Update site/en/models.md Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com> --- site/en/models.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site/en/models.md b/site/en/models.md index f7b2e00c8..1502e02c6 100644 --- a/site/en/models.md +++ b/site/en/models.md @@ -108,7 +108,7 @@ ocrd-tesserocr-recognize -P model mymodel ## List installed resources -The `ocrd resmgr list-installed` command has the same output format as `ocrd resmgr list-available` but instead +The `ocrd resmgr list-installed` command has the same output format as `ocrd resmgr list-available`. But instead of the database, it scans the filesystem locations [where data is searched](#where-is-the-data) for existing resources and lists URL and description if a database entry exists.
From d886e682bff8d871f770a427f300898193154f87 Mon Sep 17 00:00:00 2001 From: Konstantin Baierer Date: Thu, 21 Jan 2021 16:21:28 +0100 Subject: [PATCH 13/20] models: document mounting models in docker --- site/en/models.md | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/site/en/models.md b/site/en/models.md index c2c204e41..203563f6d 100644 --- a/site/en/models.md +++ b/site/en/models.md @@ -231,6 +231,31 @@ ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model 'deut+frk' ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P Fraktur ``` +# Models and docker + +We recommend a two-step process to make models available in Docker. First +download all the models that you want to use on the host system. When running +the docker container, mount that local directory into the container alongside +the data you want to process. + +Download the models to `$HOME/.local/share/ocrd-resources`: + +```sh +ocrd resmgr download --location data ocrd-tesserocr-recognize eng.traineddata +ocrd resmgr download --location data ocrd-calamari-recognize default +# ... +``` + +Run the `ocrd_all` Docker container: + +```sh +docker run --user $(id -u) --workdir /data \ + --volume $PWD:/data \ + --volume $HOME/.local/cache/ocrd-resources:/ocrd-resources \ + ocrd_all ocrd-tesserocr-recognize -I IN -O OUT -P model eng +``` + + # Model training With the pretrained models mentioned above, good results can be obtained for many originals. 
Nevertheless, the From 6903a84d8abc5849913b2747c6f0631f13f75c76 Mon Sep 17 00:00:00 2001 From: Konstantin Baierer Date: Fri, 22 Jan 2021 17:27:01 +0100 Subject: [PATCH 14/20] Update site/en/models.md Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com> --- site/en/models.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/site/en/models.md b/site/en/models.md index c121f6970..3c7385892 100644 --- a/site/en/models.md +++ b/site/en/models.md @@ -132,15 +132,15 @@ The lookup algorithm is [defined in our specifications](https://ocr-d.de/en/spec In order of preference, a resource `` for a processor `ocrd-foo` is searched at: -* `$VIRTUAL_ENV/share/ocrd-resources/ocrd-foo/` -* `$HOME/.config/ocrd-resources/ocrd-foo/` -* `$HOME/.local/share/ocrd-resources/ocrd-foo/` -* `$HOME/.cache/ocrd-resources/ocrd-foo/` * `$PWD/ocrd-resources/ocrd-foo/` +* `$XDG_DATA_HOME/ocrd-resources/ocrd-foo/` +* `/usr/local/share/ocrd-resources/ocrd-foo/` -We recommend using the `$VIRTUAL_ENV` location, which is also the default. But +(where `XDG_DATA_HOME` defaults to `$HOME/.local/share` if unset). + +We recommend using the `$XDG_DATA_HOME` location, which is also the default. But you can override the location to store data with the `--location` option, which can -be `cwd`, `virtualenv`, `config`, `data` and `cache` resp. +be `cwd`, `data` and `system` resp. 
```sh # will download to $PWD/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth From 90fbb9f35b5c85b4f1927223a64a6a32cb405293 Mon Sep 17 00:00:00 2001 From: Konstantin Baierer Date: Fri, 22 Jan 2021 17:27:16 +0100 Subject: [PATCH 15/20] Update site/en/models.md Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com> --- site/en/models.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/site/en/models.md b/site/en/models.md index 3c7385892..66f1cd909 100644 --- a/site/en/models.md +++ b/site/en/models.md @@ -159,10 +159,6 @@ However, there are use cases where the `config`/`data/`/`cache` or even the `cwd` option should be the default (or only) location to store resources and resolve file parameters. -To change the default location, adapt the `$HOME/.config/ocrd/config.yml` file -(it is created if it doesn't exist whenever you execute `ocrd resmgr`) which -has a `resource_location` key that accepts the same range of values as the -`ocrd resmgr --location` command line flag. ## Notes on specific processors From 8d4b7deded56f9dedda248a96bf0c7f95d1b3558 Mon Sep 17 00:00:00 2001 From: Konstantin Baierer Date: Fri, 22 Jan 2021 17:27:42 +0100 Subject: [PATCH 16/20] Update site/en/models.md Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com> --- site/en/models.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/site/en/models.md b/site/en/models.md index 66f1cd909..355dda8cf 100644 --- a/site/en/models.md +++ b/site/en/models.md @@ -145,8 +145,8 @@ be `cwd`, `data` and `system` resp. 
```sh # will download to $PWD/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth ocrd resmgr download --location cwd ocrd-anybaseocr-dewarp latest_net_G.pth -# will download to $HOME.cache/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth -ocrd resmgr download --location cache ocrd-anybaseocr-dewarp latest_net_G.pth +# will download to /usr/local/share/ocrd-resources/ocrd-anybaseocr-dewarp/latest_net_G.pth +ocrd resmgr download --location system ocrd-anybaseocr-dewarp latest_net_G.pth ``` ## Changing the default resource directory From 65d94da1027a829cc43302d3cc9ce81fea9ee415 Mon Sep 17 00:00:00 2001 From: Konstantin Baierer Date: Fri, 22 Jan 2021 17:28:12 +0100 Subject: [PATCH 17/20] Update site/en/models.md Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com> --- site/en/models.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/site/en/models.md b/site/en/models.md index 355dda8cf..147257e2f 100644 --- a/site/en/models.md +++ b/site/en/models.md @@ -151,13 +151,13 @@ ocrd resmgr download --location system ocrd-anybaseocr-dewarp latest_net_G.pth ## Changing the default resource directory -The `$VIRTUAL_ENV` default location is reasonable because we heavily advertise -using virtual environments and is compatible with -[ocrd_all](https://github.com/OCR-D/ocrd_all). +The `$XDG_DATA_HOME` default location is reasonable because +models are usually large files which should persist across different deployments, +both native and containerized, both single-module and [ocrd_all](https://github.com/OCR-D/ocrd_all). +Moreover, that variable can easily be overridden during installation. -However, there are use cases where the `config`/`data/`/`cache` or even the -`cwd` option should be the default (or only) location to store resources and -resolve file parameters. +However, there are use cases where `system` or even `cwd` should be +used as location to store resources, hence the `--location` option. 
From f20205a0f0717d510037c8f660f3d581febe1423 Mon Sep 17 00:00:00 2001
From: Konstantin Baierer
Date: Fri, 22 Jan 2021 17:32:59 +0100
Subject: [PATCH 18/20] Update site/en/models.md

Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
---
 site/en/models.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/site/en/models.md b/site/en/models.md
index 147257e2f..7ce4fafb2 100644
--- a/site/en/models.md
+++ b/site/en/models.md
@@ -228,7 +228,7 @@ ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P model 'deut+frk'
 ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P Fraktur
 ```
 
-# Models and docker
+# Models and Docker
 
 We recommend a two-step process to make models available in Docker. First
 download all the models that you want to use on the host system. When running
From f76c0a7b0c64fd1ddf69ed2c9f623708014712c4 Mon Sep 17 00:00:00 2001
From: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Date: Mon, 25 Jan 2021 17:58:02 +0100
Subject: [PATCH 19/20] rewrite docker model mounting section

---
 site/en/models.md | 46 ++++++++++++++++++++++++----------------------
 1 file changed, 24 insertions(+), 22 deletions(-)

diff --git a/site/en/models.md b/site/en/models.md
index 7ce4fafb2..3df1ab0ae 100644
--- a/site/en/models.md
+++ b/site/en/models.md
@@ -230,28 +230,30 @@ ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -P Fraktur
 
 # Models and Docker
 
-We recommend a two-step process to make models available in Docker. First
-download all the models that you want to use on the host system. When running
-the docker container, mount that local directory into the container alongside
-the data you want to process.
-
-Download the models to `$HOME/.local/share/ocrd-resources`:
-
-```sh
-ocrd resmgr download --location data ocrd-tesserocr-recognize eng.traineddata
-ocrd resmgr download --location data ocrd-calamari-recognize default
-# ...
-```
-
-Run the `ocrd_all` Docker container:
-
-```sh
-docker run --user $(id -u) --workdir /data \
- --volume $PWD:/data \
- --volume $HOME/.local/cache/ocrd-resources:/ocrd-resources \
- ocrd_all ocrd-tesserocr-recognize -I IN -O OUT -P model eng
-```
-
+We recommend keeping all downloaded resources in a persistent host directory,
+separate of the `ocrd/*` Docker container and data directory, and mounting that
+resource directory into a specific path in the container alongside the data directory.
+The host resource directory can be empty initially. Each time you run the Docker container,
+your processors will access the host directory to resolve resources, and you can download
+additional models into that location using `ocrd resmgr`.
+
+The following will assume (without loss of generality) that your host-side data
+path is under `./data`, and the host-side resource path is under `./models`:
+
+- To download models to `./models` in the host FS and `/usr/local/share/ocrd-resources` in Docker:
+ docker run --user $(id -u) \
+ --volume $PWD/models:/usr/local/share/ocrd-resources \
+ ocrd/all \
+ ocrd resmgr download ocrd-tesserocr-recognize eng.traineddata\; \
+ ocrd resmgr download ocrd-calamari-recognize default\; \
+ ...
+- To run processors, as usual do:
+ docker run --user $(id -u) --workdir /data \
+ --volume $PWD/data:/data \
+ --volume $PWD/models:/usr/local/share/ocrd-resources \
+ ocrd/all ocrd-tesserocr-recognize -I IN -O OUT -P model eng
+
+This principle applies to all `ocrd/*` Docker images, e.g. you can replace `ocrd/all` above with `ocrd/tesserocr` as well.
 # Model training
 
From f1ba884896192d5d7d55fa76befc8fd0e47263f6 Mon Sep 17 00:00:00 2001
From: Konstantin Baierer
Date: Tue, 26 Jan 2021 12:38:39 +0100
Subject: [PATCH 20/20] Update site/en/models.md

Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
---
 site/en/models.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/site/en/models.md b/site/en/models.md
index 3df1ab0ae..de43c542b 100644
--- a/site/en/models.md
+++ b/site/en/models.md
@@ -87,7 +87,12 @@ download *all* known resources for this processor. To download all tesseract mod
 ocrd resmgr download ocrd-tesserocr-recognize '*'
 ```
 
-(Note that `*` must be in quotes or escaped to avoid wildcard expansion in the shell.)
+**NOTE:** Equally, the special processor `*` can be used instead of a processor and a resource
+to download *all* known resources for *all* installed processors:
+
+ ocrd resmgr download '*'
+
+(In either case, `*` must be in quotes or escaped to avoid wildcard expansion by the shell.)
 
 ## Installing unknown resources