add 1st draft line GT/training specs by kba · Pull Request #105 · OCR-D/spec

kba · 2019-01-29T10:20:27Z

No description provided.

wrznr

I strongly recommend the introduction of a third GT subset devel.

In addition, some minor comments.

gt-profile.yml

gt-spec.md

training-schema.yml

gt-profile.yml

kba · 2019-01-31T09:07:41Z

> I strongly recommend the introduction of a third GT subset devel.

@wrznr Can you elaborate?

wrznr · 2019-03-19T11:51:35Z

@kba

> Can you elaborate?

Most training procedures allow for three different sets of GT: train, eval and devel. While the purpose of the first two is presumably clear, the devel set is used during training for parameter tuning and error estimation. For example, ocropus-rtrain has the parameter --tests for this purpose. Note that, strictly speaking, you must not reuse your evaluation data as development data.

wrznr · 2019-03-19T11:55:56Z

@kba Cf. https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets
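To make the three-way partition concrete, here is a minimal sketch, assuming the line GT is available as a list of file names. The function name `split_gt`, the 80/10/10 ratios and the fixed seed are illustrative assumptions, not part of the proposed spec.

```python
import random


def split_gt(lines, train=0.8, devel=0.1, seed=0):
    """Partition line GT items into disjoint train/devel/eval subsets.

    The fractions are hypothetical defaults; whatever remains after
    train and devel goes to eval.
    """
    if not 0 < train < 1 or not 0 <= devel < 1 or train + devel >= 1:
        raise ValueError("train and devel fractions must leave room for eval")
    # Sort first so the result depends only on the seed, not on input order.
    shuffled = sorted(lines)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_devel = int(len(shuffled) * devel)
    return {
        "train": shuffled[:n_train],
        "devel": shuffled[n_train:n_train + n_devel],
        "eval": shuffled[n_train + n_devel:],
    }


subsets = split_gt([f"line_{i:04d}.png" for i in range(100)])
```

Because the three slices never overlap, the eval subset stays untouched while devel is used for tuning during training.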

cneud · 2019-05-21T23:01:08Z

@wrznr So far I have mainly applied k-fold cross-validation. Would you still see added benefits in partitioning into three sets?
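For comparison, a minimal sketch of k-fold cross-validation over line GT, assuming the GT is a list of file names. The helper `k_fold` is hypothetical and written with plain Python only; it is not taken from any OCR engine.

```python
def k_fold(items, k=5):
    """Yield (train, held_out) pairs; each item is held out exactly once."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    # Fold i takes every k-th item starting at offset i.
    folds = [items[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, held_out


lines = [f"line_{i:04d}.png" for i in range(10)]
folds = list(k_fold(lines, k=5))
```

Each fold yields only two subsets. If a devel set is also wanted for tuning during training, it would have to be carved out of the training portion of each fold.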

kba · 2019-05-24T10:45:07Z

model-evaluation-schema.yml

+  - groundTruthBag
+  - model
+properties:
+  engineName:


Delete. Derived from model.

kba · 2019-05-24T10:45:14Z

model-evaluation-schema.yml

+      - kraken
+      - tesseract
+      - calamari
+  engineVersion:


Delete. Derived from model.

tboenig · 2019-05-24T11:24:11Z

gt-profile.yml

+    required: false
+    default: 'image/png'
+    values:
+      - 'image/png'


Would it make more sense to differentiate between compressed TIFF and JPEG2000?

You mean additionally allow image/jp2? Do engines allow JPEG2000 input for training?

tboenig · 2019-05-24T12:13:11Z

gt-profile.yml

+BagIt-Profile-Info:
+  BagIt-Profile-Identifier: https://ocr-d.github.io/gt-profile.json
+  BagIt-Profile-Version: '1.2.0'
+  Source-Organization: OCR-D


What about information on the origin of the digitized lines?

A minimal bibliographic record based on DC?

And for artificially generated lines (+ degeneration): what about the degeneration algorithm?

I think this comment may be in the wrong place. It probably belongs under ## Line metadata.

See https://github.com/OCR-D/spec/pull/105/files/6827085d051e945062203b82ef921e54025cfbda#diff-ee256e83a17cfe309565c88ab376091a for the definition of what is currently supposed to be in there. Bibliographic metadata would be in the METS referred to by metsUrl. I am not sure how to encode provenance at the line level, though. @VolkerHartmann?

cneud · 2019-08-08T16:47:05Z

@wrznr Do your remaining requested changes relate to this comment only, or is there other stuff that needs changing (for the time being)?

kba · 2019-08-08T17:10:39Z

@Doreenruirui's work on okralact has diverged significantly from these specs. It makes little sense to publish these specs with the only implementation implementing it differently.

@Doreenruirui can you compare your schemas and documentation with this so we can integrate that part of okralact into the specs?

Doreenruirui · 2019-08-08T17:25:32Z

> @Doreenruirui's work on okralact has diverged significantly from these specs. It makes little sense to publish these specs with the only implementation implementing it differently.
>
> @Doreenruirui can you compare your schemas and documentation with this so we can integrate that part of okralact into the specs?

@kba I am sorry, I am not very familiar with GitHub. Can you point me to the document I should compare with my schemas?

kba · 2019-08-08T17:30:31Z

@Doreenruirui We're discussing these changes/new files: https://github.com/OCR-D/spec/pull/105/files.

In particular, I would like to harmonize the proposal here (https://github.com/OCR-D/spec/pull/105/files?file-filters%5B%5D=.md#diff-2ae93b1f468c44b9f7e195133a0fb539) of using BagIt for the line GT with your approach in okralact with respect to the input format.

Also interesting would be to compare okralact's engine schemas with the schema proposed here https://github.com/OCR-D/spec/pull/105/files?file-filters%5B%5D=.md&file-filters%5B%5D=.yml#diff-690d5874f98dfbd6737bc0168b6084d8 and https://github.com/OCR-D/spec/pull/105/files?file-filters%5B%5D=.md&file-filters%5B%5D=.yml#diff-a1f62fd4dd219fc5c5d5f0ccb419c88b

My original review does not relate to the current state very much.

add 1st draft line GT/training specs

d02d78f

kba requested review from VolkerHartmann, tboenig and wrznr January 29, 2019 10:20

kba mentioned this pull request Jan 29, 2019

Metadata for OCR models and/or OCR model training sets #86

Open

wrznr previously requested changes Jan 29, 2019

wrznr added the enhancement label Jan 29, 2019

mittagessen reviewed Jan 29, 2019

kba and others added 6 commits January 31, 2019 09:16

make transcription normalization required but allow 'non-normalized' as value

73dc4c4

clarify that all transcriptions must be unicode/utf8

ab7b6e0

fix image extensions in gt-spec to fit gt-profile

00fa0aa

typos

d0dcab0

forbid jpeg for bitonal line images

b91c9a7

preliminary media type for tesseract >= 4 models

9300b25

kba added 10 commits January 31, 2019 10:52

single-line metadata: add teiUrl

2ae7d57

make coords in single-line schema optional

b348592

allow build.sh files in bags

5b3d6df

recognition schema for evaluating training progress

c8bdc85

model-validation-schema: measures

b706d82

rename model-validation -> model-evaluation

c9aeb4a

Gt-Prediction-* for linegt profile

c38a7a6

differentiate trainerArgs and recognizerArgs

86761a6

groundTruthGlob -> trainingGlob, evaluation -> validation, +evaluationGlob

6a9e00b

engineArguments -> recognizerArguments

6827085

cneud removed the request for review from VolkerHartmann August 15, 2022 19:20

7 participants