filter out binarized images (independent of the workflow), to improve segmentation quality
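To illustrate what filtering out binarized images could amount to, here is a minimal sketch. It assumes each derived image of a page carries a comma-separated list of feature labels and that candidates are ordered from least to most processed. The function `pick_image`, its parameters, and the file names are hypothetical and are not taken from this changeset.

```python
def pick_image(candidates, feature_filter="binarized"):
    """Return the most-processed candidate that has none of the filtered features.

    `candidates` is an ordered list of (name, features) pairs, oldest first,
    where `features` is a comma-separated string such as "cropped,binarized".
    Returns None when every candidate is filtered out.
    (Hypothetical helper for illustration only.)
    """
    banned = {f for f in feature_filter.split(",") if f}
    # Walk from the most-processed derivative back to the original image.
    for name, features in reversed(candidates):
        present = {f for f in features.split(",") if f}
        if not (present & banned):
            return name
    return None


candidates = [
    ("raw.png", ""),
    ("cropped.png", "cropped"),
    ("bin.png", "cropped,binarized"),
]
chosen = pick_image(candidates)
```

With these example candidates, the cropped but not yet binarized image is selected, regardless of which binarization step the workflow ran before.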
Codecov Report
@@            Coverage Diff             @@
##           master     #144      +/-   ##
==========================================
+ Coverage   37.73%   37.77%   +0.04%
==========================================
  Files           9        9
  Lines        1023      998      -25
  Branches      216      212       -4
==========================================
- Hits          386      377       -9
+ Misses        565      555      -10
+ Partials       72       66       -6
This needs to be tested systematically. I expect to see both degradation and improvement, depending on how hard binarization is for a given image. See here for explanation.
kba left a comment
I understand the reasoning and am subscribing to tesseract-ocr/tesseract#3083 to follow the discussion on upstream changes. The changeset (filtering out binarized images) is sensible, but it needs thorough testing to ensure that it is more beneficial than detrimental. Alternatively, it should perhaps be parameterizable.
I thought about making it parameterizable, but at workflow configuration time you have next to no chance of knowing which option is going to be better. (I would guess that only input images which fare well under global Otsu are better off with the change. But we have no automatic indicator of binarization quality yet. At the very least, we should strive for some estimator based on the local distribution of connected component statistics.) I still hope that we can fix the problem in Tesseract itself.
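As a rough illustration of what an estimator based on local connected component statistics could look at, here is a minimal sketch. It labels 4-connected foreground components and reports, per tile, the fraction of very small components (speckle). This is only one possible statistic. The names `component_sizes` and `speckle_ratio_per_tile`, the tile size, and the size threshold are assumptions made for the example and are not part of Tesseract or of this changeset.

```python
from collections import deque


def component_sizes(binary):
    """Return (row, col, size) for each 4-connected foreground component.

    `binary` is a list of rows of 0/1 values (1 = foreground).
    (row, col) is the first pixel of the component in scan order.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill over the 4-neighbourhood.
            seen[y][x] = True
            queue = deque([(y, x)])
            size = 0
            while queue:
                cy, cx = queue.popleft()
                size += 1
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append((y, x, size))
    return components


def speckle_ratio_per_tile(binary, tile=8, min_size=3):
    """Map (tile_row, tile_col) to the fraction of components below min_size.

    Tiles without any component are omitted from the result.
    """
    counts = {}
    for y, x, size in component_sizes(binary):
        key = (y // tile, x // tile)
        total, small = counts.get(key, (0, 0))
        counts[key] = (total + 1, small + (1 if size < min_size else 0))
    return {key: small / total for key, (total, small) in counts.items()}


# Toy image: a clean 2x2 blob on the left, speckle plus a short stroke on the right.
image = [
    [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
ratios = speckle_ratio_per_tile(image)
```

A tile whose ratio is much higher than that of its neighbours might indicate that binarization broke down locally, but whether such a number correlates with segmentation quality would itself need the systematic testing mentioned above.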
d231edb to 2b3e8d6