Quality improvement by searching for common OCR errors (transferred from OL) #97

@RayBB

Original text from internetarchive/openlibrary#810:

Sorry if this is out of place, but I just stumbled across an oddity. It appears that the Google-digitized non-English editions have some recurring OCR problems, which show up in the boilerplate Google inserted.

For instance, Googling "carcfully scannod" site:archive.org turns up 46,900 results, most of them scans of texts in languages that use diacritics. That can't be a coincidence. I'm wondering if it can be put to use for quality improvement. Might these texts just need a fresh run through OCR with more modern software?
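
To make the idea concrete, here is a rough sketch of how a text could be flagged when its boilerplate contains a near-miss of a known phrase. Everything in it is an assumption for illustration: the clean phrase "carefully scanned" is only inferred from the garbled search string above, the function names and the 0.8 threshold are made up, and this is not something Open Library or archive.org is known to run.

```python
import string
from difflib import SequenceMatcher

# Assumption: the clean form of the garbled phrase quoted above.
EXPECTED = "carefully scanned"


def best_match(text, expected=EXPECTED):
    """Return (ratio, snippet) for the word window most similar to `expected`."""
    # Lowercase and trim surrounding punctuation so "scanned," still matches.
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    n = len(expected.split())
    best = (0.0, "")
    for i in range(len(words) - n + 1):
        snippet = " ".join(words[i:i + n])
        ratio = SequenceMatcher(None, expected, snippet).ratio()
        if ratio > best[0]:
            best = (ratio, snippet)
    return best


def looks_misrecognized(text, expected=EXPECTED, threshold=0.8):
    """True when the text has a close but inexact match for `expected`.

    A ratio of 1.0 means the phrase was recognized correctly; a ratio
    below the (hypothetical) threshold means the phrase is probably absent.
    """
    ratio, _ = best_match(text, expected)
    return threshold <= ratio < 1.0
```

A text whose best match is exact is left alone, so only scans where the phrase came out garbled would be queued as candidates for a fresh OCR run. The threshold would need tuning against real items before relying on it.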

More discussion in the thread.
