Invalid structMap produced #1195

@mikegerber

This might be a bug in ocrd-cis actually, so beware.

We encountered a number of problems elsewhere due to an invalid physical structMap. Here, I managed to reproduce it with the latest ocrd/all:maximum Docker image, with the following steps:

  1. Starting with the workspace here: https://ub-backup.bib.uni-mannheim.de/~stweil/quiver-benchmark/workflows/workspaces/reichsanzeiger_random_selected_pages_ocr/data/reichsanzeiger_random/
  2. I removed all filegroups except OCR-D-IMG and OCR-D-GT-SEG-LINE, using ocrd workspace remove-group -rf. → After this, the structMap is OK!
  3. Then I ran ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN → After this, the structMap is INVALID
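
For anyone who wants to script the reproduction, steps 2 and 3 can be sketched as a small Python helper. It only builds the command lines and does not run them. The function name `repro_commands` and the example group `OCR-D-EXAMPLE-GROUP` are hypothetical; the real list of filegroups has to come from the workspace itself.

```python
import shlex

# Filegroups that are kept in step 2; everything else is removed.
KEEP = {"OCR-D-IMG", "OCR-D-GT-SEG-LINE"}


def repro_commands(all_groups, keep=KEEP):
    """Return the shell commands for steps 2 and 3 as strings (nothing is executed)."""
    commands = [
        # Step 2: remove every filegroup that is not in the keep set.
        shlex.join(["ocrd", "workspace", "remove-group", "-rf", group])
        for group in all_groups
        if group not in keep
    ]
    # Step 3: the binarization call after which the structMap was invalid.
    commands.append(
        shlex.join(["ocrd-cis-ocropy-binarize", "-I", "OCR-D-IMG", "-O", "OCR-D-BIN"])
    )
    return commands


if __name__ == "__main__":
    # OCR-D-EXAMPLE-GROUP is a placeholder, not a group from the real workspace.
    for line in repro_commands(["OCR-D-IMG", "OCR-D-GT-SEG-LINE", "OCR-D-EXAMPLE-GROUP"]):
        print(line)
```

The commands are meant to be run from inside the workspace directory downloaded in step 1.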

Invalid structMap (multiple divs with the same ID) after step 3, shortened to one physical page for emphasis:

  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-IMG_1879_45_0344"/>
        <mets:fptr FILEID="OCR-D-GT-SEG-LINE_1879_45_0344"/>
      </mets:div>

       ...

      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-BIN_1879_45_0344.IMG-BIN"/>
      </mets:div>
  
       ...

    </mets:div>
  </mets:structMap>

(I'll upload the full data in the comments)
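
A quick way to check a mets.xml for this condition is to count the page div IDs in the physical structMap. The following is a minimal sketch using only the Python standard library, not part of ocrd; the function name `duplicate_page_ids` and the inline sample are made up for illustration, and it assumes the usual METS namespace `http://www.loc.gov/METS/`.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed namespace binding for the mets: prefix.
METS_NS = {"mets": "http://www.loc.gov/METS/"}


def duplicate_page_ids(mets_xml):
    """Return the sorted IDs that occur on more than one page div
    inside the physical structMap of the given METS document string."""
    root = ET.fromstring(mets_xml)
    counts = Counter()
    path = ".//mets:structMap[@TYPE='PHYSICAL']//mets:div[@TYPE='page']"
    for div in root.iterfind(path, METS_NS):
        page_id = div.get("ID")
        if page_id is not None:  # ignore divs without an ID attribute
            counts[page_id] += 1
    return sorted(page_id for page_id, n in counts.items() if n > 1)


# Shortened sample modelled on the snippet above, wrapped in a root element.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-IMG_1879_45_0344"/>
        <mets:fptr FILEID="OCR-D-GT-SEG-LINE_1879_45_0344"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_1879_45_0344">
        <mets:fptr FILEID="OCR-D-BIN_1879_45_0344.IMG-BIN"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""

if __name__ == "__main__":
    print(duplicate_page_ids(SAMPLE))  # ['P_1879_45_0344']
```

For a real workspace, the same function can be fed the contents of mets.xml read from disk; an empty list means no page ID is repeated.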

This causes all kinds of breakage all over the place.

What I haven't checked yet is whether this only breaks with ocrd_cis; maybe @bertsky can share his debugging efforts here. At first I had the impression that it also breaks with add. However, I was trying to reproduce a problem encountered by @stweil in OCR-D/quiver-benchmarks#22, and that specific workflow uses ocrd_cis as its first step, so the problem could have been in ocrd_cis all along and I could easily have confused something.
