There is a regression in 84a4e1a: when passing multiple pages for an image-only input fileGrp, e.g. `-g phys_0001,phys_0007 -I OCR-D-IMG`, the logic that tries to prevent mixing derived images with original images is now falsely triggered:
core/ocrd/ocrd/processor/base.py
Lines 118 to 125 in edf31fa
    ret = self.workspace.mets.find_all_files(
        fileGrp=self.input_file_grp, pageId=self.page_id, mimetype="//image/.*")
    if self.page_id and len(ret) > 1:
        raise ValueError("No PAGE-XML %s in fileGrp '%s' but multiple images." % (
            "for page '%s'" % self.page_id if self.page_id else '',
            self.input_file_grp
            ))
    return ret
The problem is that `self.page_id` here is actually a list (formatted in comma-join notation).
So the correct way of ensuring that no single page gets multiple image file results is by
- either disallowing `find_all_files` to aggregate them like this (which is probably valid in other contexts, though)
- or going through its result `ret` and checking whether any of its `pageId`s repeat:

    page_ids = [file.pageId for file in ret]
    if len(page_ids) != len(set(page_ids)):
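For illustration, the second option could look roughly like the sketch below. It is not the actual fix: `assert_one_image_per_page` is a hypothetical helper name, the `SimpleNamespace` objects merely stand in for the file objects returned by `find_all_files` (only their `pageId` attribute is assumed), and the error message is adapted from the existing one.

```python
from collections import Counter
from types import SimpleNamespace


def assert_one_image_per_page(files, file_grp):
    """Hypothetical helper: raise if any pageId occurs more than once.

    `files` is assumed to be a list of objects with a `pageId` attribute,
    like the result `ret` of `find_all_files` in the snippet above.
    """
    counts = Counter(f.pageId for f in files)
    # Collect only the page IDs that repeat, so the message can name them.
    duplicates = sorted(page_id for page_id, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(
            "No PAGE-XML for page(s) '%s' in fileGrp '%s' but multiple images."
            % (",".join(duplicates), file_grp))
    return files


# Stand-ins for METS file entries (assumption: only `pageId` matters here).
two_pages = [SimpleNamespace(pageId="phys_0001"),
             SimpleNamespace(pageId="phys_0007")]
same_page_twice = [SimpleNamespace(pageId="phys_0001"),
                   SimpleNamespace(pageId="phys_0001")]
```

With this check, `two_pages` (one image each for `phys_0001` and `phys_0007`) passes, while `same_page_twice` raises a `ValueError`, regardless of how many pages were requested via `-g`.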