Compensate for offset of PDF CropBox in Adobe OCR JSON format #9
The bottom-left corner of a PDF page as shown in the Labelbox labeling UI is not necessarily (0,0) in the PDF coordinate system. If the page has a CropBox that is smaller than its MediaBox, the visible origin can lie at coordinates greater than (0,0). The entire MediaBox can also be offset from (0,0).
This PR adjusts the Adobe OCR JSON format transformer tool to account for the offset of each page's CropBox.
The code change itself should be pretty self-explanatory, but in case the concepts here are unclear I've attached a ZIP with test data for three different scenarios: Adobe-Coords-Test-Data.zip
In the first case, the MediaBox starts at (0,0) but the CropBox is inset and starts at (25,25). An excerpt of the resulting JSON from Adobe's API follows.
In the second case, the MediaBox is inset to (25,25) and the CropBox coincides with the MediaBox.
In the third case, the MediaBox is inset to (10,10) and the CropBox is further inset by (+15,+15) to (25,25).
(Please pardon some inexactness in the numbers; I had to do some of the cropping by eye.)
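For readers who want the idea in code, here is a minimal sketch of the compensation, not the actual diff in this PR. It assumes each bounding box is given as `[x0, y0, x1, y1]` in PDF user-space units with a bottom-left origin, and that each page's CropBox is available in the same `[left, bottom, right, top]` form. The function name `shift_bounds_to_cropbox`, the page dimensions, and the sample word bounds are all hypothetical and chosen only for illustration.

```python
from typing import Sequence, Tuple


def shift_bounds_to_cropbox(
    bounds: Sequence[float], crop_box: Sequence[float]
) -> Tuple[float, float, float, float]:
    """Re-express bounds relative to the CropBox's bottom-left corner.

    Hypothetical helper: both arguments are assumed to be
    [x0, y0, x1, y1] in PDF user-space units, bottom-left origin.
    """
    x0, y0, x1, y1 = bounds
    crop_left, crop_bottom = crop_box[0], crop_box[1]
    return (x0 - crop_left, y0 - crop_bottom, x1 - crop_left, y1 - crop_bottom)


# Illustrative boxes for the three scenarios (page sizes are made up).
# Only the CropBox is used; the MediaBox is listed to show that its own
# offset does not matter once the CropBox origin is subtracted.
scenarios = {
    "case 1": {"media_box": [0, 0, 612, 792], "crop_box": [25, 25, 587, 767]},
    "case 2": {"media_box": [25, 25, 587, 767], "crop_box": [25, 25, 587, 767]},
    "case 3": {"media_box": [10, 10, 602, 782], "crop_box": [25, 25, 587, 767]},
}

word_bounds = [100, 700, 150, 712]  # made-up word position in PDF coordinates

for name, boxes in scenarios.items():
    print(name, shift_bounds_to_cropbox(word_bounds, boxes["crop_box"]))
```

Because the CropBox starts at (25,25) in all three scenarios, the same word ends up at the same position relative to the visible page in each case, regardless of where the MediaBox begins.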
The current behavior of the transformer tool (top) causes the resulting text layer to be offset when it is displayed within the Labelbox labeling UI. With the change in this PR, the text layer is correctly positioned in all three cases (bottom).

Note also that, because the CropBox offsets in my test files are relatively small, the incorrectly offset text layer is at least visible in the labeling UI. If the offsets are large, the text layer may be shifted entirely offscreen and not be visible at all. This is how I initially noticed the bug: I was feeding text layers into Labelbox, but they didn't show up in the UI at all.