Compensate for offset of PDF CropBox in Adobe OCR JSON format #9

okofish · 2023-08-27T16:59:51Z

The bottom-left corner of a PDF page as shown in the Labelbox labeling UI is not necessarily (0,0) in the PDF coordinate system. If the page has a CropBox that's smaller than its MediaBox, the apparent (visual) origin sits at the CropBox's lower-left corner, which can be above and to the right of (0,0). It's also possible for the entire MediaBox to be offset from (0,0).
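To make the idea concrete, here is a minimal sketch (not the tool's actual code) of how the visible origin could be derived from a page's boxes. The function name `visible_origin` is hypothetical, and the `[x0, y0, x1, y1]` box layout is assumed from the JSON excerpts in this description. The fallback and clipping follow the general PDF convention that a missing CropBox defaults to the MediaBox and that a CropBox extending past the MediaBox is clipped to it.

```python
from typing import Dict, List, Tuple


def visible_origin(boxes: Dict[str, List[float]]) -> Tuple[float, float]:
    """Return the lower-left corner of the visible page area in PDF user space.

    Assumes each box is laid out as [x0, y0, x1, y1] (hypothetical helper).
    """
    media = boxes["MediaBox"]
    # A page without an explicit CropBox is cropped to its MediaBox.
    crop = boxes.get("CropBox", media)
    # A CropBox that extends past the MediaBox is clipped to it.
    return (max(crop[0], media[0]), max(crop[1], media[1]))


# CropBox inset inside a MediaBox that starts at (0,0).
print(visible_origin({
    "MediaBox": [0.0, 0.0, 300.0, 200.0],
    "CropBox": [25.0, 25.0, 275.0, 175.0],
}))  # (25.0, 25.0)

# MediaBox itself offset, no explicit CropBox.
print(visible_origin({"MediaBox": [26.0, 26.0, 275.0, 175.0]}))  # (26.0, 26.0)
```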

This PR adjusts the Adobe OCR JSON format transformer tool to account for the offset of each page's CropBox.
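The adjustment amounts to a translation by the CropBox's lower-left corner. The sketch below illustrates that step only; it is not the code in this PR. The helper name `compensate_bounds` is hypothetical, the element bounds are assumed to be `[x0, y0, x1, y1]` in PDF user space, and the sample coordinates are made up for illustration. Any further conversion the transformer performs (such as axis flipping or unit scaling) is deliberately left out.

```python
from typing import List, Sequence


def compensate_bounds(bounds: Sequence[float], crop_box: Sequence[float]) -> List[float]:
    """Translate [x0, y0, x1, y1] from PDF user space into CropBox-relative space.

    Hypothetical helper: only the CropBox origin is subtracted; the box size
    is unchanged.
    """
    dx, dy = crop_box[0], crop_box[1]
    x0, y0, x1, y1 = bounds
    return [x0 - dx, y0 - dy, x1 - dx, y1 - dy]


# Illustrative values: a CropBox starting at (25,25) and a made-up text box.
crop_box = [25.0, 25.0, 275.0, 175.0]
print(compensate_bounds([50.0, 100.0, 120.0, 112.0], crop_box))
# [25.0, 75.0, 95.0, 87.0]
```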

The code change itself should be pretty self-explanatory, but in case the concepts here are unclear I've attached a ZIP with test data for three different scenarios: Adobe-Coords-Test-Data.zip

In the first case, the MediaBox starts at (0,0) but the CropBox is inset and starts at (25,25). An excerpt of the resulting JSON from Adobe's API follows.

"pages": [
    {
        "boxes": {
            "CropBox": [
                24.983993530273438,
                24.983993530273438,
                275.0159912109375,
                175.01600646972656
            ],
            "MediaBox": [
                0.0,
                0.0,
                300.0,
                200.0
            ]
        },
        "height": 150.03201293945312,
        "is_scanned": false,
        "page_number": 0,
        "rotation": 0,
        "width": 250.03199768066406
    }
]

In the second case, the MediaBox itself is offset to roughly (25,25) (the excerpt reports (26,26)), and the CropBox coincides with it.

"pages": [
    {
        "boxes": {
            "CropBox": [
                26.0,
                26.0,
                275.0,
                175.0
            ],
            "MediaBox": [
                26.0,
                26.0,
                275.0,
                175.0
            ]
        },
        "height": 149.0,
        "is_scanned": false,
        "page_number": 0,
        "rotation": 0,
        "width": 249.0
    }
]

In the third case, the MediaBox is offset to roughly (10,10) and the CropBox is inset a further (+15,+15), to roughly (25,25).

"pages": [
    {
        "boxes": {
            "CropBox": [
                24.975997924804688,
                23.975997924804688,
                276.02398681640625,
                175.0240020751953
            ],
            "MediaBox": [
                10.0,
                9.0,
                291.0,
                190.0
            ]
        },
        "height": 151.04800415039062,
        "is_scanned": false,
        "page_number": 0,
        "rotation": 0,
        "width": 251.04798889160156
    }
]

(Please pardon some inexactness in the numbers; I had to do some of the cropping by eye.)
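As a sanity check on the three excerpts, the sketch below reads the offset off each CropBox and confirms that the reported `width` and `height` match the CropBox extents rather than the MediaBox extents. This is an observation drawn only from these three test files, not a documented guarantee of Adobe's API. The helper names `crop_offset` and `crop_size` are hypothetical.

```python
import math

# The three page entries from the excerpts, reduced to the fields used here.
cases = [
    {"boxes": {"CropBox": [24.983993530273438, 24.983993530273438,
                           275.0159912109375, 175.01600646972656],
               "MediaBox": [0.0, 0.0, 300.0, 200.0]},
     "height": 150.03201293945312, "width": 250.03199768066406},
    {"boxes": {"CropBox": [26.0, 26.0, 275.0, 175.0],
               "MediaBox": [26.0, 26.0, 275.0, 175.0]},
     "height": 149.0, "width": 249.0},
    {"boxes": {"CropBox": [24.975997924804688, 23.975997924804688,
                           276.02398681640625, 175.0240020751953],
               "MediaBox": [10.0, 9.0, 291.0, 190.0]},
     "height": 151.04800415039062, "width": 251.04798889160156},
]


def crop_offset(page):
    """Lower-left corner of the CropBox: the offset to compensate for."""
    x0, y0, _, _ = page["boxes"]["CropBox"]
    return x0, y0


def crop_size(page):
    """Width and height spanned by the CropBox."""
    x0, y0, x1, y1 = page["boxes"]["CropBox"]
    return x1 - x0, y1 - y0


for number, page in enumerate(cases, start=1):
    width, height = crop_size(page)
    # The reported page size agrees with the CropBox extents in all three files.
    assert math.isclose(width, page["width"], abs_tol=1e-6)
    assert math.isclose(height, page["height"], abs_tol=1e-6)
    print(f"case {number}: offset = {crop_offset(page)}")
```

In all three files the offset is nonzero, including the second case where the CropBox and MediaBox are identical, so checking only for a CropBox smaller than the MediaBox would miss it.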

The current behavior of the transformer tool (top) causes the resulting text layer to be offset when it is displayed within the Labelbox labeling UI. With the change in this PR, the text layer is correctly positioned in all three cases (bottom).

Note also that because the CropBox offsets in my test files are relatively small, the incorrectly offset text layer is still visible in the labeling UI. With large offsets, the text layer may be shifted entirely offscreen and not be visible at all. This is how I initially noticed the bug: I was feeding text layers into Labelbox but they didn't show up in the UI at all.
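The offscreen case can be illustrated with a small check. This is a sketch under stated assumptions: the UI is assumed to draw the text layer relative to the visible page's corner, the helper `is_visible` is hypothetical, and the page and text coordinates are invented to show a large offset (they are not taken from the test files).

```python
def is_visible(bounds, page_width, page_height):
    """True if [x0, y0, x1, y1] overlaps the visible rectangle [0, w] x [0, h]."""
    x0, y0, x1, y1 = bounds
    return x1 > 0 and y1 > 0 and x0 < page_width and y0 < page_height


# Invented page: a 250 x 150 visible area whose CropBox starts at (1000,1000).
crop_box = [1000.0, 1000.0, 1250.0, 1150.0]
page_width = crop_box[2] - crop_box[0]
page_height = crop_box[3] - crop_box[1]

# Invented text box, expressed in PDF user space.
raw = [1010.0, 1020.0, 1100.0, 1032.0]
# The same box after subtracting the CropBox origin.
shifted = [raw[0] - crop_box[0], raw[1] - crop_box[1],
           raw[2] - crop_box[0], raw[3] - crop_box[1]]

print(is_visible(raw, page_width, page_height))      # False
print(is_visible(shifted, page_width, page_height))  # True
```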

Compensate for offset of PDF CropBox in Adobe OCR JSON format

4560912
