Suppress Table Lines and Graphics #13

@apsexton

Description

Identify reasonable character sizes by considering the distribution of the bounding-box sizes of the connected components (CCs) in the page. The smallest CCs will be noise, full stops, or the dots above lower-case i and j. As the size increases, we first find characters in the smallest font size, then progressively larger characters, then large fence symbols (braces, parentheses, etc.) and division lines, and finally graphics (diagrams, plots) and table lines. Since characters occur much more often in the page than the larger CCs, the distribution should make it easy to find a size limit that separates characters from non-characters.
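One way this size limit could be estimated is sketched below. It is only an illustration, not an agreed design: the function name `estimate_char_size_limit`, the `(x, y, w, h)` box format and the `factor` of 3 are all assumptions. Because characters dominate the page, the median box width and height should land on a typical character, and a multiple of the median serves as the cut-off.

```python
from statistics import median


def estimate_char_size_limit(boxes, factor=3.0):
    """Estimate the largest width and height a character CC may have.

    boxes  -- list of (x, y, w, h) bounding boxes, one per connected component
              (hypothetical format, chosen for this sketch)
    factor -- assumed multiplier on the median size; needs tuning on real pages

    Characters are far more frequent than lines or graphics, so the median
    width and height are taken as the typical character size.
    """
    if not boxes:
        raise ValueError("no connected components given")
    median_w = median(w for _, _, w, _ in boxes)
    median_h = median(h for _, _, _, h in boxes)
    return factor * median_w, factor * median_h


# Toy page: nine characters, one full stop, one table line, one diagram.
boxes = [(i * 12, 0, 10, 14) for i in range(9)]
boxes += [(110, 12, 2, 2), (0, 50, 400, 2), (0, 100, 300, 200)]
max_w, max_h = estimate_char_size_limit(boxes)
```

A histogram-based cut (looking for the gap after the dominant peak) would follow the description more closely; the median is used here only because it is short and robust to a few very large CCs.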

Identify any connected components that are significantly wider and taller than reasonable character sizes, under the assumption that these are either table lines or graphics. For the moment, we will simply suppress these objects: they are not used in our layout analysis, and they must not interfere with the rest of the analysis. Later on we will return to tables and integrate them into our analysis. For now, draw the bounding boxes of the identified table lines and diagrams in a different colour from those of character CCs.
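The suppression and colouring step might look like the following sketch. Everything named here is hypothetical: `classify_components`, `boxes_to_svg`, the `(x, y, w, h)` box format, the colours, and the choice of SVG as the drawing target (the project may well draw onto the page image instead). The `require_both` flag is also an assumption: the default follows the wording above (wider and taller), while `require_both=False` would also catch an isolated thin rule that is only wide or only tall.

```python
def classify_components(boxes, max_w, max_h, require_both=True):
    """Split CC bounding boxes into character boxes and suppressed boxes.

    boxes        -- list of (x, y, w, h) tuples (hypothetical format)
    max_w, max_h -- assumed character size limit from an earlier step
    require_both -- True: suppress only boxes that are too wide AND too tall;
                    False: suppress boxes that exceed either limit
    """
    chars, suppressed = [], []
    for box in boxes:
        _, _, w, h = box
        too_wide, too_tall = w > max_w, h > max_h
        if require_both:
            oversized = too_wide and too_tall
        else:
            oversized = too_wide or too_tall
        (suppressed if oversized else chars).append(box)
    return chars, suppressed


def boxes_to_svg(chars, suppressed, page_w, page_h,
                 char_colour="blue", suppressed_colour="red"):
    """Render both groups as outlined rectangles, one colour per group."""
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" '
             f'width="{page_w}" height="{page_h}">']
    for colour, group in ((char_colour, chars), (suppressed_colour, suppressed)):
        for x, y, w, h in group:
            parts.append(f'<rect x="{x}" y="{y}" width="{w}" height="{h}" '
                         f'fill="none" stroke="{colour}"/>')
    parts.append("</svg>")
    return "\n".join(parts)


# Toy page: a character, a full stop, a thin table line and a diagram.
page_boxes = [(0, 0, 10, 14), (20, 0, 2, 2), (0, 50, 400, 2), (0, 100, 300, 200)]
chars, suppressed = classify_components(page_boxes, max_w=30, max_h=42)
svg = boxes_to_svg(chars, suppressed, page_w=500, page_h=400)
```

Only the `chars` list would be passed on to the rest of the layout analysis; the `suppressed` list is kept so that tables can be revisited later.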
