@olivetthered

Basic implementation of importing text from PDF files.
No fonts or styling yet, and no second pass to group text areas together; just basic text area grouping and layout, but enough for additional features to be implemented fairly independently of each other. Currently only a single page is supported. To import the text from a PDF file as text, select the "import text as text" option (as opposed to the default, "import text as vector") in the GUI when importing a vector file of PDF type.

Import text from a PDF document, with some fuzzy matching to put lines of text that appear to belong together in the same text frame. Layout is good, but there is no font or styling support as of yet, and rotated text isn't supported either. Creates lots of text boxes if the PDF file reports lots of text regions; these also need joining up in a second pass to merge text regions that should be together regardless of what the PDF file is reporting.
UI for selecting text import as either vectors (default) or as text. There will need to be some more variables for text import so the user can configure how loose or strict the text block matching is, as I doubt that even with good guesses it will be a one-size-fits-all solution.
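
To illustrate the kind of fuzzy line matching and user-tunable strictness described above, here is a minimal sketch. It is not the code in this pull request: `TextLine`, `MatchTolerance`, `belongsTogether` and `groupLines` are hypothetical names, and the two thresholds (left-edge difference and baseline gap) are assumed heuristics chosen only for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these names come from the Scribus sources.
struct TextLine {
    double x;         // left edge of the line
    double baseline;  // baseline position, increasing down the page
    double height;    // nominal line height
    std::string text;
};

// Assumed user-configurable knobs for how loose or strict the matching is.
struct MatchTolerance {
    double maxIndent = 2.0;       // allowed left-edge difference, in points
    double maxLineSpacing = 1.5;  // allowed baseline gap, in line heights
};

using TextFrame = std::vector<TextLine>;

// True when `next` plausibly continues the block whose last line is `prev`.
bool belongsTogether(const TextLine& prev, const TextLine& next,
                     const MatchTolerance& tol)
{
    const double gap = next.baseline - prev.baseline;
    return std::fabs(next.x - prev.x) <= tol.maxIndent
        && gap > 0.0
        && gap <= tol.maxLineSpacing * prev.height;
}

// Single pass over lines in reading order: extend the current frame while
// lines keep matching, otherwise start a new frame.
std::vector<TextFrame> groupLines(const std::vector<TextLine>& lines,
                                  const MatchTolerance& tol)
{
    assert(tol.maxIndent >= 0.0 && tol.maxLineSpacing > 0.0);
    std::vector<TextFrame> frames;
    for (const TextLine& line : lines) {
        if (!frames.empty() && belongsTogether(frames.back().back(), line, tol))
            frames.back().push_back(line);
        else
            frames.push_back(TextFrame{line});
    }
    return frames;
}
```

Loosening `MatchTolerance` merges more lines into fewer frames, and tightening it splits them apart, which is the trade-off the extra import variables would expose to the user.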
@olivetthered
Author

Pending file review by ale

@olivetthered
Author

I raised the following bug to have this pull request reviewed and integrated:
https://bugs.scribus.net/view.php?id=16142

Implement text import as a new OutputDev inheriting from SlaOutputDev, making the appropriate private members of SlaOutputDev protected.
Tidy up so we make minimal changes from master.
Fixed some whitespace differences with master.
Override Type 3 font output, as we don't want to get confused and try to render these fonts as vectors when vector rendering is only partially functional due to the overrides of SlaOutputDev. Hopefully they can be implemented in the same way as addChar, but if that turns out to be infeasible the overrides can be removed and they can be rendered as vectors in the finished implementation.
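
The structure described in these commits can be sketched roughly as follows. This is an illustration under assumptions, not the actual Scribus or Poppler code: `VectorOutputDev` is a hypothetical stand-in for SlaOutputDev, `TextOutputDevSketch` stands in for the new text device, and the simplified `drawChar`/`beginType3Char`/`endType3Char` signatures and the "returns true means handled" convention are invented for the example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the vector output device (not the real SlaOutputDev).
class VectorOutputDev {
public:
    virtual ~VectorOutputDev() = default;

    // Default behaviour: every glyph becomes a vector outline.
    virtual void drawChar(char c) { m_elements.push_back(std::string("path:") + c); }

    // Assumed convention: returning false lets the caller run the glyph's
    // drawing operators, returning true means "already handled, skip them".
    virtual bool beginType3Char(char) { return false; }
    virtual void endType3Char() {}

    const std::vector<std::string>& elements() const { return m_elements; }

protected:
    // Previously private; protected so a subclass can add its own items.
    std::vector<std::string> m_elements;
};

// Hypothetical text-importing device that reuses the base class plumbing.
class TextOutputDevSketch : public VectorOutputDev {
public:
    // Collect characters as text instead of emitting outlines.
    void drawChar(char c) override { m_pending += c; }

    // Swallow Type 3 glyphs so they are not half-rendered as vectors.
    bool beginType3Char(char) override { m_inType3 = true; return true; }
    void endType3Char() override
    {
        assert(m_inType3);  // must pair with beginType3Char
        m_inType3 = false;
    }

    // Emit the collected characters as a single text item.
    void flushText()
    {
        if (m_pending.empty())
            return;
        m_elements.push_back("text:" + m_pending);
        m_pending.clear();
    }

private:
    std::string m_pending;
    bool m_inType3 = false;
};
```

Dropping the two Type 3 overrides from the subclass would fall back to the base behaviour, which matches the fallback plan of letting those glyphs be rendered as vectors.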
…taken

Change the name of TextOutputDev to PdfTextOutputDev, as it's already taken.
The PdfTextOutputDev naming matches the naming of PdfTextRecognition.
…variables

To make the classes and members uniform across the PdfTextRecognition implementation, rename all the classes, member variables and functions so they start with pdf text unless it's not appropriate.
Moved the output dev into the pdftextrecognition files, meaning the slaoutput dev files no longer have any dependencies on pdftextrecognition. This now keeps things neat and tidy and all together.
Fix z-order/grouping. I don't know why I did this in the first place.