-
-
Notifications
You must be signed in to change notification settings - Fork 1.3k
fix: hyphenation #3267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: hyphenation #3267
Conversation
|
ca8c77a to
fea3867
Compare
|
@diegomura sorry about the broken tests. However, your fix here makes it a lot more complicated for users to set up proper hyphenation, not simpler. Please let me explain. The goal of any reasonably-sized application in a western language is:
Requirement 1. has always been possible and still is. Requirement 2. was impossible in the past, was made possible with #3188 and now made much harder in this PR #3267. To properly implement this in #3188, only the (word, original) =>
original(word).flatMap(part => part.split(/(?<=-/)/))I.e. no knowledge of soft hyphens is effectively required, it's all already contained in the Now with this PR #3267, to implement the hyphen-less word breaking, users need to pre-insert soft hyphens into ALL of their strings in the whole application. That is, every single I chose the soft hyphen character because it is intended for almost exactly this use case. pdfkit itself already has support for soft hyphens, but it is never used inside react-pdf because the layout and textkit packages perform line breaking, and pdfkit's own line breaking capabilities are never triggered. That's why we have to remove all soft hyphens from the text before passing it to pdfkit. If you were willing to reconsider my solution or a similar solution, I'd be happy to have a look at how I broke the emoji and superscript tests, as well as the issue with multiple breaks in the same word. Please let me know about your opinion on this. |
|
@carlobeltrame nothing to be sorry about! And I'll always be open to reconsider solutions :) Consider this change mostly as a way to fix tests and publish a new version yesterday, but doesn't have to stay like this if it's not ideal. This was a breaking change anyways Let me break down your response though as some things I do not get. About Requirement 1, I agree was possible but was broken due to soft-hyphens persisting in attributed string but not on syllables as far as I understand. About Requirement 2, I was not considering them and you have a good point. But do we know what these cases are? For things like
My general thought about this is that if users are supposed to set this snippet every time, it shuold be built in in the lib.
This part I don't get 😄 Why is this, and why is it different than #3188 ? As far as I can tell, if the passed string has soft-hyphens, those will be used by the hyphenator, and if not they will be computed by
This I dont get either. How it's suposed to work now and I as far as I can tell does is:
Again, I get requirement 2 is not being fixed here but I'm I suspect those can be workaround this solution.
I surely agree snapshots are great, and those particularly are helpful. We have some sort of e2e tests with image regression testing, which kinda makes sense because they allow for internals to change while just ensuring the output remains the same, which is ultimately what the user cares about. But avoiding huge JSON snapshots that aren't really readable. Obviously they don't cover everything (like bookmarks, links, etc) but those are tested. elsewhere anyways. What I don't really liked about those though is they are testing things we shouldn't. Ex. they contained the serialized fonts, images (in case they have attachements) or any other internal thing that isn't really relevant to test on for layout. Imo those are essentially unreadable. When they will break (which they will, probably in a false way, given how many internal things they capture), given the difficulty to see diffs we will just hit Let me know what you think. Again, I'm very much in favor to have as many tests as we can, but they should be somehow manageable so tests do not get in the way (been there many times). I don't think those were :) |
|
I'll try to explain more, and give you proposed documentation for both versions, which will demonstrate what exactly is possible with #3188 and not possible with #3267.
I think this requirement has always been possible to a satisfying degree. I would ignore soft hyphens here, because most user-provided and developer-provided text does not contain soft hyphens.
I have listed many issues in the description of this PR. Specifically, the issues #1642, #2456, #2564, #1380, #1416 and #1662 all are solveable with my version #3188, but are not solvable with #3267 (see below the documentation proposals for details).
I agree that would be ideal, but applications might have different views on which characters or character combinations should be allowed to break. E.g. should we break on underscores
I think I was wrong in this point, in reality it's even worse. With this new PR #3267, hyphen-less word breaking is not possible anymore (or if it is, I may misunderstand your changes). See below the documentation proposals for a clearer explanation. After #3188, the docs could look like this:
After #3267, the docs would rather look something like this:
@diegomura if I misunderstand your changes, please let me know. For full support of both requirements, the hyphenation callback MUST have a way to specify, which split parts should be followed by an extra hyphen if broken, and which split parts should not. In #3188, I introduced this possibility via the soft hyphen character. As far as I can tell, with this PR #3267, you removed this capability again, breaking the solution for many of the issues linked above. If I understood this wrong, I'd be glad if you could show me a code example on how to acheive the hyphen-less word splitting with your version. |
I agree. I myself am using react-pdf in a special way without the react renderer package, with a Vue.js renderer that generates the input for the layout stage (so, react-pdf with Vue.js instead of React). So that is a reason why I am especially interested in stability of these internal representations. I understand if you cannot currently make stability there a part of the maintenance effort.
Maybe the specific bytes of the fonts aren't interesting, but if the fonts unexpectedly change or vanish or move to a different place in the internal representation object, would you want to know about it? I think these fonts rarely ever change, so having them in the snapshot shouldn't really make the snapshot test fail often, right? Or on the other hand, if you think these fonts don't really belong to the state worth testing, maybe they shouldn't be this deeply nested inside the internal representation object in the first place?
Well, snapshots are specifically a tool for when the output of an algorithm is too large to be readable. But I understand if you don't want them. Another way forward would be to restore the integration test, then reduce the internal state by recursively throwing away undesired contents such as the fonts, and then snapshot this reduced version, or even write manual assertions for it. This approach may or may not lead to less maintenance of this integration test, who knows... I didn't do this yet because the internal representation is extremely complex and I wouldn't know which parts to keep and which to ignore.
|
By "hyphen-less" word breaking here you mean essentially Requirement 2 right? Allowing some words (like urls) to break, but not add a hyphen char at the end of the line. If so, I agree #3267 dropped this ability and we need it back. How to get there I feel is where we might have different opinions here :)
Lots of examples you list here are already possible with #3267, like not breaking a company name (for which hyphenation custom fn shuold just return 1 syllable. ex Again, and correct me if I'm wrong, the question is how we give users control to break in a specific place without engine adding a hyphen at the end. Let's say we find a way to achieve this, would you think something else is missing?
I feel this is an incorrect use of soft-hyphen, and probably what got me confused about #3188 in the first place. Soft-hyphen unicode char purpose is to mark where a long word may be broken across lines if line-wrapping is needed, not to flag the engine wether a visible hyphen should be added or not. From what I get from your last part of your response you acknowledge this but you are fine with it. Going back to my original point of #3188 exposing soft-hyphen as a public api (1st sub-point of the PR description) to which you said it didn't, after reading your potential docs I believe even stronger that it does. Not only that but we (react-pdf) are adding a custom meaning to soft-hyphens which is non-standard (at least to my knowledge). Now if users do not add this somehow mysterious (to their eyes) unicode char at the end of each syllable they won't see the line wrapped like they expect 99% of the time which is with a In general I wanna think on maximizing DX for the common case, which in this case is hyphenating with a So how to still support Requirement 2? I can think on some solutions: 1. Allow to set hyphenation char for each word, optionally empty: Pros: semantic, backwards compatible (we can still support returning an array of strings for the most common cases) 2. Accept unicode for non hyphenated words: Similar approach but opposite direction: user doesn't want a Pros: Simpler return type of hyphenation callback (not sure) -- Regardless, I think things like URLs where common case is not to add @carlobeltrame thoughts? |
Agreed.
Yes, these are already solvable with #3267, but I wasn't listing impossible cases here, I was listing reasons why react-pdf can't ship with the one and only perfect hyphenation algorithm for all use cases. On the other hand, the GitHub issues #1642, #2456, #2564, #1380, #1416 and #1662 depend on requirement 2 to be solved. I think we are on the same page here.
Yes, agreed, this is the only feature that is effectively missing. Once we have a solution for this, we're good.
Well, according to the Wikipedia text, it's "for the purpose of breaking words across lines by inserting visible hyphens if they fall on the line end but remain invisible within the line". The
As detailed in the documentation proposal, most devs (for english language apps, or for apps using the npm hyphen library for syllable splitting) wouldn't need to worry about inserting the soft hyphens themselves, because react-pdf and the hyphen library already insert it for them. But of course, more explicit might be more clear here.
Multiple different hyphen chars inside the same word seems essential to me. As an example, the compound term "yellowish-blue" must be hyphenated as Font.registerHyphenationCallback(
(word, syllables) => syllables(word).flatMap(({ syllable, hyphen }) => {
const parts = syllable.split(/(?<=[-./])/)
return parts.map((part, i) => ({ syllable: part, hyphen: (i === parts.length - 1) ? hyphen : '' }))
})
)Technically, another advantage of this approach is that it meets more closely the feature asked for in #2456 and #1416, although I suspect nobody actually wants an end-of-line character other than
Yes, seems similar to my proposal, with similar downsides. For completeness sake, here is an implementation of a custom hyphenation callback in this scenario, as it might look in the future react-pdf documentation: const ZERO_WIDTH_SPACE = '\u200B'
Font.registerHyphenationCallback(
(word, syllables) => syllables(word).flatMap(part => part.split(/(?<=[-./])/).map(p => p + ZERO_WIDTH_SPACE))
)
As a dev using react-pdf, I welcome good defaults where appropriate. But I'd prefer customizability over opinionated defaults. As detailed above, I don't think react-pdf can and should handle all the eventualities. I would rather focus on offering the devs the tools to implement the domain-specific correct rules (and also override the default rules for URLs if possible!) |
|
Thanks for the quick responses here @carlobeltrame :)
I'm not sure I see why. Soft hyphen passed into a
Fair. And kinda agree. I don't think we should add defaults for a bunch of things. Just that we can add some of these handy defaults to the built-in hyphenation function so less people have the need to implement their own custom fn. Many people will still need, and for those we should give them all tools needed to customize hyphenation as they want. So about "But I'd prefer customizability over opinionated defaults", both are possible from my point of view. But happy to put a pin on defaults at the moment.
Agree. For those wanting fine control over hyphenation, code will be more complex. But it's also important to note that for those that don't need it, code will remain simpler. If we go for this spec, all these should be possible: No special handling: Font.registerHyphenationCallback(word =>
hyphen(word).split(SOFT_HYPHEN)
)Define hyphen at word level: Font.registerHyphenationCallback(word => (
{ syllables: hyphen(word).split(SOFT_HYPHEN)), hyphen: isUrl(word) ? '' : '-' }
))All control: Font.registerHyphenationCallback(word =>
hyphen(word).split(SOFT_HYPHEN).map(syllable => (
{ syllable, hyphen: '<computed>' }
)
))I like that it get's more complex as you gain more control, which kinda makes sense. re/ I also thought about #2456 and #1416, and if the goal is to give users all control in the word, it's a very nice to have. About implementation, it should be very straightforward as we can internally normalize to I'm the one now that can hear you saying that with trailing soft-hyphens you can control all of these with a single API 😄 And agree. But going back to my initial point, we would also be penalizing simple requirements users DX (not mentioning won't be possible to change Something important to me as well about implementation, there's no need at all to recompute all glyphs to remove soft-hyphens. I did not mentioned it before as it felt less important, but I didn't really liked this import (conceptually speaking, I don't want engines importing layout steps, the dependency between these is the opposite), plus any perf issue this might carry. Textkit is already too complex and sometimes expensive to have "re-layouts" internally if they can be avoided I'm not saying it's perfect, but measuring pros and cons I feel it's the best path to follow given the info I have. Still, I value your opinion and thoughts, so please throw what you think about it |
Regression from: #3188
#3188 broke emojis and text super/sub rendering.
After a more careful review, I changed some stuff. In short:
Also previous approach was slightly wrong as it did not hyphenate the same word more than once if needed
@carlobeltrame please take a look if you have a sec and report anything wrong :)