Skip to content

Conversation

@gryftir
Copy link

@gryftir gryftir commented Apr 6, 2014

added delimiter option to as_text to provide for a user supplied
delimiter. This allows users to better parse text from a website.
POD has been modified to reflect this change.

added delimtier option to as_text to provide for a user supplied
delimiter.  This allows users to better parse text from a website.
POD has been modified to reflect this change.
@madsen
Copy link
Owner

madsen commented May 17, 2014

I'm not convinced that this is a good idea (even without the delimiter => "0" bug). You're putting a delimiter after every text node, but the tree is allowed to have multiple text nodes in a row unless you've called normalize_content. You'd also be putting a delimiter after inline tags where it's probably not wanted, like <b> or <span>.

I think you'd be better off telling people to do something like:

my $text = join("\n", map { $_->as_text } $h->look_down(_tag => 'p'));

with the look_down condition depending on exactly what you're trying to do.

@gryftir
Copy link
Author

gryftir commented May 18, 2014

While I am not certain that my suggestion is the best possibility, I think a new user's expectation for as_text is that it won't remove spaces between words, even if those spaces are handled by the style sheets. For example, if you

Obviously we can't parse style sheets, so that's problematic. That as_text does sometimes combine words makes it hard to do things like word counts or parsing.

My suggestion would add an option that largely deals with the issue, given that tags, inline or otherwise, usually wrap whole words. You suggest the following solution.

my $text = join("\n", map { $_->as_text } $h->look_down(_tag => 'p'));

I see what you are saying, and for complex parsing, I'd probably do something like you suggest if I was using the current API, differentiating between paragraphs and divs and inline tags. But it would be more complex if you are looking at a whole document, or multiple heterogenous documents, because you have to consider anything that could contain text.

Maybe instead I should add an as_delimited_text method, which allows you to pass a delimiter and a subroutine to make decisions about the insertion of a delimiter, or if no subroutine is passed, adds the delimiter as my current function does?

As things currently stand though, it's not obvious to a module user that they should use look_down to deal with as_text combining words.

@madsen madsen changed the title modified: lib/HTML/Element.pm Add delimiter option to as_text May 18, 2014
@madsen madsen changed the title Add delimiter option to as_text Add delimiter option to as_text May 18, 2014
@madsen
Copy link
Owner

madsen commented May 18, 2014

You seem to have omitted an example at the end of your first paragraph.

@gryftir
Copy link
Author

gryftir commented May 18, 2014

Sorry that should have had https://gist.github.com/gryftir/2b15c2d62c70d8431e4c

You can see that you get things like GoogleSearch instead of Google Search.

@toddr
Copy link

toddr commented Aug 19, 2017

You might want to close this and re-submit at https://github.com/kentfredric/HTML-Tree/pulls

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants