Add delimiter option to as_text #3

gryftir · 2014-04-06T00:34:42Z

added delimiter option to as_text to provide for a user supplied
delimiter. This allows users to better parse text from a website.
POD has been modified to reflect this change.

added delimtier option to as_text to provide for a user supplied delimiter. This allows users to better parse text from a website. POD has been modified to reflect this change.

madsen · 2014-05-17T16:44:41Z

I'm not convinced that this is a good idea (even without the delimiter => "0" bug). You're putting a delimiter after every text node, but the tree is allowed to have multiple text nodes in a row unless you've called normalize_content. You'd also be putting a delimiter after inline tags where it's probably not wanted, like <b> or <span>.

I think you'd be better off telling people to do something like:

my $text = join("\n", map { $_->as_text } $h->look_down(_tag => 'p'));

with the look_down condition depending on exactly what you're trying to do.

gryftir · 2014-05-18T01:10:41Z

While I am not certain that my suggestion is the best possibility, I think a new user's expectation for as_text is that it won't remove spaces between words, even if those spaces are handled by the style sheets. For example, if you

Obviously we can't parse style sheets, so that's problematic. That as_text does sometimes combine words makes it hard to do things like word counts or parsing.

My suggestion would add an option that largely deals with the issue, given that tags, inline or otherwise, usually wrap whole words. You suggest the following solution.

my $text = join("\n", map { $_->as_text } $h->look_down(_tag => 'p'));

I see what you are saying, and for complex parsing, I'd probably do something like you suggest if I was using the current API, differentiating between paragraphs and divs and inline tags. But it would be more complex if you are looking at a whole document, or multiple heterogenous documents, because you have to consider anything that could contain text.

Maybe instead I should add an as_delimited_text method, which allows you to pass a delimiter and a subroutine to make decisions about the insertion of a delimiter, or if no subroutine is passed, adds the delimiter as my current function does?

As things currently stand though, it's not obvious to a module user that they should use look_down to deal with as_text combining words.

madsen · 2014-05-18T01:18:40Z

You seem to have omitted an example at the end of your first paragraph.

gryftir · 2014-05-18T03:34:10Z

Sorry that should have had https://gist.github.com/gryftir/2b15c2d62c70d8431e4c

You can see that you get things like GoogleSearch instead of Google Search.

toddr · 2017-08-19T03:12:28Z

You might want to close this and re-submit at https://github.com/kentfredric/HTML-Tree/pulls

modified: lib/HTML/Element.pm

dee525c

added delimtier option to as_text to provide for a user supplied delimiter. This allows users to better parse text from a website. POD has been modified to reflect this change.

madsen changed the title ~~modified: lib/HTML/Element.pm~~ Add delimiter option to as_text May 18, 2014

madsen changed the title ~~Add delimiter option to as_text~~ Add delimiter option to as_text May 18, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add delimiter option to as_text #3

Add delimiter option to as_text #3

Uh oh!

gryftir commented Apr 6, 2014

Uh oh!

madsen commented May 17, 2014

Uh oh!

gryftir commented May 18, 2014

Uh oh!

madsen commented May 18, 2014

Uh oh!

gryftir commented May 18, 2014

Uh oh!

toddr commented Aug 19, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Add delimiter option to as_text #3

Are you sure you want to change the base?

Add delimiter option to as_text #3

Uh oh!

Conversation

gryftir commented Apr 6, 2014

Uh oh!

madsen commented May 17, 2014

Uh oh!

gryftir commented May 18, 2014

Uh oh!

madsen commented May 18, 2014

Uh oh!

gryftir commented May 18, 2014

Uh oh!

toddr commented Aug 19, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants