Improve "r_unescape" regular expression to skip invalid HTML entities #98

stkao05 · 2015-03-30T17:09:00Z

Some invalid HTML entities (e.g. &#a;) are still matched by the regular expression r_unescape, which results in an error.

Example scenario

html = "<html><body><input name='opt in for&#a;todoist.com&#a;new site' /><p>hihi</p><body></html>"

plaintext = html2text.html2text(html)

Error traceback:

  File "todoist/scripts/test.py", line 16, in <module>
    plaintext = html2text.html2text(html)
  File "/home/vagrant/todoist/libs/ist_libs/python/html2text.py", line 812, in html2text
  File "/home/vagrant/todoist/libs/ist_libs/python/html2text.py", line 252, in handle
  File "/home/vagrant/todoist/libs/ist_libs/python/html2text.py", line 249, in feed
  File "/usr/lib/python2.7/HTMLParser.py", line 117, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 161, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 308, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/home/vagrant/todoist/libs/ist_libs/python/html2text.py", line 715, in unescape
  File "/home/vagrant/todoist/libs/ist_libs/python/html2text.py", line 710, in replaceEntities
  File "/home/vagrant/todoist/libs/ist_libs/python/html2text.py", line 685, in charref
ValueError: invalid literal for int() with base 10: 'a'
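
The traceback shows the text 'a' from &#a; reaching int() in base 10, so the pattern accepted a numeric reference that is neither decimal nor x-prefixed hexadecimal. A minimal sketch of a stricter pattern follows. It is an illustration under assumptions, not the code in this pull request: the names R_UNESCAPE_STRICT, charref and unescape are hypothetical stand-ins, and it targets Python 3 (html.entities) rather than the Python 2.7 environment in the traceback.

```python
import re
from html.entities import name2codepoint

# Hypothetical stricter pattern (not the library's actual r_unescape).
# It only matches:
#   &#123;    decimal reference: '#' followed by digits only
#   &#x1F;    hex reference: '#x' or '#X' followed by hex digits
#   &name;    named reference: a letter, then up to 7 letters or digits
# Something like &#a; matches none of the alternatives and is left alone.
R_UNESCAPE_STRICT = re.compile(
    r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]{0,7});"
)


def charref(name):
    """Convert the part after '#' (e.g. '65' or 'x41') to a character."""
    if name[0] in "xX":
        return chr(int(name[1:], 16))
    return chr(int(name))


def unescape(text):
    """Replace valid entities; leave anything unrecognised untouched."""
    def replace(match):
        body = match.group(1)
        if body.startswith("#"):
            try:
                return charref(body[1:])
            except (ValueError, OverflowError):
                # Code point outside the valid range: keep the original text.
                return match.group(0)
        if body in name2codepoint:
            return chr(name2codepoint[body])
        # Unknown named entity: keep the original text.
        return match.group(0)

    return R_UNESCAPE_STRICT.sub(replace, text)
```

With this sketch, unescape("opt in for&#a;todoist.com&#a;new site") returns its input unchanged instead of raising, while well-formed references such as &#65; and &amp; are still converted.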

Fix unescape() error when a invalid html entity is given

be14e78

stefanor pushed a commit to stefanor/html2text that referenced this pull request Jan 14, 2016

Remove python 3.5 support, due to aaronsw#98

9157181
