@Insutanto

When html2text parses HTML such as:
<b> &#8206;June, 2016</b>
it raises an IndexError.
I traced the bug to how handle_charref passes its result to handle_data.

handle_data may index the first character of its input (data[0]), but the LRM and RLM characters are defined as '' (an empty string).
So when handle_data receives the data for an LRM or RLM character reference, the string is empty and data[0] fails in this condition:
+++++++++++++++++++++++++++++++++++
        elif (self.preceding_stressed
              and re.match(r'[^\s.!?]', data[0])
              and not hn(self.current_tag)
              and self.current_tag not in ['a', 'code', 'pre']):
+++++++++++++++++++++++++++++++++++
This is the traceback:
Traceback (most recent call last):
  File "get_email.py", line 37, in <module>
    text = h.handle(mail_content_string)  # convert HTML to Markdown
  File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/__init__.py", line 149, in handle
    self.feed(data)
  File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/__init__.py", line 146, in feed
    HTMLParser.HTMLParser.feed(self, data)
  File "/usr/lib64/python3.4/html/parser.py", line 165, in feed
    self.goahead(0)
  File "/usr/lib64/python3.4/html/parser.py", line 268, in goahead
    self.handle_charref(name)
  File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/__init__.py", line 186, in handle_charref
    self.handle_data(self.charref(c), True)
  File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/__init__.py", line 802, in handle_data
    and re.match(r'[^\s.!?]', data[0])
IndexError: string index out of range
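
The failure can be reproduced without html2text. The sketch below is only an illustration, not the library's code: `unsafe_check` and `safe_check` are hypothetical names, and only the regular expression is taken from the condition quoted above. It shows that indexing an empty string raises the IndexError seen in the traceback, and that passing the whole string to re.match avoids it, because re.match already anchors at the start and the pattern consumes a single character.

```python
import re

# Pattern copied from the quoted condition: one character that is
# neither whitespace nor '.', '!' or '?'.
PATTERN = r'[^\s.!?]'


def unsafe_check(data):
    """Hypothetical stand-in for the failing test: indexes data[0] first."""
    return re.match(PATTERN, data[0]) is not None


def safe_check(data):
    """Hypothetical guarded variant: no indexing, so '' simply does not match."""
    return re.match(PATTERN, data) is not None


if __name__ == "__main__":
    empty = ''  # what the description says LRM/RLM are defined as
    try:
        unsafe_check(empty)
    except IndexError as exc:
        print("unsafe_check('') raised IndexError:", exc)
    print("safe_check('') ->", safe_check(empty))
    print("safe_check('June, 2016') ->", safe_check('June, 2016'))
```

An equivalent guard, assuming the surrounding condition is left otherwise unchanged, would be to test `data` for truthiness before evaluating `data[0]`.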