Improve quote detection by finding start and end quotation marks #380
afriedman412 wants to merge 11 commits into chartbeat-labs:develop from
Conversation
apparently they don't support PY3.9 ... note to self: figure out the readthedocs situation, this seems... not great
Hi @afriedman412, thanks for submitting these changes! I'm pretty sure I follow the updated logic, but it would be particularly helpful to have new tests (or just new test cases for an existing test) that confirm what we should expect to be handled by it. Would you be able to add some in here?
bdewilde
left a comment
i've suggested a few changes for performance and stylistic consistency reasons, but nothing major. my big ask continues to be additional testing to make sure that these changes do what we want them to, and to illustrate the improved extraction cases. i can certainly come up with some, but you surely know better than i what to test here.
src/textacy/constants.py
Outdated
| """ | ||
| Ordinal points of the token.is_quote characters, matched up by start and end. | ||
|
|
||
| source: | ||
| switch = "\"\'" | ||
| start = "“‘```“‘«‹「『„‚" | ||
| end = "”’’’’”’»›」』”’" | ||
|
|
||
| """ |
nit: could you move this docstring to directly below (rather than above) the object's assignment?
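i.e., something roughly like this (a sketch only, with the set values abbreviated):

```python
QUOTATION_MARK_PAIRS = {
    (34, 34),
    # ...
}
"""
Ordinal points of the token.is_quote characters, matched up by start and end.
"""
```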
src/textacy/constants.py
Outdated
```python
"""
QUOTATION_MARK_PAIRS = {
    (34, 34),
```
not critical, but helpful: could you add the corresponding marks in a comment on each line?
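Something like this (the pairs beyond `(34, 34)` are just illustrative here, not the full set from the PR):

```python
QUOTATION_MARK_PAIRS = {
    (34, 34),      # " ... "
    (8216, 8217),  # ‘ ... ’
    (8220, 8221),  # “ ... ”
    (171, 187),    # « ... »
}
```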
src/textacy/extract/triples.py
Outdated
```python
for n, q in enumerate(qtok):
    if q.i not in [q_[1] for q_ in qtok_pair_idxs]:
        for q_ in qtok[n+1:]:
            if (ord(q.text), ord(q_.text)) in constants.QUOTATION_MARK_PAIRS:
                qtok_pair_idxs.append((q.i, q_.i))
                break
```
i haven't profiled this code, but this series of nested loops feels like it could be a performance bottleneck. is there a still-clear but more performant alternative?
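One direction (an unprofiled sketch, reusing `qtok` and `constants.QUOTATION_MARK_PAIRS` from the diff above): keep the already-claimed end indices in a set so the membership test is O(1) instead of rebuilding a list on every outer iteration.

```python
matched_end_idxs = set()
qtok_pair_idxs = []
for n, q in enumerate(qtok):
    # skip marks already claimed as the closing mark of an earlier pair
    if q.i in matched_end_idxs:
        continue
    for q_ in qtok[n + 1:]:
        if q_.i in matched_end_idxs:
            continue
        if (ord(q.text), ord(q_.text)) in constants.QUOTATION_MARK_PAIRS:
            qtok_pair_idxs.append((q.i, q_.i))
            matched_end_idxs.add(q_.i)
            break
```

Note the extra `q_.i` check also stops one closing mark from being claimed by two different opening marks, which may or may not match the intended semantics.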
src/textacy/extract/triples.py
Outdated
```python
# def get_cue_candidates(window_sents, verbs):
#     return [
#         tok
#         for sent in window_sents
#         for tok in sent
#         if filter_cue_candidates(tok, verbs)
#     ]
```
should this still be here?
src/textacy/extract/triples.py
Outdated
```python
return all([
    tok.pos == VERB,
    tok.lemma_ in verbs
])
```
Suggested change:

```diff
-return all([
-    tok.pos == VERB,
-    tok.lemma_ in verbs
-])
+return (tok.pos == VERB and tok.lemma_ in verbs)
```
src/textacy/extract/triples.py
Outdated
```python
return all([
    candidate.pos!=PUNCT,
    ((candidate.i >= i and candidate.i >= j) or (candidate.i <= i and candidate.i <= j)),
])
```
Suggested change:

```diff
-return all([
-    candidate.pos!=PUNCT,
-    ((candidate.i >= i and candidate.i >= j) or (candidate.i <= i and candidate.i <= j)),
-])
+return (
+    candidate.pos != PUNCT
+    and ((candidate.i >= i and candidate.i >= j) or (candidate.i <= i and candidate.i <= j))
+)
```
src/textacy/extract/triples.py
Outdated
```python
for i_, j_ in list(zip(
    [tok.i for tok in doc if tok.text=="\n"],
    [tok.i for tok in doc if tok.text=="\n"][1:])
):
```
trying to improve efficiency/performance here
Suggested change:

```diff
-for i_, j_ in list(zip(
-    [tok.i for tok in doc if tok.text=="\n"],
-    [tok.i for tok in doc if tok.text=="\n"][1:])
-):
+lb_tok_idxs = [tok.i for tok in doc if tok.text == "\n"]
+for i_, j_ in zip(lb_tok_idxs, lb_tok_idxs[1:]):
```
src/textacy/extract/triples.py
Outdated
| """ | ||
| if para: | ||
| i_, j_ = line_break_window(i, j, doc) | ||
| if i_: |
Suggested change:

```diff
-    if i_:
+    if i_ is not None:
```
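Presumably the concern is that a legitimate token index of 0 is falsy; a tiny illustration with a made-up return value:

```python
i_, j_ = 0, 5          # a window that starts at the very first token
if i_:                 # falsy 0: the window is wrongly treated as missing
    ...
if i_ is not None:     # explicit check behaves as intended
    ...
```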
src/textacy/extract/triples.py
Outdated
```python
return (
    sent
    for sent in doc.sents
    # these boundary cases are a subtle bit of work...
```
src/textacy/extract/triples.py
Outdated
```python
windower(qtok_start_idx, qtok_end_idx, doc, True),
windower(qtok_start_idx, qtok_end_idx, doc)
```
is it possible for these two approaches to ever produce identical windows? does that matter?
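If overlap does turn out to matter, one option (a sketch only; `iter_unique_sents` is a hypothetical helper, assuming both windows yield spaCy sentence `Span`s) would be to dedupe by sentence boundaries before yielding:

```python
def iter_unique_sents(*windows):
    """Yield sentences from each window in turn, skipping exact repeats."""
    seen = set()
    for window in windows:
        for sent in window:
            key = (sent.start, sent.end)
            if key not in seen:
                seen.add(key)
                yield sent
```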
hey @afriedman412, does the newer PR (#382) supersede this one?
…x` package, `min_quote_length` is now a `direct_quotations` parameter (not a constant), added a better example for testing linebreaks that function as closing quotes
The current code for detecting quotes is pretty unsophisticated. It just sequentially pairs anything that `token.is_quote` deems a quotation mark and assumes those indexes are the quote boundaries. If there is an odd number of quotation marks, it throws an error. I've been doing quote detection in some unreliably formatted text lately, which has things like "»" used as bullet points and lots of unpredictable stray characters, so I came up with a workaround.
My version only matches specific types of quotation marks with their accepted counterparts. For example, if there is an opening “, it will pair it with the next closing ”. If it never finds a match, it ignores the singleton quotation mark. I did this by making a list of 11 approved tuples of Unicode code points (which live in `.constants`). For example:
> Bill told me I "shouldn‘t wear those pants" but I will.

In the current version, running quote detection here would raise an error because there are three quotation mark-like tokens in the sentence. Even if it didn't, it would return "shouldn" as a quote because textacy assumes sequential quotation marks are quote boundaries.
My version takes the first quotation mark (`q`) and iterates through all the later quotation marks until it finds one (`q_`) where `(ord(q.text), ord(q_.text))` is in the list of acceptable pairs. Here, that pair would be `(34, 34)` because a double quotation mark is Unicode number 34.
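Roughly, the pairing pass can be sketched like this (a simplified illustration rather than the exact PR code; the token indices and the abbreviated `QUOTATION_MARK_PAIRS` set are made up for the example):

```python
# Abbreviated (start, end) ordinal pairs; the PR defines 11 of these in constants.py.
QUOTATION_MARK_PAIRS = {(34, 34), (8216, 8217), (8220, 8221)}  # " ", ‘ ’, “ ”

def pair_quote_marks(marks):
    """Pair each quote-like token with the next acceptable counterpart.

    `marks` is a list of (token index, token text) tuples for every token
    flagged by spaCy's `token.is_quote`. Singleton marks that never find a
    counterpart are simply dropped.
    """
    pairs = []
    used_ends = set()
    for n, (i, text) in enumerate(marks):
        if i in used_ends:  # already consumed as a closing mark
            continue
        for j, text_ in marks[n + 1:]:
            if j in used_ends:
                continue
            if (ord(text), ord(text_)) in QUOTATION_MARK_PAIRS:
                pairs.append((i, j))
                used_ends.add(j)
                break
    return pairs

# The example sentence has three quote-like tokens; only the two straight
# double quotes (ordinal 34) form an accepted pair, so the stray ‘ is ignored.
print(pair_quote_marks([(4, '"'), (7, "‘"), (10, '"')]))  # [(4, 10)]
```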