Improve quote detection by finding start and end quotation marks #380
afriedman412 wants to merge 11 commits into chartbeat-labs:develop from
Conversation
apparently they don't support PY3.9 ... note to self: figure out the readthedocs situation, this seems... not great
Hi @afriedman412, thanks for submitting these changes! I'm pretty sure I follow the updated logic, but it would be particularly helpful to have new tests (or just new test cases for an existing test) that confirm what we should expect to be handled by it. Would you be able to add some in here?
bdewilde
left a comment
i've suggested a few changes for performance and stylistic consistency reasons, but nothing major. my big ask continues to be additional testing to make sure that these changes do what we want them to, and to illustrate the improved extraction cases. i can certainly come up with some, but you surely know better than i what to test here.
src/textacy/constants.py
Outdated
| """ | ||
| Ordinal points of the token.is_quote characters, matched up by start and end. | ||
|
|
||
| source: | ||
| switch = "\"\'" | ||
| start = "“‘```“‘«‹「『„‚" | ||
| end = "”’’’’”’»›」』”’" | ||
|
|
||
| """ |
nit: could you move this docstring to directly below (rather than above) the object's assignment?
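i.e., something roughly like this (a sketch only, with the set values abbreviated):

```python
QUOTATION_MARK_PAIRS = {
    (34, 34),
    # ...
}
"""
Ordinal points of the token.is_quote characters, matched up by start and end.
"""
```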
src/textacy/constants.py
Outdated
```python
"""
QUOTATION_MARK_PAIRS = {
    (34, 34),
```
not critical, but helpful: could you add the corresponding marks in a comment on each line?
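Something like this (the pairs beyond `(34, 34)` are just illustrative here, not the full set from the PR):

```python
QUOTATION_MARK_PAIRS = {
    (34, 34),      # " ... "
    (8216, 8217),  # ‘ ... ’
    (8220, 8221),  # “ ... ”
    (171, 187),    # « ... »
}
```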
src/textacy/extract/triples.py
Outdated
```python
for n, q in enumerate(qtok):
    if q.i not in [q_[1] for q_ in qtok_pair_idxs]:
        for q_ in qtok[n+1:]:
            if (ord(q.text), ord(q_.text)) in constants.QUOTATION_MARK_PAIRS:
                qtok_pair_idxs.append((q.i, q_.i))
                break
```
i haven't profiled this code, but this series of nested loops feels like it could be a performance bottleneck. is there a still-clear but more performant alternative?
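One direction (an unprofiled sketch, reusing `qtok` and `constants.QUOTATION_MARK_PAIRS` from the diff above): keep the already-claimed end indices in a set so the membership test is O(1) instead of rebuilding a list on every outer iteration.

```python
matched_end_idxs = set()
qtok_pair_idxs = []
for n, q in enumerate(qtok):
    # skip marks already claimed as the closing mark of an earlier pair
    if q.i in matched_end_idxs:
        continue
    for q_ in qtok[n + 1:]:
        if q_.i in matched_end_idxs:
            continue
        if (ord(q.text), ord(q_.text)) in constants.QUOTATION_MARK_PAIRS:
            qtok_pair_idxs.append((q.i, q_.i))
            matched_end_idxs.add(q_.i)
            break
```

Note the extra `q_.i` check also stops one closing mark from being claimed by two different opening marks, which may or may not match the intended semantics.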
src/textacy/extract/triples.py
Outdated
```python
# def get_cue_candidates(window_sents, verbs):
#     return [
#         tok
#         for sent in window_sents
#         for tok in sent
#         if filter_cue_candidates(tok, verbs)
#     ]
```
should this still be here?
src/textacy/extract/triples.py
Outdated
```python
return all([
    tok.pos == VERB,
    tok.lemma_ in verbs
])
```
Suggested change:

```diff
-return all([
-    tok.pos == VERB,
-    tok.lemma_ in verbs
-])
+return (tok.pos == VERB and tok.lemma_ in verbs)
```
src/textacy/extract/triples.py
Outdated
```python
return all([
    candidate.pos!=PUNCT,
    ((candidate.i >= i and candidate.i >= j) or (candidate.i <= i and candidate.i <= j)),
])
```
Suggested change:

```diff
-return all([
-    candidate.pos!=PUNCT,
-    ((candidate.i >= i and candidate.i >= j) or (candidate.i <= i and candidate.i <= j)),
-])
+return (
+    candidate.pos != PUNCT
+    and ((candidate.i >= i and candidate.i >= j) or (candidate.i <= i and candidate.i <= j))
+)
```
src/textacy/extract/triples.py
Outdated
```python
for i_, j_ in list(zip(
    [tok.i for tok in doc if tok.text=="\n"],
    [tok.i for tok in doc if tok.text=="\n"][1:])
):
```
trying to improve efficiency/performance here
Suggested change:

```diff
-for i_, j_ in list(zip(
-    [tok.i for tok in doc if tok.text=="\n"],
-    [tok.i for tok in doc if tok.text=="\n"][1:])
-):
+lb_tok_idxs = [tok.i for tok in doc if tok.text == "\n"]
+for i_, j_ in zip(lb_tok_idxs, lb_tok_idxs[1:]):
```
src/textacy/extract/triples.py
Outdated
| """ | ||
| if para: | ||
| i_, j_ = line_break_window(i, j, doc) | ||
| if i_: |
Suggested change:

```diff
-    if i_:
+    if i_ is not None:
```
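Presumably the concern is that a legitimate token index of 0 is falsy; a tiny illustration with a made-up return value:

```python
i_, j_ = 0, 5          # a window that starts at the very first token
if i_:                 # falsy 0: the window is wrongly treated as missing
    ...
if i_ is not None:     # explicit check behaves as intended
    ...
```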
src/textacy/extract/triples.py
Outdated
```python
return (
    sent
    for sent in doc.sents
    # these boundary cases are a subtle bit of work...
```
src/textacy/extract/triples.py
Outdated
```python
windower(qtok_start_idx, qtok_end_idx, doc, True),
windower(qtok_start_idx, qtok_end_idx, doc)
```
is it possible for these two approaches to ever produce identical windows? does that matter?
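If overlap does turn out to matter, one option (a sketch only; `iter_unique_sents` is a hypothetical helper, assuming both windows yield spaCy sentence `Span`s) would be to dedupe by sentence boundaries before yielding:

```python
def iter_unique_sents(*windows):
    """Yield sentences from each window in turn, skipping exact repeats."""
    seen = set()
    for window in windows:
        for sent in window:
            key = (sent.start, sent.end)
            if key not in seen:
                seen.add(key)
                yield sent
```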
hey @afriedman412, does the newer PR (#382) supersede this one?
…x` package, `min_quote_length` is now a `direct_quotations` parameter (not a constant), added a better example for testing linebreaks that function as closing quotes
The current code for detecting quotes is pretty unsophisticated. It just sequentially pairs anything that `token.is_quote` deems a quotation mark and assumes those indexes are the quote boundaries. If there is an odd number of quotation marks, it throws an error. I've been doing quote detection in some unreliably formatted text lately, which has things like "»" used as bullet points and lots of unpredictable stray characters, so I came up with a workaround.
My version only matches specific types of quotation marks with their accepted counterparts. For example, if there is an opening “, it will pair it with the next closing ”. If it never finds a match, it ignores the singleton quotation mark. I did this by making a list of 11 approved tuples of Unicode code points (which live in `.constants`). For example:
> Bill told me I "shouldn‘t wear those pants" but I will.

In the current version, running quote detection here would raise an error because there are three quotation mark-like tokens in the sentence. Even if it didn't, it would return "shouldn" as a quote because textacy assumes sequential quotation marks are quote boundaries.
My version takes the first quotation mark (`q`) and iterates through all the later quotation marks until it finds one (`q_`) where `(ord(q.text), ord(q_.text))` is in the list of acceptable pairs. Here, that pair would be `(34, 34)` because a double quotation mark is Unicode number 34.
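Roughly, the pairing pass can be sketched like this (a simplified illustration rather than the exact PR code; the token indices and the abbreviated `QUOTATION_MARK_PAIRS` set are made up for the example):

```python
# Abbreviated (start, end) ordinal pairs; the PR defines 11 of these in constants.py.
QUOTATION_MARK_PAIRS = {(34, 34), (8216, 8217), (8220, 8221)}  # " ", ‘ ’, “ ”

def pair_quote_marks(marks):
    """Pair each quote-like token with the next acceptable counterpart.

    `marks` is a list of (token index, token text) tuples for every token
    flagged by spaCy's `token.is_quote`. Singleton marks that never find a
    counterpart are simply dropped.
    """
    pairs = []
    used_ends = set()
    for n, (i, text) in enumerate(marks):
        if i in used_ends:  # already consumed as a closing mark
            continue
        for j, text_ in marks[n + 1:]:
            if j in used_ends:
                continue
            if (ord(text), ord(text_)) in QUOTATION_MARK_PAIRS:
                pairs.append((i, j))
                used_ends.add(j)
                break
    return pairs

# The example sentence has three quote-like tokens; only the two straight
# double quotes (ordinal 34) form an accepted pair, so the stray ‘ is ignored.
print(pair_quote_marks([(4, '"'), (7, "‘"), (10, '"')]))  # [(4, 10)]
```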