Improved quote detection and attribution #382
afriedman412 wants to merge 8 commits into chartbeat-labs:main
bdewilde left a comment
hey @afriedman412, thanks for your patience on this. i've left comments requesting a handful of minor changes. there's also a consistent formatting issue that makes diffing a bit hard; could you run black over the changed modules, so that the code formatting is standard / consistent with the rest of textacy?
src/textacy/extract/triples.py
Outdated
| xcomp | ||
| ) | ||
| from spacy.tokens import Doc, Span, Token | ||
| import regex as re |
textacy doesn't currently have regex as a dependency. is it possible to use the stdlib re module here instead?
probably? I've had some issues using re when working with finicky regular expressions, so I kind of use regex by default now, but I can test the stdlib module.
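A minimal sketch of the newline check using only the stdlib `re` module, assuming the pattern stays as simple as the `r"\n"` match visible in this diff. The token texts below are hypothetical stand-ins for spaCy `Token.text` values; other patterns in the PR are not shown here and may behave differently.

```python
import re

# Hypothetical token texts standing in for spaCy `Token.text` values.
token_texts = ['"', "Hello", "\n", "\n\n", "world", "\u201d"]

# Compile once; `match` anchors at the start of the string, mirroring the
# `re.match(r"\n", tok.text)` call in the diff.
NEWLINE_RE = re.compile(r"\n")

# Keep only the texts that begin with a newline character.
newline_like = [text for text in token_texts if NEWLINE_RE.match(text)]
```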
| import collections | ||
| from operator import attrgetter | ||
| from typing import Iterable, Mapping, Optional, Pattern | ||
| from typing import Iterable, Mapping, Optional, Pattern, Literal |
Suggested change:
- from typing import Iterable, Mapping, Optional, Pattern, Literal
+ from typing import Iterable, Literal, Mapping, Optional, Pattern
| content = doc[qtok_start_idx : qtok_end_idx + 1] | ||
| # pairs up quotation-like characters based on acceptable start/end combos | ||
| # see constants for more info | ||
| qtoks = [tok for tok in doc if tok.is_quote or (re.match(r"\n", tok.text))] |
why do we consider tokens with "\n" in them to be quotation-like?
some formatting dictates that if you start a new paragraph while quoting someone, you start the next paragraph with a quotation mark even though the original quotation mark is never closed. in that case the linebreak functions as a closing quotation mark.
i added a test for it, but it's not a great example -- I'll find a better one.
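To illustrate the convention described above, here is a rough sketch in which a linebreak closes a quotation that is still open. It works on plain strings rather than spaCy tokens, and `pair_quote_marks` plus its opener/closer sets are hypothetical names for illustration, not the PR's implementation.

```python
def pair_quote_marks(tokens):
    """Return (start, end) index pairs; a newline may close an open quote."""
    # Assumed, simplified opener/closer sets for this sketch only.
    openers = {"\u201c", '"'}
    closers = {"\u201d", '"'}
    pairs = []
    start = None
    for i, text in enumerate(tokens):
        if start is None:
            # Not inside a quotation: look for an opening mark.
            if text in openers:
                start = i
        elif text in closers or text.startswith("\n"):
            # Inside a quotation: a closing mark OR a linebreak ends it.
            pairs.append((start, i))
            start = None
    return pairs


# A two-paragraph quotation: the first paragraph is never explicitly closed.
tokens = ["\u201c", "First", "paragraph", ".", "\n",
          "\u201c", "Second", "paragraph", ".", "\u201d"]
pairs = pair_quote_marks(tokens)
```

Here the first pair ends at the linebreak token and the second at the real closing mark, so both paragraphs are captured as quoted content.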
src/textacy/extract/triples.py
Outdated
| obj, | ||
| pobj, | ||
| xcomp, | ||
| xcomp |
this comma was here for a reason -- black put it there automatically :)
Suggested change:
- xcomp
+ xcomp,
src/textacy/constants.py
Outdated
| OBJ_DEPS: set[str] = {"attr", "dobj", "dative", "oprd"} | ||
| AUX_DEPS: set[str] = {"aux", "auxpass", "neg"} | ||
|
|
||
| MIN_QUOTE_LENGTH: int=4 |
i'm not sure this is a "constant" value, seems more like a parameter with a default value that should go in the direct_quotations extraction function. what do you think?
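A sketch of what moving the value from a module constant to a keyword parameter could look like. `keep_long_quotes` is a hypothetical helper, not textacy's `direct_quotations`. The default of 4 comes from the diff above, while measuring length in tokens and using `>=` are assumptions.

```python
from typing import Iterable, Iterator


def keep_long_quotes(
    candidates: Iterable[list[str]],
    min_quote_length: int = 4,  # default taken from the diff; callers may override
) -> Iterator[list[str]]:
    """Yield only candidate quotes with at least `min_quote_length` tokens."""
    for tokens in candidates:
        # Assumption: length is counted in tokens and the bound is inclusive.
        if len(tokens) >= min_quote_length:
            yield tokens


candidates = [["No"], ["I", "will", "not", "go"]]
default_kept = list(keep_long_quotes(candidates))
all_kept = list(keep_long_quotes(candidates, min_quote_length=1))
```

With a parameter, a caller analyzing short interjections can lower the threshold without editing a shared constant.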
| and q.i > qtok_idx_pairs[-1][1] | ||
| ): | ||
| for q_ in qtoks[n+1:]: | ||
| if (ord(q.text), ord(q_.text)) in constants.QUOTATION_MARK_PAIRS: |
why do we store -- and compare against -- the ord values instead of just the "raw" text quotation marks?
could you elaborate on that?
there are lots of ways something that is supposed to be a raw text quotation mark gets tokenized incorrectly, when you start dealing with encoding, decoding, pulling text from html, escape character issues, the whitespace_ issue above, etc etc etc. as the edge cases piled up, I realized the ord value was consistent no matter what. so I decided to use that instead.
got it. i think at a later point i may revisit some of this logic, under the assumption that the user has dealt with bad text encodings, etc. before attempting quote detection. but it's probably fine for now :)
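For reference, a small sketch of the code-point comparison discussed in this thread. The pairs listed are an assumed subset for illustration; the actual contents of `constants.QUOTATION_MARK_PAIRS` are not shown here, and `is_quote_pair` is a hypothetical helper.

```python
# Assumed subset of (opening, closing) code-point pairs; the real
# constants.QUOTATION_MARK_PAIRS in the PR may differ.
QUOTATION_MARK_PAIRS = {
    (34, 34),      # straight double quotes: " ... "
    (8220, 8221),  # curly double quotes: U+201C ... U+201D
    (8216, 8217),  # curly single quotes: U+2018 ... U+2019
}


def is_quote_pair(start: str, end: str) -> bool:
    """Check whether two single-character strings form an accepted pair."""
    # ord() raises TypeError on strings longer than one character,
    # so multi-character tokens are rejected up front.
    if len(start) != 1 or len(end) != 1:
        return False
    return (ord(start), ord(end)) in QUOTATION_MARK_PAIRS
```

For a well-formed single-character string, comparing `ord` values and comparing the characters themselves give the same result, so a set of raw character pairs would be an equivalent representation.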
| child | ||
| for tc in tok_and_conjuncts | ||
| for child in tc.children | ||
| # TODO: why doesn't compound import from spacy.symbols? |
just wondering why this line was deleted? it's a comment, for me! :)
i thought it was my comment lol
tests/extract/test_triples.py
Outdated
| ), | ||
| ) | ||
| ], | ||
| ) | ||
|
|
looks like your code editor is making some spurious changes (here and elsewhere) that aren't PEP-compliant / black-enforced. we'll want to fix these before any merge.
tests/extract/test_triples.py
Outdated
| ) | ||
| ] | ||
| ) | ||
|
|
…x` package, min_quote_length is now a `direct_quotations` parameter (not a constant), added a better example for testing linebreaks that function as closing quotes
Description
Motivation and Context
This is part of a larger project to create a package to combine quote detection and attribution with coreference resolution, which will be used for the analysis of several thousand newspaper articles.
How Has This Been Tested?
A/B testing with random samples of said articles, test creation after major changes.
(New tests added as well.)