Open
Description
Normalization is currently done in two different ways:
From text.py
def normalize_text(text: Optional[str], special_case_apostrophes: bool = False) -> str:
    """
    Normalize text.

    Parameters
    ----------
    text: string
        the text to normalize
    special_case_apostrophes: bool
        whether to replace apostrophes with empty strings rather than spaces

    Returns
    -------
    string: the normalized text
    """
    if text is None or len(text) == 0:
        return ""
    norm_text = unidecode(text).lower()
    if special_case_apostrophes:
        norm_text = norm_text.replace("'", "")
    norm_text = RE_NORMALIZE_WHOLE_NAME.sub(" ", norm_text)
    norm_text = re.sub(r"\s+", " ", norm_text).strip()
    return norm_text

- For name tuples only, everything is replaced with a space, including dots, dashes, etc.
- For all other purposes, including features, apostrophes are replaced with nothing and everything else is replaced with spaces.
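To make the difference between the two modes concrete, here is a minimal, self-contained sketch. It is not the code from text.py: `RE_NON_ALNUM` is a hypothetical stand-in for `RE_NORMALIZE_WHOLE_NAME` (whose actual pattern is not shown here), and `ascii_fold` is a rough stdlib approximation of `unidecode`.

```python
import re
import unicodedata
from typing import Optional

# Hypothetical stand-in for RE_NORMALIZE_WHOLE_NAME; the real pattern may differ.
RE_NON_ALNUM = re.compile(r"[^a-z0-9]+")


def ascii_fold(text: str) -> str:
    # Rough stdlib approximation of unidecode: decompose accented characters,
    # then drop anything that is not plain ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def normalize_sketch(text: Optional[str], special_case_apostrophes: bool = False) -> str:
    if not text:
        return ""
    norm = ascii_fold(text).lower()
    if special_case_apostrophes:
        # Apostrophes vanish instead of becoming spaces.
        norm = norm.replace("'", "")
    # Every remaining run of non-alphanumeric characters becomes one space.
    norm = RE_NON_ALNUM.sub(" ", norm)
    return re.sub(r"\s+", " ", norm).strip()


print(normalize_sketch("O'Brien"))                                 # o brien
print(normalize_sketch("O'Brien", special_case_apostrophes=True))  # obrien
print(normalize_sketch("Jo-Ann", special_case_apostrophes=True))   # jo ann
```

Under these assumptions, only the apostrophe handling differs between the two modes; dashes and dots turn into spaces either way.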
The first and middle names are then concatenated, and the result is split. The first element of the split becomes the first name, and the rest becomes the middle name.
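The concatenate-and-split step can be sketched as follows. The function name `split_first_middle` is hypothetical, and joining with a single space and splitting on whitespace are assumptions about how the existing code behaves.

```python
from typing import Tuple


def split_first_middle(first: str, middle: str) -> Tuple[str, str]:
    # Assumption: the two parts are joined with a space and split on whitespace.
    tokens = f"{first} {middle}".split()
    if not tokens:
        return "", ""
    # The first token is the first name; everything else is the middle name.
    return tokens[0], " ".join(tokens[1:])


print(split_first_middle("jo ann", "marie"))  # ('jo', 'ann marie')
print(split_first_middle("joann", "marie"))   # ('joann', 'marie')
```

Note how a normalized first name of "jo ann" loses its second half to the middle name, while "joann" stays intact.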
This is suboptimal because we want Jo Ann, JoAnn, and Jo-Ann to be mapped to the same thing. The rules above do not currently achieve this, but it is possible, and the result will even match Joann if lower-casing is applied universally, which is already part of the text normalization.
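One possible way to get the desired behavior, shown only as an illustration and not as the agreed fix, is to compare names on a key with all separators removed. The helper `collapse_key` is hypothetical.

```python
import re


def collapse_key(name: str) -> str:
    # Lower-case, then delete (rather than space out) every non-alphanumeric character.
    return re.sub(r"[^a-z0-9]", "", name.lower())


variants = ["Jo Ann", "JoAnn", "Jo-Ann", "Joann"]
keys = {collapse_key(v) for v in variants}
print(keys)  # {'joann'}
```

Where such a key should be applied (to the first name only, or before the first/middle split) is left open here.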
This should precede #38, and these improved normalizations can also be applied to the name tuples before counting.