First/Middle name normalization needs to be improved

Normalization is currently done in two different ways.

From `text.py`:

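```python
import re
from typing import Optional

from unidecode import unidecode

# RE_NORMALIZE_WHOLE_NAME is a compiled regex defined elsewhere in text.py (not shown here).


def normalize_text(text: Optional[str], special_case_apostrophes: bool = False) -> str:
    """
    Normalize text.

    Parameters
    ----------
    text: string
        the text to normalize
    special_case_apostrophes: bool
        whether to replace apostrophes with empty strings rather than spaces

    Returns
    -------
    string: the normalized text
    """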
    if text is None or len(text) == 0:
        return ""

    norm_text = unidecode(text).lower()

    if special_case_apostrophes:
        norm_text = norm_text.replace("'", "")

    norm_text = RE_NORMALIZE_WHOLE_NAME.sub(" ", norm_text)
    norm_text = re.sub(r"\s+", " ", norm_text).strip()

    return norm_text
```

1. For _name tuples only_, everything is replaced with a space, including dots, dashes, etc.
2. For all other purposes, including features, apostrophes are replaced with nothing (`special_case_apostrophes=True`) and everything else is replaced with spaces.

The first and middle names are then concatenated and split again. The first element of the split becomes the first name, and the rest becomes the middle name.
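The concatenate-and-split step can be sketched roughly as follows. This is a minimal illustration, not the actual code path: `simple_normalize`, `split_first_middle`, and the `_NON_ALNUM` pattern are hypothetical stand-ins (the real `RE_NORMALIZE_WHOLE_NAME` may differ), `unidecode` is omitted to keep the snippet dependency-free, and splitting on whitespace is an assumption.

```python
import re
from typing import Tuple

# Hypothetical stand-in for RE_NORMALIZE_WHOLE_NAME; the real pattern may differ.
_NON_ALNUM = re.compile(r"[^a-z0-9]+")


def simple_normalize(text: str, special_case_apostrophes: bool = False) -> str:
    """Simplified normalize_text: lower-case, then map non-alphanumerics to spaces."""
    norm = text.lower()
    if special_case_apostrophes:
        norm = norm.replace("'", "")
    return _NON_ALNUM.sub(" ", norm).strip()


def split_first_middle(first: str, middle: str) -> Tuple[str, str]:
    """Concatenate first and middle, re-split; first token is first, rest is middle."""
    tokens = simple_normalize(f"{first} {middle}").split()
    if not tokens:
        return "", ""
    return tokens[0], " ".join(tokens[1:])


for raw_first in ("Jo Ann", "JoAnn", "Jo-Ann"):
    print(raw_first, "->", split_first_middle(raw_first, "Marie"))
# Jo Ann -> ('jo', 'ann marie')
# JoAnn -> ('joann', 'marie')
# Jo-Ann -> ('jo', 'ann marie')
```

Under these assumptions, the space and the dash both push "ann" into the middle name, while the unseparated spelling keeps it in the first name.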

This is suboptimal because we want Jo Ann, JoAnn, and Jo-Ann to map to the same thing. That does not happen under the rules above: Jo Ann and Jo-Ann end up with first name "jo" and "ann" in the middle name, while JoAnn stays "joann". A unified mapping is possible, and it would even match Joann, because lower-casing is already part of the text normalization.
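One possible direction, sketched here purely as an assumption (the issue does not specify a fix), is to build a collapsed comparison key from the first-name field before it is concatenated with the middle name. `collapse_first_name` is a hypothetical helper, and a real version would presumably transliterate with `unidecode` first, as `normalize_text` does.

```python
import re

# Anything that is not an ASCII lower-case letter is dropped entirely
# (not replaced with a space), so separators no longer split the name.
_NON_LETTERS = re.compile(r"[^a-z]+")


def collapse_first_name(first: str) -> str:
    """Hypothetical key: lower-case the first-name field and remove all separators."""
    return _NON_LETTERS.sub("", first.lower())


variants = ["Jo Ann", "JoAnn", "Jo-Ann", "Joann"]
keys = {collapse_first_name(v) for v in variants}
print(keys)  # {'joann'}
```

The trade-off is that a first-name field that actually contains a middle name (for example "Mary Jane") would also be collapsed into a single token, so a sketch like this would need to be checked against real name data before being adopted.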

This should precede #38, and the improved normalization can also be applied to the name tuples before counting.

First/Middle name normalization needs to be improved #39

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

First/Middle name normalization needs to be improved #39

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions