This repository was archived by the owner on May 5, 2022. It is now read-only.
Adding ftfy to fix encoding issues #537
Open
albarrentine wants to merge 2 commits into openaddresses:master
…on encoding issues at the conform level
Member
Looking good, I’ll keep an eye on this PR!
Contributor
Author
Ran tests locally but saw the same test fail on master as on this branch. Is that a known issue?
Member
The failing test was added here, to uncover GDAL 2.x related issues: #514 It looks like
Hey all,
As mentioned in openaddresses/openaddresses#2392, this PR adds the supremely handy ftfy to machine for fixing common encoding issues like Mojibake, etc. Should also help in cases where there's mixed encoding, say if a source contains some legacy data with a different encoding that was never converted after a migration.
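For readers unfamiliar with the term, here is a minimal standard-library sketch of how Mojibake typically arises and how it can be undone by hand. The helper names `garble` and `repair` are hypothetical and exist only for illustration; ftfy's job is roughly to detect when this kind of round trip is appropriate, without being told which encodings were mixed up.

```python
def garble(text):
    """Simulate Mojibake: UTF-8 bytes mistakenly decoded as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Undo the mistake by reversing the wrong decode (manual fix)."""
    return garbled.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    broken = garble("Café")
    print(broken)          # CafÃ©
    print(repair(broken))  # Café
```

The manual repair only works when the wrong encoding is known in advance, which is rarely the case for mixed legacy data.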
As a dependency it's a simple pure-Python library so doesn't add much weight, works on Python 2 & 3, etc. I use it whenever I have to deal with text from the web, including on the Python/ingestion side of libpostal when importing data sets like OpenAddresses and OSM.
It works on Unicode, so essentially just wrap every call to .decode with a call to ftfy.fix_encoding. Here's a (Python2) usage example:

At present my dev setup for machine is not fully baked so using the PR to run tests.
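The wrapping described above might look something like the following. This is a Python 3 sketch rather than the Python 2 example from the PR, and `decode_and_fix` is a hypothetical helper name, not code from machine. The fallback to plain decoding when ftfy is not installed is also an assumption, added only so the sketch runs anywhere.

```python
# Sketch only: decode_and_fix is a hypothetical helper, not code from this PR.
try:
    import ftfy  # third-party dependency: pip install ftfy
except ImportError:
    # Assumption for illustration: degrade to plain decoding without ftfy.
    ftfy = None


def decode_and_fix(raw, encoding="utf-8"):
    """Decode bytes to text, then let ftfy repair Mojibake if available."""
    text = raw.decode(encoding)
    if ftfy is not None:
        # fix_encoding operates on already-decoded (Unicode) text.
        text = ftfy.fix_encoding(text)
    return text


if __name__ == "__main__":
    print(decode_and_fix("Café".encode("utf-8")))
```

Because `ftfy.fix_encoding` takes decoded text, it slots in after the existing `.decode` call and should leave correctly decoded input unchanged.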