I had a play with string_score, and while it's excellent for what it does, it doesn't quite fit my use case out of the box. In particular, it lacks recognition for highly similar substrings that are in a different order.
tl;dr: Do you plan to add support for out-of-order substring matching? Or can you at least think of a smart way to do it? If this is totally outside the scope of string_score, then go ahead and close this. I'm mostly just rubber ducking the problem.
For background, I have two spreadsheets where each row represents a building with some attributes, including ID and name. I'm told that both spreadsheets contain the same 66 buildings, except that one of the spreadsheets has 72 rows, and neither of them uses IDs or names consistently. One will abbreviate some names, the other will abbreviate others, or the same ones but in a different way. It's a mess, so I'd like an automated, objective mechanism for associating the "matching" rows and ultimately merging the attributes.
For example, when searching for a match for "2G8 Bahagian Pinjaman Perumahan", string_score with 0.5 fuzziness thinks that "PMO" is a better match than "LOT 2G8 (2M10 & 2M11) Bhg. Pinjaman Perumahan, JPM". Or for a more English example, comparing "university of oxford" with "oxford of university" scores 0.027.
To address this failure mode, I've wrapped it in a pretty gnarly loop:
- Both the search string and the comparison string are split into words
- Each word of the search string is string_score'd against each word of the comparison string (concatenating the whole comparison string as an option for matching abbreviations)
- For each search word, the score for the most similar comparison word is recorded
- The score for the search string against the comparison string is taken as the sum of the (maximum) score of each word, normalised by dividing by the number of search words.
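Roughly, the steps above look like the sketch below. It is a simplified illustration, not my exact code: `wordwiseScore` and `prefixScore` are names made up for this example, `prefixScore` is a trivial stand-in scorer (not string_score) so the snippet runs on its own, and joining the comparison words without spaces is just one reading of the "concatenate the whole string" step.

```javascript
// Hypothetical stand-in scorer (NOT string_score): the fraction of leading
// characters two words share, case-insensitively. Returns a value in [0, 1].
function prefixScore(a, b) {
  const x = a.toLowerCase();
  const y = b.toLowerCase();
  let i = 0;
  while (i < x.length && i < y.length && x[i] === y[i]) i++;
  return i / Math.max(x.length, y.length, 1);
}

// Order-insensitive score of `search` against `candidate`.
// `scoreFn(searchWord, option)` is expected to return a number in [0, 1].
function wordwiseScore(search, candidate, scoreFn) {
  const searchWords = search.split(/\s+/).filter(Boolean);
  const candidateWords = candidate.split(/\s+/).filter(Boolean);
  if (searchWords.length === 0 || candidateWords.length === 0) return 0;

  // Each candidate word is an option, plus the whole candidate joined
  // together (assumed here: without spaces) to give abbreviations a chance.
  const options = candidateWords.concat(candidateWords.join(''));

  let total = 0;
  for (const word of searchWords) {
    // Keep only the best-scoring option for this search word.
    let best = 0;
    for (const option of options) {
      best = Math.max(best, scoreFn(word, option));
    }
    total += best;
  }

  // Normalise by the number of search words.
  return total / searchWords.length;
}

// Word order no longer matters: every search word finds an exact partner.
console.log(wordwiseScore('university of oxford', 'oxford of university', prefixScore)); // 1
```

With string_score loaded, the scorer passed in would be a thin wrapper around its `score` call at 0.5 fuzziness instead of `prefixScore`. The inner loop is where the extra cost comes from: one score call per pair of search word and option.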
Clearly this is more expensive (something like an order of magnitude, or at least a factor of the average number of words per string), but it's pretty easy to implement given what string_score already does. Can you think of a straightforward way to modify your algorithm to handle this kind of case? Or even just a smarter way to package it than mine?