Token based Fuzzy Match (from old wiki) #3989
Closed
chenlica
started this conversation in
archived-wiki
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
From the page https://github.com/apache/texera/wiki/Token-based-Fuzzy-Match (may be dangling)
===
Authors: Varun Bharill, Parag Saraogi.
Reviewers:
Ideas from paper - "Efficient Merging and Filtering Algorithms for Approximate String Searches"
A single document can be treated as a string and all the N grams of each documents need to be indexed. The final result of this task would be an inverted index of N-grams Vs document-ID.
Questions:
a) How do we fix "N" ?
b) For a particular gram are the document IDs stored in sorted order ?
Given a Query string (q), compute all N-grams. From the index created in step 1, we have a list of documents in which each gram of query string appears.
The T-occurance problem is to find the documents with at least T occurances.
Available algorithms -
a. Heap count
b. Scan Count
c. Divide skip
etc.
Questions:
Lets say, a phrase "University of california" appears in a certain document d1. Assume N = 5. So all the 5-grams of this phase have been indexed. Assume query string q = "University california". From step 2, many 5-grams of the query string will map to different documents. The index created in step 1 might as well give the location of the grams in the document. Based on the mappings of the 5-grams of query string, can we identify a relevant phrase in the document ?
We have studied various filtering mechanisms like Length filtering, Position filtering, Prefix filtering which can be combined in the form of a tree structure, where every level corresponds to a filter. Also we studied the sequence in which these filters should be applied to gain maximum performance advantages.
Presentation on Token based fuzzy matching -
https://docs.google.com/presentation/d/123dmjHnazpfU82TVmlT3OXfcCGlK6o0NgLUWAUGj_ns/edit#slide=id.g12b64ffe2d_0_259
Beta Was this translation helpful? Give feedback.
All reactions