Token based Fuzzy Match (from old wiki) #3989

chenlica · 2025-10-21T05:42:05Z

chenlica
Oct 21, 2025
Collaborator

From the page https://github.com/apache/texera/wiki/Token-based-Fuzzy-Match (may be dangling)

===

Authors: Varun Bharill, Parag Saraogi.

Reviewers:

Ideas from paper - "Efficient Merging and Filtering Algorithms for Approximate String Searches"

Creating N-gram inverted lists.
A single document can be treated as a string and all the N grams of each documents need to be indexed. The final result of this task would be an inverted index of N-grams Vs document-ID.

Questions:
a) How do we fix "N" ?
b) For a particular gram are the document IDs stored in sorted order ?

T-occurance problem.
Given a Query string (q), compute all N-grams. From the index created in step 1, we have a list of documents in which each gram of query string appears.
The T-occurance problem is to find the documents with at least T occurances.
Available algorithms -
a. Heap count
b. Scan Count
c. Divide skip
etc.

Questions:
Lets say, a phrase "University of california" appears in a certain document d1. Assume N = 5. So all the 5-grams of this phase have been indexed. Assume query string q = "University california". From step 2, many 5-grams of the query string will map to different documents. The index created in step 1 might as well give the location of the grams in the document. Based on the mappings of the 5-grams of query string, can we identify a relevant phrase in the document ?

Filtering

We have studied various filtering mechanisms like Length filtering, Position filtering, Prefix filtering which can be combined in the form of a tree structure, where every level corresponds to a filter. Also we studied the sequence in which these filters should be applied to gain maximum performance advantages.

Presentation on Token based fuzzy matching -
https://docs.google.com/presentation/d/123dmjHnazpfU82TVmlT3OXfcCGlK6o0NgLUWAUGj_ns/edit#slide=id.g12b64ffe2d_0_259

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Token based Fuzzy Match (from old wiki) #3989

Uh oh!

{{title}}

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

Token based Fuzzy Match (from old wiki) #3989

Uh oh!

chenlica Oct 21, 2025 Collaborator

Replies: 0 comments

chenlica
Oct 21, 2025
Collaborator