diff --git a/doc/Strings.md b/doc/Strings.md index c0a28a650e7..c50bf860d7a 100644 --- a/doc/Strings.md +++ b/doc/Strings.md @@ -96,7 +96,7 @@ The key bottleneck of the algorithm is step 2: identifying overlapping strings. Recall that an overlap is a pair (left, right) where there is some string that is simultaneously a suffix of left, and a prefix of right. For example, `splitpea` and `peasoup` has overlap of 3. We wish to find all parents and all overlaps. We are armed with a suffix array: a sorted list of suffixes of all our strings, where each suffix points back to the string(s) that contain it. -We loop over the right strings, and find all overlapping left mates. For each right string, we loop over its prefixes in increasing length. For example, given `peasoup` we loop over `p`, `pe`, `pea`... Call this the test prefix. We use binary search in our suffix array to identify suffixes that are prefixed by the the test prefix. This set of suffixes is necessarily contiguous, because the suffix array is sorted. If a suffix is exactly equal to that prefix, then we have an overlap. This is necessarily the leftmost element of our range, because our range is sorted. +We loop over the right strings, and find all overlapping left mates. For each right string, we loop over its prefixes in increasing length. For example, given `peasoup` we loop over `p`, `pe`, `pea`... Call this the test prefix. We use binary search in our suffix array to identify suffixes that are prefixed by the test prefix. This set of suffixes is necessarily contiguous, because the suffix array is sorted. If a suffix is exactly equal to that prefix, then we have an overlap. This is necessarily the leftmost element of our range, because our range is sorted. A key idea is that moving to the next prefix can only narrow the matching suffixes. For example, only suffixes that have `pe` as a prefix may have `pea` as a prefix. Therefore we do not need to reset the binary search range across iterations. Furthermore, for each iteration, we only need to consider one character in each suffix, since we know all previous characters necessarily match. This is key to performance.