-
Notifications
You must be signed in to change notification settings - Fork 0
BasicSearchAlgorithms
At its core, the search problem is about finding specific information within a larger collection of data. We are given a collection of items and a criterion (often called a search key), and we want to locate the item(s) in the collection that match the key.
Often, the data we are searching for isn't just a single item identified by a key, but rather a key associated with additional value. For example, in a phone book, you search for a person's name (the key) to find their phone number (the value). In a dictionary, you search for a word (the key) to find its definition (the value). This leads us naturally to consider search in the context of key-value pairs.
Search is a fundamental operation in computer science and is ubiquitous in almost every software application you use:
- Databases: Finding specific records.
- File Systems: Locating files by name.
- Web Search Engines: Finding relevant web pages for a query.
- Dictionaries and Maps: Storing and retrieving data associated with unique identifiers.
- Compilers and Interpreters: Looking up variable names in symbol tables.
- Networking: Routing data packets based on destination addresses.
The efficiency of search algorithms directly impacts the performance of these applications. A slow search can make an entire system feel sluggish. Therefore, understanding and implementing efficient search methods is crucial for any computer scientist.
The primary goal when studying search algorithms and data structures designed for searching is to achieve efficient data retrieval. What constitutes "efficient" often depends on the size of the data collection and the frequency of search operations versus other operations like adding or removing data. We will analyze the performance of different approaches using Big O notation to understand how their execution time scales with the size of the input data, N.
A Symbol Table is an Abstract Data Type (ADT) that models the concept of a collection of key-value pairs. Think of it like a dictionary where each word (key) has a unique definition (value).
- Keys are unique identifiers.
- Each key is associated with exactly one value.
- The primary operations involve using a key to interact with its associated value.
Symbol Tables are also known by other names like Map, Dictionary, or Associative Array in various programming languages and contexts.
A basic Symbol Table ADT provides the following fundamental operations:
-
put(key, value): Adds a new key-value pair to the table. If the key already exists, the old value is overwritten with the new value.void put(Key key, Value value);
-
get(key): Returns the value associated with the given key. Returnsnull(or throws an error) if the key is not in the table.Value get(Key key);
-
delete(key): Removes the key and its associated value from the table.void delete(Key key);
-
contains(key): Returnstrueif the table contains the given key,falseotherwise. Often implemented usingget().boolean contains(Key key);
-
size(): Returns the number of key-value pairs in the table.int size();
-
isEmpty(): Returnstrueif the table is empty,falseotherwise. Often implemented usingsize().boolean isEmpty();
-
keys(): Returns an iterable collection of all keys in the table. The order of keys is not specified for a general Symbol Table.Iterable<Key> keys();
Some Symbol Table implementations maintain the keys in sorted order. This allows for additional useful operations beyond the basic ones, which are not possible (or not efficient) if the keys are not ordered.
Assuming keys are comparable (implement an interface like Java's Comparable), an Ordered Symbol Table API might include:
-
min(): Returns the smallest key in the table.Key min();
-
max(): Returns the largest key in the table.Key max();
-
floor(key): Returns the largest key in the table less than or equal to the givenkey.Key floor(Key key);
-
ceil(key): Returns the smallest key in the table greater than or equal to the givenkey.Key ceil(Key key);
-
rank(key): Returns the number of keys in the table strictly less than the givenkey. This tells you where the key would be if the table were sorted.int rank(Key key);
-
select(k): Returns the key of rankk(the k-th smallest key).Key select(int k); // 0-indexed or 1-indexed depending on convention
-
keys(low, high): Returns an iterable collection of all keys in the table within the specified range (lowtohigh, inclusive).Iterable<Key> keys(Key low, Key high);
These ordered operations demonstrate the power of maintaining data in a sorted structure, which we will see is a requirement for efficient binary search.
Sequential search, also known as linear search, is the simplest search algorithm. It involves examining each element in a collection one by one, in sequence, starting from the beginning, until the target element is found or the end of the collection is reached.
Here's the basic idea to find a target key in a collection:
- Start at the first element.
- Compare the current element's key with the
targetkey. - If they match, the key is found. Return the associated value.
- If they don't match, move to the next element.
- Repeat steps 2-4 until the key is found or there are no more elements to check.
- If the end is reached without finding the key, it's not in the collection.
A key characteristic of sequential search is that it does not require the data to be in any specific order.
Let N be the number of elements in the collection.
-
Best Case: The target element is the first element checked. This requires only 1 comparison. Performance is
$O(1)$ . -
Worst Case: The target element is the last element checked, or the element is not present in the collection. This requires checking all N elements. Performance is
$O(N)$ . -
Average Case: On average, if the element is present and equally likely to be in any position, we might expect to check about half the elements (
$N/2$ ). Big O notation ignores constant factors, so the average case performance is also$O(N)$ .
Sequential search is easy to implement but becomes inefficient for large datasets because the time required to find an element grows linearly with the size of the dataset.
We can implement the Symbol Table ADT using sequential search by storing the key-value pairs in a simple linear data structure like a linked list or an array. Using an unordered linked list is a common way to demonstrate this, as insertions can be relatively simple.
We can represent the Symbol Table as a singly-linked list where each node stores a key-value pair. The list does not need to be kept in any specific order of keys.
Node -> Node -> Node -> ... -> null
(key1, val1) (key2, val2) (key3, val3)
Each node typically has fields for the key, the value, and a pointer to the next node in the list. The Symbol Table object itself would hold a reference to the first node (the head or first node) and potentially the size.
-
get(key): To find the value for a givenkey, we start at thefirstnode and traverse the linked list sequentially. At each node, we compare the node's key with the searchkey. If a match is found, we return the associated value. If we reach the end of the list (null) without finding the key, the key is not present, and we returnnull. This is sequential search. -
put(key, value): To add or update a key-value pair, we first traverse the linked list sequentially to see if thekeyalready exists.- If the
keyis found, we update thevaluein that node. - If the
keyis not found after traversing the entire list, it's a new key. We create a new node with thekeyandvalueand add it to the list. A common simple approach is to add it to the beginning (prepend), updating thefirstpointer.
- If the
-
delete(key): To remove a key-value pair, we traverse the linked list sequentially. We need to keep track of the previous node as we traverse.- If the
keyis found in a node, we bypass that node by linking the previous node directly to the next node after the one being deleted. Special handling is needed if the node to be deleted is thefirstnode. - If the
keyis not found after traversing the list, nothing is deleted.
- If the
-
contains(key): Simply callget(key)and check if the result isnull. -
size(): Can be maintained in a separate variable and incremented onput(if new key) and decremented ondelete. Otherwise, requires traversing the list to count nodes$O(N)$ . -
isEmpty(): Check ifsize()is 0 orfirstisnull. -
keys(): Traverse the linked list and add each key to a list or collection to be returned as an iterator.
Based on the sequential search algorithm used for get, put, and delete:
-
get:$O(N)$ in the worst and average case, as traversing the list takes time proportional to its length. -
put:$O(N)$ in the worst and average case, as searching for the key takes$O(N)$ . Adding a new node at the front is$O(1)$ , but it's dominated by the search. -
delete:$O(N)$ in the worst and average case, due to the search. Relinking is$O(1)$ . -
size:$O(1)$ if a size variable is maintained,$O(N)$ otherwise. -
keys:$O(N)$ to build the iterable.
The SequentialSearchST is simple to implement but performs poorly for large numbers of key-value pairs for the most common operations (get, put, delete).
Binary search is a much more efficient search algorithm than sequential search, but it has a crucial requirement: it only works on a collection of data that is sorted.
The core idea is "divide and conquer":
- Start by examining the element in the middle of the sorted collection.
- Compare the middle element's key with the
targetkey. - If they match, the key is found.
- If the
targetkey is less than the middle element's key, you know the target must be in the left half of the collection (if it exists at all), because the data is sorted. You can eliminate the right half. - If the
targetkey is greater than the middle element's key, you know the target must be in the right half. You can eliminate the left half. - Repeat the process (steps 1-5) on the remaining half until the key is found or the search space is empty.
Why does the data must be sorted? Because without sorted data, knowing that the target is less than the middle element tells you nothing about which side it might be on. You couldn't eliminate half the search space effectively.
Let N be the number of elements in the collection.
With each comparison, binary search effectively cuts the search space in half.
- Step 1: Search space is
$N$ . Check middle. Search space becomes$N/2$ . - Step 2: Search space is
$N/2$ . Check middle. Search space becomes$N/4$ . - Step 3: Search space is
$N/4$ . Check middle. Search space becomes$N/8$ . ... and so on.
The number of steps required to reduce the search space to a single element (or determine it's not present) is the number of times you can divide N by 2 until you reach 1. This is the definition of a logarithm base 2.
The performance of binary search is
-
Best Case: The target element is the middle element on the first check.
$O(1)$ . -
Worst Case: The target element is found just before the search space becomes empty.
$O(log N)$ . -
Average Case:
$O(log N)$ .
Logarithmic growth (
To leverage binary search for a Symbol Table, we need a data structure that stores key-value pairs and can keep the keys in sorted order efficiently enough for search. A common way to demonstrate this is using parallel arrays where one array holds the keys and the other holds the corresponding values, and the keys array is strictly sorted.
We can use two arrays of the same size: keys[] and values[].
-
keys[i]holds the i-th key in the Symbol Table. -
values[i]holds the value associated withkeys[i]. - The crucial invariant is that the
keysarray must be kept in sorted order:keys[0] <= keys[1] <= ... <= keys[N-1].
We typically use only the first size elements of the arrays, where size is the current number of key-value pairs.
keys: [ key_0 | key_1 | key_2 | ... | key_{N-1} | ... ]
values: [ val_0 | val_1 | val_2 | ... | val_{N-1} | ... ]
Indices: 0 1 2 ... N-1
where key_0 <= key_1 <= ... <= key_{N-1}.
-
get(key): To find the value for a givenkey, we perform a binary search on the sortedkeysarray to find the indexiwherekeys[i]is equal to the searchkey. If found at indexi, we returnvalues[i]. If binary search determines the key is not present, we returnnull. This operation directly benefits from binary search's O(log N) performance. -
put(key, value): To add or update a key-value pair:- Use binary search (or an adaptation of it, like
rank) to find the correct indexiwhere thekeyshould be located in the sortedkeysarray. - If
keys[i]is already equal tokey, we simply updatevalues[i]with the newvalue. This is relatively fast,$O(log N)$ for the search +$O(1)$ for the update. - If
keyis not found (i.e., the indexiis where it should be inserted to maintain order), we need to make space for the new key-value pair at indexi. This involves shifting all elements from indexito the end of the array one position to the right. Then, we insert the newkeyatkeys[i]and the newvalueatvalues[i]. Shifting N-i elements can take up to$O(N)$ time in the worst case (when inserting at the beginning).
- Use binary search (or an adaptation of it, like
-
delete(key): To remove a key-value pair:- Use binary search to find the index
iof thekeyto be deleted.$O(log N)$ . - If the key is found at index
i, we need to remove it. This involves shifting all elements from indexi+1to the end of the array one position to the left to fill the gap. Shifting N-1-i elements can take up to O(N) time in the worst case (when deleting from the beginning).
- Use binary search to find the index
-
contains(key): Use binary search to find the key's index. If the index is valid andkeys[index]equals the key, returntrue, otherwisefalse.$O(log N)$ . -
size():$O(1)$ if the number of elements is tracked. -
isEmpty():$O(1)$ . -
keys():$O(N)$ to iterate through the elements in sorted order.
The sorted nature of the keys array makes the ordered Symbol Table operations very efficient:
-
min(): Returnkeys[0].$O(1)$ . -
max(): Returnkeys[size-1].$O(1)$ . -
floor(key),ceil(key): Use variants of binary search to find the appropriate index.$O(log N)$ . -
rank(key): Binary search can be adapted to return the count of keys less than the given key (the index where it would be inserted if not present, or its index if present).$O(log N)$ . -
select(k): Returnkeys[k].$O(1)$ (assuming k is a valid index). -
keys(low, high): Use binary search (rank) to find the starting index forlowand ending index forhigh. Then iterate through the array from the start index to the end index. Finding indices is$O(log N)$ , iterating through the range is O(M) where M is the number of keys in the range. Worst case$M=N$ , so$O(N)$ .
-
get:$O(log N)$ -
put:$O(N)$ (dominated by shifting) -
delete:$O(N)$ (dominated by shifting) -
size,isEmpty,min,max,select:$O(1)$ -
contains,floor,ceil,rank:$O(log N)$ -
keys,keys(low, high):$O(N)$ in the worst case for iteration.
The Ordered Array Symbol Table offers significantly faster search (get, contains) and ordered operations compared to SequentialSearchST. However, put and delete operations are slow because maintaining the sorted order in an array requires shifting elements.
| Operation | SequentialSearchST (Unordered Linked List) | Ordered Array ST (Sorted Arrays + Binary Search) |
|---|---|---|
put(key, val) |
||
get(key) |
||
delete(key) |
||
contains(key) |
||
size() |
|
|
isEmpty() |
||
min(), max()
|
||
floor(), ceil()
|
||
rank() |
||
select(k) |
||
keys() |
||
keys(low, high) |
|
- SequentialSearchST: Simple to implement, handles insertions/deletions anywhere in the list easily (once the position is found), but all core operations are slow ($O(N)$). Good for small datasets or when simplicity is paramount.
-
Ordered Array ST: Offers fast search (
$O(log N)$ ) and very fast ordered operations (min,max,select,rank,floor,ceil). However, maintaining the sorted order makes insertions and deletions slow ($O(N)$ ) due to the need for element shifting. Good for datasets where search and ordered access are frequent, but updates (put,delete) are infrequent.
Neither SequentialSearchST nor the Ordered Array ST provide uniformly efficient performance across all common Symbol Table operations for large datasets:
- SequentialSearchST is poor for
get,put, anddelete. - Ordered Array ST is poor for
putanddelete.
For many real-world applications, we need a data structure that can perform get, put, and delete operations efficiently, ideally close to
The limitations of these basic implementations motivate the exploration of more advanced data structures specifically designed to balance search, insertion, and deletion efficiency:
-
Binary Search Trees (BSTs): Structure data hierarchically based on key order, allowing O(log N) average-case performance for
get,put, anddelete. However, they can degrade to O(N) in the worst case (if not balanced). -
Balanced Binary Search Trees (e.g., Red-Black Trees, AVL Trees): Automatically maintain balance during insertions and deletions, guaranteeing
$O(log N)$ worst-case performance forget,put, anddelete. -
Hash Tables (using Hashing): Aim for
$O(1)$ average-case performance forget,put, anddeleteby using a hash function to directly compute an index into an array. Worst-case can be O(N) due to collisions, but good hash functions and collision resolution strategies make this rare.
These more advanced structures build upon the fundamental concepts of search and data organization introduced here, providing the efficient solutions required for large-scale data management.