BasicSearchAlgorithms

Basic Search Algorithms and Symbol Tables

1. Introduction: The Search Problem

1.1 What is the Search Problem?

At its core, the search problem is about finding specific information within a larger collection of data. We are given a collection of items and a criterion (often called a search key), and we want to locate the item(s) in the collection that match the key.

Often, the data we are searching for isn't just a single item identified by a key, but rather a key associated with a value. For example, in a phone book, you search for a person's name (the key) to find their phone number (the value). In a dictionary, you search for a word (the key) to find its definition (the value). This leads us naturally to consider search in the context of key-value pairs.

1.2 Why is Search Important?

Search is a fundamental operation in computer science and appears in almost every software application you use:

Databases: Finding specific records.
File Systems: Locating files by name.
Web Search Engines: Finding relevant web pages for a query.
Dictionaries and Maps: Storing and retrieving data associated with unique identifiers.
Compilers and Interpreters: Looking up variable names in symbol tables.
Networking: Routing data packets based on destination addresses.

The efficiency of search algorithms directly impacts the performance of these applications. A slow search can make an entire system feel sluggish. Therefore, understanding and implementing efficient search methods is crucial for any computer scientist.

1.3 Goal: Efficient Data Retrieval

The primary goal when studying search algorithms and data structures designed for searching is to achieve efficient data retrieval. What constitutes "efficient" often depends on the size of the data collection and the frequency of search operations versus other operations like adding or removing data. We will analyze the performance of different approaches using Big O notation to understand how their execution time scales with the size of the input data, N.

2. Symbol Tables: An Abstract Model for Key-Value Search

2.1 What is a Symbol Table (ST)?

A Symbol Table is an Abstract Data Type (ADT) that models the concept of a collection of key-value pairs. Think of it like a dictionary where each word (key) has a unique definition (value).

Keys are unique identifiers.
Each key is associated with exactly one value.
The primary operations involve using a key to interact with its associated value.

Symbol Tables are also known by other names like Map, Dictionary, or Associative Array in various programming languages and contexts.

2.2 Common Symbol Table API (General/Unordered ST)

A basic Symbol Table ADT provides the following fundamental operations:

put(key, value): Adds a new key-value pair to the table. If the key already exists, the old value is overwritten with the new value.
```
void put(Key key, Value value);
```
get(key): Returns the value associated with the given key. Returns null (or throws an error) if the key is not in the table.
```
Value get(Key key);
```
delete(key): Removes the key and its associated value from the table.
```
void delete(Key key);
```
contains(key): Returns true if the table contains the given key, false otherwise. Often implemented using get().
```
boolean contains(Key key);
```
size(): Returns the number of key-value pairs in the table.
```
int size();
```
isEmpty(): Returns true if the table is empty, false otherwise. Often implemented using size().
```
boolean isEmpty();
```
keys(): Returns an iterable collection of all keys in the table. The order of keys is not specified for a general Symbol Table.
```
Iterable<Key> keys();
```
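
The signatures above can be gathered into a single Java interface. This is only a sketch: the interface name `SymbolTable` is an assumption, and the default `contains` relies on the convention that `get` returns null for a missing key and that null is never stored as a value.

```java
/** Sketch of the unordered symbol-table API as a Java interface (name is hypothetical). */
interface SymbolTable<Key, Value> {
    void put(Key key, Value value);   // insert, or overwrite the value of an existing key
    Value get(Key key);               // assumed convention: null when the key is absent
    void delete(Key key);             // remove the key and its value, if present
    int size();                       // number of key-value pairs
    Iterable<Key> keys();             // all keys, in no particular order

    // contains() expressed through get(); only valid if null is never stored as a value.
    default boolean contains(Key key) {
        return get(key) != null;
    }

    // isEmpty() expressed through size().
    default boolean isEmpty() {
        return size() == 0;
    }
}
```

Any concrete implementation then only has to supply the five abstract methods; `contains` and `isEmpty` come for free.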

2.3 Ordered Symbol Table API (Additional Operations)

Some Symbol Table implementations maintain the keys in sorted order. This allows for additional useful operations beyond the basic ones, which are not possible (or not efficient) if the keys are not ordered.

Assuming keys are comparable (implement an interface like Java's Comparable), an Ordered Symbol Table API might include:

min(): Returns the smallest key in the table.
```
Key min();
```
max(): Returns the largest key in the table.
```
Key max();
```
floor(key): Returns the largest key in the table less than or equal to the given key.
```
Key floor(Key key);
```
ceil(key): Returns the smallest key in the table greater than or equal to the given key.
```
Key ceil(Key key);
```
rank(key): Returns the number of keys in the table strictly less than the given key. Equivalently, it is the index the key occupies (or would occupy, if absent) in sorted order.
```
int rank(Key key);
```

select(k): Returns the key of rank k, that is, the key that has exactly k keys smaller than it.
```
Key select(int k); // 0-indexed to match rank(): select(0) is the smallest key; some texts use 1-indexed
```

keys(low, high): Returns an iterable collection of all keys in the table within the specified range (low to high, inclusive).
```
Iterable<Key> keys(Key low, Key high);
```

These ordered operations demonstrate the power of keeping data in sorted order, which, as we will see, is a prerequisite for binary search.
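
To make the semantics of the ordered operations concrete, here is a small sketch using Java's standard `java.util.TreeMap`, which already offers most of them under different names (`firstKey`, `lastKey`, `floorKey`, `ceilingKey`). The class `OrderedOpsDemo`, its helper methods, and the sample keys are assumptions made for illustration. `TreeMap` has no direct `rank` or `select`, so `rank` is emulated with `headMap`; this shows what the operation means, not an efficient way to compute it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class OrderedOpsDemo {
    // Builds a small ordered table with keys 10, 20, 30, 40 (made-up sample data).
    static TreeMap<Integer, String> sample() {
        TreeMap<Integer, String> st = new TreeMap<>();
        st.put(30, "c");
        st.put(10, "a");
        st.put(40, "d");
        st.put(20, "b");
        return st;
    }

    // rank(key): number of keys strictly less than key.
    static int rank(TreeMap<Integer, String> st, int key) {
        return st.headMap(key).size();   // headMap(key) excludes key itself
    }

    // keys(low, high): all keys in [low, high], both ends inclusive, in sorted order.
    static List<Integer> keys(TreeMap<Integer, String> st, int low, int high) {
        return new ArrayList<>(st.subMap(low, true, high, true).keySet());
    }
}
```

With the sample table, `firstKey()` is 10 (min), `lastKey()` is 40 (max), `floorKey(25)` is 20, `ceilingKey(25)` is 30, `rank(st, 25)` is 2, and `keys(st, 15, 30)` is `[20, 30]`.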

3. Sequential Search

3.1 The Algorithm

Sequential search, also known as linear search, is the simplest search algorithm. It involves examining each element in a collection one by one, in sequence, starting from the beginning, until the target element is found or the end of the collection is reached.

Here's the basic idea to find a target key in a collection:

1. Start at the first element.
2. Compare the current element's key with the target key.
3. If they match, the key is found. Return the associated value.
4. If they don't match, move to the next element.
5. Repeat steps 2-4 until the key is found or there are no more elements to check.
6. If the end is reached without finding the key, it's not in the collection.

A key characteristic of sequential search is that it does not require the data to be in any specific order.
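
The steps of sequential search can be sketched in a few lines of Java. This is a minimal illustration over a plain array that returns the index of the match rather than an associated value; the class and method names are assumptions, and the array is assumed to contain no null elements.

```java
class SequentialSearch {
    // Returns the index of the first element equal to target, or -1 if it is absent.
    // The array may be in any order; elements are assumed to be non-null.
    static <T> int indexOf(T[] items, T target) {
        for (int i = 0; i < items.length; i++) {
            if (items[i].equals(target)) {
                return i;          // match: stop at the first hit
            }
        }
        return -1;                 // reached the end without a match
    }
}
```

For example, searching `{"pear", "apple", "fig"}` for `"fig"` examines all three elements before returning 2, while searching for `"kiwi"` also examines all three and returns -1.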

3.2 Performance Analysis (Big O)

Let N be the number of elements in the collection.

Best Case: The target element is the first element checked. This requires only 1 comparison. Performance is $O(1)$.
Worst Case: The target element is the last element checked, or the element is not present in the collection. This requires checking all N elements. Performance is $O(N)$.
Average Case: If the element is present and equally likely to be in any position, we expect to check about half the elements: $(N+1)/2$ comparisons on average. Big O notation ignores constant factors, so the average case performance is also $O(N)$.
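
The average-case figure can be checked with a short calculation, assuming a successful search in which each of the $N$ positions is equally likely. Finding the element at position $i$ costs $i$ comparisons, so the expected number of comparisons is:

$$\frac{1}{N}\sum_{i=1}^{N} i = \frac{1}{N}\cdot\frac{N(N+1)}{2} = \frac{N+1}{2}$$

This is roughly $N/2$ for large $N$, which is still linear in $N$.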

Sequential search is easy to implement but becomes inefficient for large datasets because the time required to find an element grows linearly with the size of the dataset.

4. Implementation using Sequential Search: SequentialSearchST

We can implement the Symbol Table ADT using sequential search by storing the key-value pairs in a simple linear data structure like a linked list or an array. Using an unordered linked list is a common way to demonstrate this, as insertions can be relatively simple.

4.1 Data Structure

We can represent the Symbol Table as a singly-linked list where each node stores a key-value pair. The list does not need to be kept in any specific order of keys.

Node -> Node -> Node -> ... -> null
(key1, val1) (key2, val2) (key3, val3)

Each node typically has fields for the key, the value, and a pointer to the next node in the list. The Symbol Table object itself holds a reference to the first node (often named head or first) and, optionally, the current size.

4.2 How ST Operations Work

get(key): To find the value for a given key, we start at the first node and traverse the linked list sequentially. At each node, we compare the node's key with the search key. If a match is found, we return the associated value. If we reach the end of the list (null) without finding the key, the key is not present, and we return null. This is sequential search.
put(key, value): To add or update a key-value pair, we first traverse the linked list sequentially to see if the key already exists.
- If the key is found, we update the value in that node.
- If the key is not found after traversing the entire list, it's a new key. We create a new node with the key and value and add it to the list. A common simple approach is to add it to the beginning (prepend), updating the first pointer.
delete(key): To remove a key-value pair, we traverse the linked list sequentially. We need to keep track of the previous node as we traverse.
- If the key is found in a node, we bypass that node by linking the previous node directly to the next node after the one being deleted. Special handling is needed if the node to be deleted is the first node.
- If the key is not found after traversing the list, nothing is deleted.
contains(key): Call get(key) and check whether the result is non-null. (This assumes null is never stored as a value.)
size(): Can be maintained in a separate variable that is incremented on put (if the key is new) and decremented on a successful delete. Otherwise, it requires traversing the list to count the nodes, which takes $O(N)$.
isEmpty(): Check if size() is 0 or first is null.
keys(): Traverse the linked list and add each key to a list or other collection, which is returned as an iterable.
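
The operations described above can be put together into a compact Java class. This is a sketch under a few assumptions, not a definitive implementation: keys are non-null and compared with `equals`, `get` returns null for a missing key, new nodes are prepended, and the size is tracked in a counter.

```java
import java.util.ArrayList;
import java.util.List;

class SequentialSearchST<Key, Value> {
    private Node first;   // head of the unordered linked list
    private int n;        // number of key-value pairs, kept up to date by put/delete

    private class Node {
        final Key key;
        Value val;
        Node next;
        Node(Key key, Value val, Node next) {
            this.key = key; this.val = val; this.next = next;
        }
    }

    // Sequential search: walk the list until the key matches or the list ends.
    public Value get(Key key) {
        for (Node x = first; x != null; x = x.next) {
            if (key.equals(x.key)) return x.val;
        }
        return null;
    }

    // Search first; update in place on a hit, otherwise prepend a new node.
    public void put(Key key, Value val) {
        for (Node x = first; x != null; x = x.next) {
            if (key.equals(x.key)) { x.val = val; return; }
        }
        first = new Node(key, val, first);
        n++;
    }

    // Track the previous node so the matching node can be bypassed.
    public void delete(Key key) {
        Node prev = null;
        for (Node x = first; x != null; prev = x, x = x.next) {
            if (key.equals(x.key)) {
                if (prev == null) first = x.next;   // deleting the head
                else prev.next = x.next;            // unlink an interior or last node
                n--;
                return;
            }
        }
    }

    public boolean contains(Key key) { return get(key) != null; }
    public int size() { return n; }
    public boolean isEmpty() { return n == 0; }

    public Iterable<Key> keys() {
        List<Key> out = new ArrayList<>();
        for (Node x = first; x != null; x = x.next) out.add(x.key);
        return out;
    }
}
```

Note that `put` always pays for a full traversal when the key is new, even though the actual insertion at the front is a constant-time pointer update.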

4.3 Performance Analysis of SequentialSearchST

Based on the sequential search algorithm used for get, put, and delete:

get: $O(N)$ in the worst and average case, as traversing the list takes time proportional to its length.
put: $O(N)$ in the worst and average case, as searching for the key takes $O(N)$. Adding a new node at the front is $O(1)$, but it's dominated by the search.
delete: $O(N)$ in the worst and average case, due to the search. Relinking is $O(1)$.
size: $O(1)$ if a size variable is maintained, $O(N)$ otherwise.
keys: $O(N)$ to build the iterable.

The SequentialSearchST is simple to implement but performs poorly for large numbers of key-value pairs for the most common operations (get, put, delete).

5. Binary Search

5.1 The Algorithm: Divide and Conquer

Binary search is a much more efficient search algorithm than sequential search, but it has a crucial requirement: it only works on a collection of data that is sorted.

The core idea is "divide and conquer":

1. Start by examining the element in the middle of the sorted collection.
2. Compare the middle element's key with the target key.
3. If they match, the key is found.
4. If the target key is less than the middle element's key, you know the target must be in the left half of the collection (if it exists at all), because the data is sorted. You can eliminate the right half.
5. If the target key is greater than the middle element's key, you know the target must be in the right half. You can eliminate the left half.
6. Repeat the process (steps 1-5) on the remaining half until the key is found or the search space is empty.

Why must the data be sorted? Because without sorted data, knowing that the target is less than the middle element tells you nothing about which side it might be on, so you could not safely eliminate half the search space.
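
The halving procedure can be sketched as an iterative loop in Java. This is a minimal illustration, assuming the array is already sorted in ascending order and contains no nulls; the class and method names are made up for this example.

```java
class BinarySearch {
    // Returns the index of target in the sorted array a, or -1 if it is absent.
    static <T extends Comparable<T>> int indexOf(T[] a, T target) {
        int lo = 0, hi = a.length - 1;          // current search space is a[lo..hi]
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;       // written this way to avoid int overflow of lo + hi
            int cmp = target.compareTo(a[mid]);
            if (cmp < 0) hi = mid - 1;          // target can only be in the left half
            else if (cmp > 0) lo = mid + 1;     // target can only be in the right half
            else return mid;                    // found
        }
        return -1;                              // search space is empty
    }
}
```

The loop ends either with a match or with `lo > hi`, which corresponds to the search space becoming empty.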

5.2 Performance Analysis (Big O)

Let N be the number of elements in the collection.

With each comparison, binary search effectively cuts the search space in half.

Step 1: Search space is $N$. Check middle. Search space becomes $N/2$.
Step 2: Search space is $N/2$. Check middle. Search space becomes $N/4$.
Step 3: Search space is $N/4$. Check middle. Search space becomes $N/8$. ... and so on.

The number of steps required to reduce the search space to a single element (or determine the key is not present) is the number of times you can divide N by 2 until you reach 1. That count is approximately $log_2 N$, which is exactly what the base-2 logarithm measures.
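
The halving argument can be written out as a short derivation, counting one three-way comparison per step. After $k$ steps, at most $N/2^k$ elements remain, and the search must stop once fewer than one element remains:

$$\frac{N}{2^k} < 1 \iff k > \log_2 N, \quad\text{so at most}\quad \lfloor \log_2 N \rfloor + 1 \text{ steps are needed.}$$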

The performance of binary search is $O(log N)$.

Best Case: The target element is the middle element on the first check. $O(1)$.
Worst Case: The target element is found only when the search space has shrunk to a single element, or it is not present at all. $O(log N)$.
Average Case: $O(log N)$.

Logarithmic growth ($log N$) is dramatically better than linear growth ($N$) for large $N$. For example, searching 1 million items: Sequential Search might take up to 1 million steps, while Binary Search takes around $log_2(1,000,000)$ ≈ 20 steps.

6. Implementation using Binary Search: Ordered Array Symbol Table

To leverage binary search for a Symbol Table, we need a data structure that stores key-value pairs and can keep the keys in sorted order efficiently enough for search. A common way to demonstrate this is using parallel arrays where one array holds the keys and the other holds the corresponding values, and the keys array is strictly sorted.

6.1 Data Structure

We can use two arrays of the same size: keys[] and values[].

keys[i] holds the i-th key in the Symbol Table.
values[i] holds the value associated with keys[i].
The crucial invariant is that the keys array must be kept in sorted order. Because keys are unique, the order is strict: keys[0] < keys[1] < ... < keys[N-1].

We typically use only the first size elements of the arrays, where size is the current number of key-value pairs.

keys:   [ key_0 | key_1 | key_2 | ... | key_{N-1} |       ...       ]
values: [ val_0 | val_1 | val_2 | ... | val_{N-1} |       ...       ]
Indices:    0       1       2   ...      N-1

where key_0 < key_1 < ... < key_{N-1}.

6.2 How ST Operations Work

get(key): To find the value for a given key, we perform a binary search on the sorted keys array to find the index i where keys[i] is equal to the search key. If found at index i, we return values[i]. If binary search determines the key is not present, we return null. This operation directly benefits from binary search's $O(log N)$ performance.
put(key, value): To add or update a key-value pair:
1. Use binary search (or an adaptation of it, like rank) to find the correct index i where the key should be located in the sorted keys array.
2. If keys[i] is already equal to key, we simply update values[i] with the new value. This is relatively fast, $O(log N)$ for the search + $O(1)$ for the update.
3. If key is not found (i.e., the index i is where it should be inserted to maintain order), we need to make space for the new key-value pair at index i. This involves shifting all elements from index i to the end of the array one position to the right. Then, we insert the new key at keys[i] and the new value at values[i]. Shifting N-i elements can take up to $O(N)$ time in the worst case (when inserting at the beginning).
delete(key): To remove a key-value pair:
1. Use binary search to find the index i of the key to be deleted. $O(log N)$.
2. If the key is found at index i, we need to remove it. This involves shifting all elements from index i+1 to the end of the array one position to the left to fill the gap. Shifting N-1-i elements can take up to $O(N)$ time in the worst case (when deleting from the beginning).
contains(key): Use binary search to find the key's index. If the index is valid and keys[index] equals the key, return true, otherwise false. $O(log N)$.
size(): $O(1)$ if the number of elements is tracked.
isEmpty(): $O(1)$.
keys(): $O(N)$ to iterate through the elements in sorted order.
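
The parallel-array design and its operations can be sketched in Java as follows. Several details are assumptions added for the sake of a runnable example: the class name `BinarySearchST`, the array-doubling growth policy, the use of `System.arraycopy` for shifting, and the convention that `get` returns null for a missing key.

```java
import java.util.Arrays;

class BinarySearchST<Key extends Comparable<Key>, Value> {
    private Key[] keys;     // sorted keys occupy keys[0..n)
    private Value[] vals;   // vals[i] belongs to keys[i]
    private int n;          // number of key-value pairs

    @SuppressWarnings("unchecked")
    BinarySearchST(int capacity) {
        keys = (Key[]) new Comparable[Math.max(1, capacity)];
        vals = (Value[]) new Object[Math.max(1, capacity)];
    }

    // Binary search: returns the number of keys strictly less than key,
    // which is also the index where key is stored or would be inserted.
    public int rank(Key key) {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int cmp = key.compareTo(keys[mid]);
            if (cmp < 0) hi = mid - 1;
            else if (cmp > 0) lo = mid + 1;
            else return mid;
        }
        return lo;
    }

    public Value get(Key key) {
        int i = rank(key);
        return (i < n && keys[i].compareTo(key) == 0) ? vals[i] : null;
    }

    public void put(Key key, Value val) {
        int i = rank(key);
        if (i < n && keys[i].compareTo(key) == 0) { vals[i] = val; return; }  // update in place
        if (n == keys.length) {                     // arrays are full: double the capacity
            keys = Arrays.copyOf(keys, 2 * n);
            vals = Arrays.copyOf(vals, 2 * n);
        }
        System.arraycopy(keys, i, keys, i + 1, n - i);   // shift keys[i..n) one slot right
        System.arraycopy(vals, i, vals, i + 1, n - i);
        keys[i] = key;
        vals[i] = val;
        n++;
    }

    public void delete(Key key) {
        int i = rank(key);
        if (i >= n || keys[i].compareTo(key) != 0) return;       // key not present
        System.arraycopy(keys, i + 1, keys, i, n - 1 - i);       // shift left to close the gap
        System.arraycopy(vals, i + 1, vals, i, n - 1 - i);
        n--;
        keys[n] = null;                                          // clear the now-unused slot
        vals[n] = null;
    }

    public boolean contains(Key key) { return get(key) != null; }
    public int size() { return n; }
    public boolean isEmpty() { return n == 0; }
    public Iterable<Key> keys() { return Arrays.asList(Arrays.copyOf(keys, n)); }
}
```

The two `System.arraycopy` calls in `put` and `delete` are where the linear cost comes from: the search itself is logarithmic, but up to n elements may have to move.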

6.3 Implementing Ordered ST Operations Efficiently

The sorted nature of the keys array makes the ordered Symbol Table operations very efficient:

min(): Return keys[0]. $O(1)$.
max(): Return keys[size-1]. $O(1)$.
floor(key), ceil(key): Use variants of binary search to find the appropriate index. $O(log N)$.
rank(key): Binary search can be adapted to return the count of keys less than the given key (the index where it would be inserted if not present, or its index if present). $O(log N)$.
select(k): Return keys[k]. $O(1)$ (assuming k is a valid index).
keys(low, high): Use binary search (rank) to find the starting index for low and ending index for high. Then iterate through the array from the start index to the end index. Finding indices is $O(log N)$, iterating through the range is $O(M)$ where M is the number of keys in the range. Worst case $M=N$, so $O(N)$.
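
The binary-search variants mentioned above can all be built on top of rank. The following Java sketch works directly on a fully populated sorted array of distinct keys; the class name `OrderedArrayOps` and the choice to return null when no floor or ceiling exists are assumptions. For brevity, the range query finds its start with rank and then stops at the first key greater than high, instead of computing the end index with a second binary search.

```java
import java.util.ArrayList;
import java.util.List;

class OrderedArrayOps {
    // rank: number of elements in the sorted array a that are strictly less than key.
    static <T extends Comparable<T>> int rank(T[] a, T key) {
        int lo = 0, hi = a.length;              // the answer always lies in [lo, hi]
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (a[mid].compareTo(key) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // floor: largest element <= key, or null if every element is greater than key.
    static <T extends Comparable<T>> T floor(T[] a, T key) {
        int i = rank(a, key);
        if (i < a.length && a[i].compareTo(key) == 0) return a[i];  // key itself is present
        return i == 0 ? null : a[i - 1];
    }

    // ceil: smallest element >= key, or null if every element is less than key.
    static <T extends Comparable<T>> T ceil(T[] a, T key) {
        int i = rank(a, key);
        return i == a.length ? null : a[i];
    }

    // keys(low, high): all elements in [low, high], both ends inclusive.
    static <T extends Comparable<T>> List<T> keys(T[] a, T low, T high) {
        List<T> out = new ArrayList<>();
        for (int i = rank(a, low); i < a.length && a[i].compareTo(high) <= 0; i++) {
            out.add(a[i]);
        }
        return out;
    }
}
```

On the sorted array `{10, 20, 30, 40}`, `rank(a, 25)` is 2, `floor(a, 25)` is 20, `ceil(a, 25)` is 30, and `keys(a, 15, 30)` is `[20, 30]`.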

6.4 Performance Analysis of Ordered Array ST

get: $O(log N)$
put: $O(N)$ (dominated by shifting)
delete: $O(N)$ (dominated by shifting)
size, isEmpty, min, max, select: $O(1)$
contains, floor, ceil, rank: $O(log N)$
keys, keys(low, high): $O(N)$ in the worst case for iteration.

The Ordered Array Symbol Table offers significantly faster search (get, contains) and ordered operations compared to SequentialSearchST. However, put and delete operations are slow because maintaining the sorted order in an array requires shifting elements.

7. Comparison and Conclusions

7.1 Performance Summary Table

| Operation | SequentialSearchST (Unordered Linked List) | Ordered Array ST (Sorted Arrays + Binary Search) |
|---|---|---|
| `put(key, val)` | $O(N)$ | $O(N)$ |
| `get(key)` | $O(N)$ | $O(log N)$ |
| `delete(key)` | $O(N)$ | $O(N)$ |
| `contains(key)` | $O(N)$ | $O(log N)$ |
| `size()` | $O(1)$ (if tracked) | $O(1)$ |
| `isEmpty()` | $O(1)$ | $O(1)$ |
| `min()`, `max()` | $O(N)$ | $O(1)$ |
| `floor()`, `ceil()` | $O(N)$ | $O(log N)$ |
| `rank()` | $O(N)$ | $O(log N)$ |
| `select(k)` | $O(N)$ | $O(1)$ |
| `keys()` | $O(N)$ | $O(N)$ |
| `keys(low, high)` | $O(N)$ | $O(N)$ (worst case for iteration) |

7.2 Trade-offs

SequentialSearchST: Simple to implement, handles insertions/deletions anywhere in the list easily (once the position is found), but all core operations are slow ($O(N)$). Good for small datasets or when simplicity is paramount.
Ordered Array ST: Offers fast search ( $O(log N)$ ) and very fast ordered operations (min, max, select, rank, floor, ceil). However, maintaining the sorted order makes insertions and deletions slow ( $O(N)$ ) due to the need for element shifting. Good for datasets where search and ordered access are frequent, but updates (put, delete) are infrequent.

7.3 Limitations of these Basic Implementations

Neither SequentialSearchST nor the Ordered Array ST provides uniformly efficient performance across all common Symbol Table operations for large datasets:

SequentialSearchST is poor for get, put, and delete.
Ordered Array ST is poor for put and delete.

For many real-world applications, we need a data structure that can perform get, put, and delete operations efficiently, ideally close to $O(log N)$ or even $O(1)$ on average.

7.4 Motivation for More Advanced Search Structures

The limitations of these basic implementations motivate the exploration of more advanced data structures specifically designed to balance search, insertion, and deletion efficiency:

Binary Search Trees (BSTs): Structure data hierarchically based on key order, allowing $O(log N)$ average-case performance for get, put, and delete. However, they can degrade to $O(N)$ in the worst case (if not balanced).
Balanced Binary Search Trees (e.g., Red-Black Trees, AVL Trees): Automatically maintain balance during insertions and deletions, guaranteeing $O(log N)$ worst-case performance for get, put, and delete.
Hash Tables (using Hashing): Aim for $O(1)$ average-case performance for get, put, and delete by using a hash function to directly compute an index into an array. Worst-case can be $O(N)$ due to collisions, but good hash functions and collision resolution strategies make this rare.

These more advanced structures build upon the fundamental concepts of search and data organization introduced here, providing the efficient solutions required for large-scale data management.
