
BinarySearchTrees

Jorge Londoño edited this page May 29, 2025 · 1 revision

Binary Search Trees and Balanced Variants

1. Introduction to Search Trees

In computer science, a fundamental problem is efficiently storing, searching for, and managing ordered data. We often need to perform operations like:

  • Finding a specific item (search).
  • Adding a new item (insertion).
  • Removing an existing item (deletion).
  • Finding the minimum or maximum item.
  • Finding items within a specific range.

Consider common data structures for this task:

  • Arrays: If sorted, searching is very efficient ($O(\log N)$ using binary search). However, inserting or deleting an element in the middle requires shifting subsequent elements, which is slow ($O(N)$). If unsorted, insertion is $O(1)$, but search is $O(N)$.
  • Linked Lists: Insertion and deletion are efficient ($O(1)$ once you have a pointer to the node). But searching requires traversing the list from the beginning, which is slow ($O(N)$).

Neither arrays nor linked lists offer a good balance for all operations when the data set is dynamic (items are frequently added or removed) and needs to remain ordered for efficient searching.

This is where trees come into play. Tree structures, particularly Binary Search Trees (BSTs) and their variants, are designed to provide a better trade-off, aiming for logarithmic time complexity ($O(\log N)$) for search, insertion, and deletion on average.

Binary Search Trees are foundational data structures used extensively in various applications, including:

  • Implementing maps (key-value stores) and sets.
  • Symbol tables in compilers and interpreters.
  • Databases (for indexing data).
  • Implementing other algorithms and data structures.

Understanding BSTs and their balanced counterparts is crucial for any computer science student.

2. Binary Search Trees (BSTs): The Basics

A Binary Search Tree (BST) is a binary tree where each node has a key, and potentially an associated value. The defining characteristic of a BST is the Binary Search Tree Property:

  • For any given node, the keys in its left subtree are strictly less than the key in the node.
  • For any given node, the keys in its right subtree are strictly greater than the key in the node.

(Note: Some definitions allow equality on one side, but the strict inequality version is common and simplifies some operations).

The structure of a node in a BST typically includes:

  • The key itself.
  • The value associated with the key.
  • A reference (or pointer) to the left child node.
  • A reference (or pointer) to the right child node.

Here's a simple Python class representing a BST node:

class Node:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.left = None
        self.right = None

A BST is then represented by a reference to its root node. If the tree is empty, the root is None.

Let's look at the fundamental operations on a BST. These operations recursively (or iteratively) traverse the tree, leveraging the BST property to efficiently locate nodes.

Core Operations

get(key): Searching for a key

To find the value associated with a given key, we start at the root and compare the target key with the current node's key:

  • If the keys match, we've found the node.
  • If the target key is less than the current node's key, we move to the left child.
  • If the target key is greater than the current node's key, we move to the right child.
  • If we reach a None child pointer, the key is not in the tree.

def get(root, key):
    """Searches for a key in the BST."""
    if root is None:
        return None # Key not found

    if key == root.key:
        return root.value
    elif key < root.key:
        return get(root.left, key) # Search left subtree
    else: # key > root.key
        return get(root.right, key) # Search right subtree

put(key, value): Inserting a key-value pair

To insert a new key-value pair, we search for the key as in get. If the key is found, we update the associated value. If the key is not found, the search ends at a None pointer. This None pointer indicates the correct position to insert the new node while maintaining the BST property.

def put(root, key, value):
    """Inserts a key-value pair or updates value if key exists."""
    if root is None:
        # Create a new node at this position
        return Node(key, value)

    if key == root.key:
        root.value = value # Update value for existing key
    elif key < root.key:
        root.left = put(root.left, key, value) # Insert in left subtree
    else: # key > root.key
        root.right = put(root.right, key, value) # Insert in right subtree

    return root # Return the potentially modified subtree root

(Note: The put function above returns the root of the modified subtree. A wrapper method would typically handle the initial call root = put(root, key, value)).
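
A minimal wrapper class makes that pattern concrete. The sketch below is illustrative (the class name `BST` is an assumption, and `get` is written iteratively); it simply re-roots the tree after every update:

```python
class Node:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.left = None
        self.right = None

class BST:
    """Thin wrapper holding the root; updates re-assign it."""
    def __init__(self):
        self.root = None

    def put(self, key, value):
        self.root = self._put(self.root, key, value)

    def _put(self, node, key, value):
        if node is None:
            return Node(key, value)        # new leaf at the vacant position
        if key == node.key:
            node.value = value             # update existing key
        elif key < node.key:
            node.left = self._put(node.left, key, value)
        else:
            node.right = self._put(node.right, key, value)
        return node

    def get(self, key):
        node = self.root
        while node is not None:            # iterative search down one path
            if key == node.key:
                return node.value
            node = node.left if key < node.key else node.right
        return None                        # key not present
```

A `delete` method would follow the same pattern, delegating to the recursive function and re-assigning `self.root`.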

delete(key): Removing a key-value pair

Deletion is the most complex BST operation because simply removing a node might disconnect its subtrees. We need to find a replacement node that can take its place while preserving the BST property.

The deletion process involves finding the node to delete and then handling three cases:

  1. Node has no children (is a leaf): Simply remove the node by setting the parent's child pointer to None.
  2. Node has one child: Replace the node with its only child. The child's subtree then gets connected directly to the deleted node's parent.
  3. Node has two children: This is the tricky case. We cannot simply remove the node. We must find a replacement node from one of its subtrees that can legally occupy its position. The standard approach is to find the in-order successor (the node with the smallest key in the right subtree) or the in-order predecessor (the node with the largest key in the left subtree).
    • Let's choose the in-order successor. This node is the leftmost node in the right subtree. It's guaranteed to have no left child (otherwise, that left child would be the true successor).
    • We copy the successor's key and value into the node marked for deletion.
    • Then, we recursively delete the successor node from the right subtree (which is easier because the successor has at most one child - its right child).

# Helper to find minimum node (used by delete)
def find_min_node(node):
    current = node
    while current.left is not None:
        current = current.left
    return current

def delete(root, key):
    """Deletes a key from the BST."""
    if root is None:
        return None

    if key < root.key:
        root.left = delete(root.left, key)
    elif key > root.key:
        root.right = delete(root.right, key)
    else: # key == root.key - This is the node to delete
        # Case 1: No child or One child
        if root.left is None:
            # Replace with right child (could be None)
            return root.right
        elif root.right is None:
            # Replace with left child
            return root.left
        else:
            # Case 3: Two children
            # Find the in-order successor (smallest in the right subtree)
            successor = find_min_node(root.right)

            # Copy the successor's content to this node
            root.key = successor.key
            root.value = successor.value

            # Delete the successor node from the right subtree
            root.right = delete(root.right, successor.key)

    return root # Return the potentially modified subtree root

(Again, a wrapper would call root = delete(root, key) initially).

Ordered Operations

These operations leverage the sorted nature provided by the BST property.

  • min(): Find the minimum key. Start at the root and repeatedly go to the left child until you reach a node with no left child.
  • max(): Find the maximum key. Start at the root and repeatedly go to the right child until you reach a node with no right child.
  • floor(key): Find the largest key less than or equal to key. Search for key: if it is found, that key is the floor. If not, the floor is the key of the last node on the search path whose key was less than key (i.e., the last node where the search went right); if no such node exists, key has no floor in the tree.
  • ceil(key): Find the smallest key greater than or equal to key. Symmetric to floor.
  • rank(key): The number of keys in the tree smaller than key. This involves traversing the tree and counting nodes in left subtrees.
  • select(k): Find the key of rank k (the k-th smallest key). This involves traversing the tree based on subtree sizes.
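
The first few of these can be sketched directly on the node structure. Below is a hedged sketch of min, max, and floor (the `insert` helper exists only to build a test tree; rank and select would additionally need a cached subtree size in each node, which is omitted here):

```python
class Node:
    def __init__(self, key, value=None):
        self.key = key
        self.value = value
        self.left = None
        self.right = None

def insert(root, key):
    """Minimal BST insert used to build example trees."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def tree_min(root):
    """Smallest key: follow left children to the end."""
    while root.left is not None:
        root = root.left
    return root.key

def tree_max(root):
    """Largest key: follow right children to the end."""
    while root.right is not None:
        root = root.right
    return root.key

def floor_key(root, key):
    """Largest key <= key, or None if every key is greater."""
    best = None
    while root is not None:
        if key == root.key:
            return root.key
        if key < root.key:
            root = root.left       # a floor, if any, is in the left subtree
        else:
            best = root.key        # candidate; a larger one may be on the right
            root = root.right
    return best
```

`ceil` is the mirror image: track the last node where the search went left.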

Traversal Operations

  • keys(): Iterate through all keys in sorted order. This is achieved using an in-order traversal: recursively traverse the left subtree, visit the current node, then recursively traverse the right subtree.
def inorder_traversal(root):
    """Generates keys in sorted order."""
    if root is not None:
        yield from inorder_traversal(root.left)
        yield root.key
        yield from inorder_traversal(root.right)

# Example usage:
# for key in inorder_traversal(my_bst_root):
#     print(key)

  • Range Search (keys(low, high)): Find all keys within a specified range [low, high]. This is a modification of in-order traversal, where you only recurse into subtrees if they might contain keys within the range and only yield keys that fall within the range.
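
Such a pruned in-order traversal can be sketched as follows (the `insert` helper is only for building a tree; `keys_in_range` is an illustrative name):

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Minimal BST insert used to build example trees."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def keys_in_range(root, low, high):
    """Yield keys in [low, high] in sorted order, skipping subtrees
    that cannot contain keys in the range."""
    if root is None:
        return
    if low < root.key:                 # left subtree may hold keys >= low
        yield from keys_in_range(root.left, low, high)
    if low <= root.key <= high:
        yield root.key
    if root.key < high:                # right subtree may hold keys <= high
        yield from keys_in_range(root.right, low, high)
```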

Utility Operations

  • size(): The total number of nodes in the tree. Can be stored and updated during put and delete, or computed recursively.
  • isEmpty(): Check if the tree is empty (root is None).
  • height(): The length of the longest path from the root to a leaf. A single node tree has height 0. An empty tree has height -1 (by convention).
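
When these values are not cached in the nodes, both can be computed recursively; a sketch following the conventions above (empty tree has height -1):

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def size(root):
    """Total number of nodes; O(N) when not cached in the nodes."""
    if root is None:
        return 0
    return 1 + size(root.left) + size(root.right)

def height(root):
    """Length of the longest root-to-leaf path; -1 for an empty tree,
    0 for a single node."""
    if root is None:
        return -1
    return 1 + max(height(root.left), height(root.right))
```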

Performance Analysis

The performance of most BST operations depends directly on the height of the tree.

  • get, put, delete, min, max, floor, ceil: In the worst case, these operations might need to traverse a path from the root to a leaf. Their time complexity is $O(height)$.
  • rank, select: These also traverse paths based on key comparisons and subtree sizes, resulting in $O(height)$ complexity.
  • keys() (In-order Traversal): Visits every node exactly once. Time complexity is $O(N)$, but since it must visit N nodes, it's efficient relative to the size of the output.
  • Range Search: Visits nodes within the range plus some nodes outside the range. Time complexity is $O(height + k)$, where k is the number of keys in the range.

The height of a BST is not fixed for a given number of nodes (N).

  • Worst-Case Height: A BST can become skewed if keys are inserted in sorted (ascending or descending) order. In this case, the tree degenerates into a structure similar to a linked list, with height N−1, i.e., linear in N.

    1 -> 2 -> 3 -> 4 -> 5
    

    (Conceptually, nodes only have right children, or only left children)

    In this worst-case scenario, get, put, and delete operations take $O(N)$ time, which is no better than a linked list or an unsorted array for searching.

  • Average-Case Height: If keys are inserted in a random order, the tree tends to be relatively balanced. In this case, the height is proportional to $\log N$.

          4
         / \
        2   6
       / \ / \
      1  3 5  7
    

    (A perfectly balanced tree)

    In the average case, get, put, and delete operations take $O(\log N)$ time.
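
The gap between the two cases is easy to observe empirically. The sketch below (illustrative helper names; an iterative insert and a level-order height computation are used to avoid deep recursion on the skewed tree) builds one tree from sorted keys and one from shuffled keys:

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert_iter(root, key):
    """Iterative BST insert (safe for deeply skewed trees)."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        elif key > node.key:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right
        else:
            return root            # duplicate: nothing to do

def height_iter(root):
    """Height via level-order traversal; -1 for an empty tree."""
    h, level = -1, ([root] if root else [])
    while level:
        h += 1
        level = [c for n in level for c in (n.left, n.right) if c]
    return h

def build(keys):
    root = None
    for k in keys:
        root = insert_iter(root, k)
    return root

keys = list(range(200))
skewed = build(keys)               # sorted insertion: height 199
shuffled = keys[:]
random.shuffle(shuffled)
random_tree = build(shuffled)      # random insertion: height near log N
```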

Summary of BST Performance:

| Operation | Average Case | Worst Case |
|---|---|---|
| get(key) | $O(\log N)$ | $O(N)$ |
| put(key, value) | $O(\log N)$ | $O(N)$ |
| delete(key) | $O(\log N)$ | $O(N)$ |
| min(), max() | $O(\log N)$ | $O(N)$ |
| floor(), ceil() | $O(\log N)$ | $O(N)$ |
| rank(), select() | $O(\log N)$ | $O(N)$ |
| keys() (traversal) | $O(N)$ | $O(N)$ |
| keys(low, high) | $O(height + k)$ | $O(N + k)$ |
| size() | $O(1)$ if tracked, else $O(N)$ | $O(1)$ if tracked, else $O(N)$ |
| isEmpty() | $O(1)$ | $O(1)$ |
| height() (computed recursively) | $O(N)$ | $O(N)$ |

The significant drawback of the basic BST is its poor worst-case performance, which occurs if the tree becomes highly unbalanced. For applications where predictable performance is critical, or where insertion order is not random, a basic BST is insufficient.

3. The Need for Balanced Trees

As we saw, the efficiency of BST operations depends directly on the tree's height. While the average case offers excellent $O(\log N)$ performance, the worst case degrades to $O(N)$, which for a dynamic workload is no better than simpler structures such as sorted arrays (whose search is already $O(\log N)$).

To guarantee $O(\log N)$ performance for all main operations regardless of the insertion/deletion order, we need to ensure the tree remains reasonably balanced. Self-balancing Binary Search Trees are data structures that automatically perform operations (like rotations and color changes) during insertions and deletions to maintain a bounded height, thereby guaranteeing logarithmic time complexity for all operations.

Two prominent types of self-balancing BSTs are Red-Black Trees and AVL Trees.

4. Red-Black Trees

Red-Black Trees are a type of self-balancing BST that guarantees logarithmic time complexity for insertions, deletions, and searches. They achieve balance by enforcing a set of properties on the nodes' "color" (either Red or Black). The balance is not perfect, but it's sufficient to limit the height to $O(\log N)$.

Representation

Each node in a Red-Black tree stores:

  • key and value.
  • left and right child pointers.
  • A color attribute, which is either Red or Black.

Often, a special sentinel node (representing NIL or null) is used for all leaves and for the missing children of nodes with fewer than two children. This sentinel node is always Black. The root's parent is also typically taken to be the black sentinel.

Core Properties of a Red-Black Tree

These five properties collectively guarantee that the longest path from the root to any leaf is at most twice the length of the shortest path, thereby ensuring the height is logarithmic.

  1. Every node is either Red or Black.
  2. The root is Black. (Some definitions allow a red root, but it can always be recolored black without violating other properties).
  3. All leaves (NIL nodes) are Black.
  4. If a node is Red, then both its children are Black. (No two consecutive Red nodes on any path from root to leaf).
  5. For each node, all simple paths from the node to descendant leaves contain the same number of Black nodes. This is called the Black Height property. The black height of a node is the number of black nodes on any path from that node down to a leaf (not including the node itself).
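
These properties can be checked mechanically, which is handy when testing an implementation. A hedged sketch (`RBNode` and `check_rb` are illustrative names; `None` stands in for the black NIL sentinel):

```python
RED, BLACK = "red", "black"

class RBNode:
    def __init__(self, key, color=RED):
        self.key = key
        self.color = color
        self.left = None
        self.right = None

def check_rb(node):
    """Return the black height of the subtree rooted at node, or raise
    ValueError if a Red-Black property is violated. A None child counts
    as a black NIL leaf (property 3)."""
    if node is None:
        return 1                       # NIL leaf contributes one black node
    if node.color == RED:              # property 4: red node has black children
        for child in (node.left, node.right):
            if child is not None and child.color == RED:
                raise ValueError("red node with red child")
    lbh = check_rb(node.left)
    rbh = check_rb(node.right)
    if lbh != rbh:                     # property 5: equal black heights
        raise ValueError("unequal black heights")
    return lbh + (1 if node.color == BLACK else 0)
```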

Balancing Operations

When a standard BST insertion or deletion would violate one of these properties, the tree is rebalanced using two types of operations:

  1. Color Changes (Recoloring): Changing the color of nodes.

  2. Rotations: Structural changes that rearrange nodes while preserving the BST property. The two basic rotations are Left Rotation and Right Rotation.

    • Left Rotation: Pivots around a node x and its right child y. y becomes the new root of the subtree, x becomes y's left child, and y's original left child becomes x's right child.

            x                   y
           / \                 / \
          A   y       ->      x   C
             / \             / \
            B   C           A   B
      

      (A is x's left child, B is y's left child, C is y's right child)

    • Right Rotation: Symmetric to Left Rotation, pivots around a node y and its left child x. x becomes the new root of the subtree, y becomes x's right child, and x's original right child becomes y's left child.

            y                   x
           / \                 / \
          x   C       ->      A   y
         / \                     / \
        A   B                   B   C
      

      (A is x's left child, B is x's right child, C is y's right child)
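
Both rotations amount to a couple of local pointer updates. A sketch on a plain node class (color bookkeeping omitted for clarity):

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def rotate_left(x):
    """Pivot around x and its right child y; return the new subtree root y."""
    y = x.right
    x.right = y.left               # B (y's old left subtree) becomes x's right child
    y.left = x
    return y

def rotate_right(y):
    """Mirror image: pivot around y and its left child x."""
    x = y.left
    y.left = x.right               # B (x's old right subtree) becomes y's left child
    x.right = y
    return x
```

The caller must re-attach the returned root to the rotated subtree's parent, which is why both functions return the new root.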

Insertion (put)

  1. Insert the new node just like in a regular BST, and color it Red. (Inserting it Red makes it easier to fix potential property violations; coloring it Black would violate the Black Height property, except when the tree was empty).
  2. If the parent is Black, property 4 is not violated, and we are done.
  3. If the parent is Red, property 4 is violated (two consecutive Red nodes). This requires fix-up procedures involving checking the color of the "uncle" node (the sibling of the parent).
    • Case 1: Uncle is Red. Recolor the parent, uncle, and grandparent. Then repeat the check from the grandparent's position.
    • Case 2: Uncle is Black (or NIL). This requires rotations and recoloring depending on whether the new node, its parent, and grandparent form a "line" or a "triangle".

These fix-up procedures are designed to restore the Red-Black properties while only propagating changes upwards towards the root.

Deletion (delete)

Deletion is significantly more complex than insertion in Red-Black trees.

  1. Find the node to delete. If it has two children, find its in-order successor (which has at most one child), copy its data to the node to be deleted, and delete the successor instead.
  2. The node actually removed from the tree will have at most one child. Let's call the removed node v and its child c.
  3. If v was Red, simply remove it. No properties are violated.
  4. If v was Black, removing it decreases the black height on paths passing through it, violating property 5. This requires fix-up procedures. The fix-ups are complex and involve identifying cases based on the color of the node c and its siblings/nephews, using combinations of rotations and recoloring to restore the Black Height property.

Performance Analysis

The Red-Black tree properties guarantee that the height of the tree is always logarithmic, specifically, $h \le 2 \log_2(N+1)$. Because all basic operations (get, put, delete, ordered operations) involve traversing a path from the root or near the root, their time complexity is bounded by the height.

  • get(key): $O(\log N)$
  • put(key, value): $O(\log N)$ (insertion and fix-ups take logarithmic time).
  • delete(key): $O(\log N)$ (deletion and fix-ups take logarithmic time).
  • Ordered Operations (min, max, floor, ceil, rank, select): $O(\log N)$.
  • keys() (Traversal) and Range Search: $O(N)$ and $O(\log N + k)$ respectively, since the height is guaranteed logarithmic.

Key Takeaway: Red-Black trees provide a guaranteed $O(\log N)$ time complexity for all fundamental dynamic set operations, unlike basic BSTs. They are a standard choice for implementing balanced maps and sets in many programming language libraries because they offer a good balance between search performance and the overhead of maintaining balance during updates.

5. AVL Trees

AVL Trees, named after their inventors Adelson-Velsky and Landis, were the first self-balancing BSTs. They maintain a stricter balance condition than Red-Black trees.

Representation

Each node in an AVL tree stores:

  • key and value.
  • left and right child pointers.
  • A balance factor, which is the height of its left subtree minus the height of its right subtree.

AVL Property

For every node in an AVL tree, the balance factor must be -1, 0, or +1.

  • Balance Factor = height(left_subtree) - height(right_subtree)
  • AVL Property: -1 <= balance_factor <= 1 for all nodes.

This strict property guarantees that the height of an AVL tree with N nodes is very close to the theoretical minimum for a binary tree, approximately $h \approx 1.44 \log_2(N+2)$. This is a tighter bound on height than for Red-Black trees.

Balancing Operations

If an insertion or deletion causes a node's balance factor to become +2 or -2, the tree rooted at that node becomes unbalanced according to the AVL property. The balance must be restored using rotations. The rotations are the same Left and Right rotations used in Red-Black trees.

There are four patterns of imbalance (based on where the problematic node was inserted/deleted and the shape of the imbalance below it) requiring different sequences of rotations:

  1. LL (Left-Left) Case: A node becomes unbalanced (+2), and the imbalance is in the left child's left subtree. Requires a single Right Rotation around the unbalanced node.

          Z (+2)            Y
         /                 / \
        Y (+1)     ->     X   Z
       /
      X
    
  2. RR (Right-Right) Case: A node becomes unbalanced (-2), and the imbalance is in the right child's right subtree. Requires a single Left Rotation around the unbalanced node.

        Z (-2)              Y
         \                 / \
          Y (-1)   ->     Z   X
           \
            X
    
  3. LR (Left-Right) Case: A node becomes unbalanced (+2), and the imbalance is in the left child's right subtree. Requires a Left Rotation on the left child, followed by a Right Rotation on the unbalanced node.

          Z (+2)            Z (+2)            X
         /                 /                 / \
        Y (-1)     ->     X (+1)    ->      Y   Z
         \               /
          X             Y
    

    (Simplified diagram; subtrees omitted for clarity)

  4. RL (Right-Left) Case: A node becomes unbalanced (-2), and the imbalance is in the right child's left subtree. Requires a Right Rotation on the right child, followed by a Left Rotation on the unbalanced node.

        Z (-2)            Z (-2)            X
         \                 \               / \
          Y (+1)   ->       X (-1)  ->    Z   Y
         /                   \
        X                     Y
    

    (Simplified diagram)

Insertion (put)

  1. Insert the new node like in a regular BST.
  2. Update the height and balance factor of the inserted node and all its ancestors up to the root.
  3. Starting from the inserted node and moving up, find the first node that now violates the AVL property (balance factor becomes +2 or -2).
  4. Apply the appropriate rotation (LL, RR, LR, or RL) at this unbalanced node to restore the balance property for that subtree. This single rotation (or double rotation) is sufficient to rebalance the entire tree after an insertion.
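
The four cases can be combined into one recursive insertion routine. The following is a self-contained sketch (illustrative names, keys only; each node caches its subtree height, from which balance factors are derived):

```python
class AVLNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 0                # height of this subtree; a leaf is 0

def h(node):
    return node.height if node else -1

def update(node):
    node.height = 1 + max(h(node.left), h(node.right))

def balance(node):
    return h(node.left) - h(node.right)

def rotate_right(y):
    x = y.left
    y.left = x.right
    x.right = y
    update(y); update(x)               # children first, then new root
    return x

def rotate_left(x):
    y = x.right
    x.right = y.left
    y.left = x
    update(x); update(y)
    return y

def insert(node, key):
    if node is None:
        return AVLNode(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    else:
        return node                    # duplicates ignored in this sketch
    update(node)
    bf = balance(node)
    if bf > 1:                         # left-heavy
        if balance(node.left) < 0:     # LR: first rotate the left child left
            node.left = rotate_left(node.left)
        return rotate_right(node)      # LL (or finished LR)
    if bf < -1:                        # right-heavy
        if balance(node.right) > 0:    # RL: first rotate the right child right
            node.right = rotate_right(node.right)
        return rotate_left(node)       # RR (or finished RL)
    return node
```

Inserting the keys 1..15 in sorted order, which would degenerate a plain BST into a 15-node chain, yields a perfectly balanced tree of height 3.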

Deletion (delete)

  1. Delete the node like in a regular BST.
  2. Update the height and balance factor of the parent of the deleted node and all its ancestors up to the root.
  3. As with insertion, check balance factors on the path back to the root. If any node becomes unbalanced, apply the appropriate rotation(s). Unlike insertion, deletion might require multiple rotations as you trace back up the tree, because rebalancing a subtree might cause its parent to become unbalanced.

Performance Analysis

Due to the stricter balance property, the height of an AVL tree is always very close to $\log_2 N$, specifically $h < 1.44 \log_2(N+2)$.

  • get(key): $O(\log N)$
  • put(key, value): $O(\log N)$ (insertion, balance-factor updates, and at most one single or double rotation take logarithmic time).
  • delete(key): $O(\log N)$ (deletion, balance-factor updates, and rotations tracing back up the tree take logarithmic time).
  • Ordered Operations, Traversal, Range Search: Same logic as for BSTs and Red-Black trees, but with guaranteed logarithmic height: $O(\log N)$, $O(N)$, and $O(\log N + k)$ respectively.

Key Takeaway: AVL trees offer the strongest height guarantee among self-balancing BSTs, potentially leading to slightly faster search times compared to Red-Black trees. However, maintaining the stricter balance may require more rotations (and thus more complex update logic and potentially slightly slower insertions/deletions in practice compared to Red-Black trees).

6. Comparison of Search Trees

Here's a summary comparing the main types of binary search trees:

| Feature | Basic BST | Red-Black Tree | AVL Tree |
|---|---|---|---|
| Balancing | None | Self-balancing | Self-balancing |
| Height | $O(N)$ worst case, $O(\log N)$ average | $O(\log N)$ guaranteed | $O(\log N)$ guaranteed (stricter bound) |
| get | $O(height)$ | $O(\log N)$ | $O(\log N)$ |
| put | $O(height)$ | $O(\log N)$ | $O(\log N)$ |
| delete | $O(height)$ | $O(\log N)$ | $O(\log N)$ |
| Implementation complexity | Simple | Moderate/Complex | Moderate/Complex |
| Extra space per node | Minimal | 1 bit (color) | 2 bits (balance factor) or an integer (height) |
| Balancing mechanism | N/A | Color flips + rotations | Balance factor checks + rotations |
| Rotations per update | N/A | Worst case $O(1)$ rotations (but up to $O(\log N)$ recolorings) | Amortized $O(1)$; worst case $O(\log N)$ rotations (on delete) |
| Typical use | Teaching, small static sets, known-random input | Standard library maps/sets (e.g., C++ std::map, Java TreeMap) | Applications with frequent searches, fewer updates |

In practice, Red-Black trees are often preferred in standard libraries because they offer a good balance between guaranteed logarithmic performance and implementation complexity. Updates typically require no more rotations than in AVL trees, even though AVL trees have a strictly smaller height bound.

7. Extensions: Multi-way Search Trees (B-Trees and B+Trees)

Binary search trees are excellent for data stored in main memory (RAM), where accessing any memory location takes roughly the same amount of time ( $O(1)$ on average). However, when dealing with extremely large datasets that reside on slower external storage devices like hard drives or SSDs, the cost of accessing data (disk I/O) becomes the dominant performance factor.

Disk access is orders of magnitude slower than RAM access. Binary trees, even balanced ones like Red-Black or AVL, are inefficient on disk because each node access might require a separate disk read. An $O(log N)$ height tree could still mean many disk reads for a single operation if the tree is large.

To minimize disk I/O, we need trees that are "bushier" or "shallower" – trees with a higher branching factor than 2. This leads to Multi-way Search Trees, where a single node can have more than two children (often hundreds or thousands). The goal is to make each node's size roughly equal to the size of a disk block or page, so that loading a node into memory only requires a single disk access.

B-Trees

B-Trees are multi-way search trees specifically designed for external storage. A B-Tree of order m has the following properties:

  • Each node has at most m children.
  • Each non-leaf node with k children contains k-1 keys.
  • The keys in each node are sorted.
  • All keys in the child subtree between two adjacent keys $K_i$ and $K_{i+1}$ lie strictly between $K_i$ and $K_{i+1}$. Keys in the subtree before the first key $K_1$ are less than $K_1$, and keys in the subtree after the last key $K_{k-1}$ are greater than $K_{k-1}$.
  • Every node (except root) must have at least $\lceil m/2 \rceil$ children (this ensures nodes are at least half-full). The root must have at least 2 children (unless it's the only node).
  • All leaves are at the same depth (the height of the tree).

Due to the high branching factor m, the height of a B-Tree is very small, proportional to $\log_m N$. This significantly reduces the number of disk accesses required for operations.

  • get, put, delete: Operations traverse a single path from the root to a leaf. Within each node, a binary search is performed over the keys (which is fast in memory). The dominant cost is the number of disk reads, which is equal to the height of the tree, $O(\log_m N)$.
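
The per-node binary search can be illustrated with a small in-memory model of a B-Tree node (`BTreeNode` and `btree_search` are illustrative names; a real implementation would read each node from a disk block rather than follow in-memory pointers):

```python
from bisect import bisect_left

class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                   # sorted keys stored in this node
        self.children = children or []     # empty for a leaf; len(keys)+1 otherwise

def btree_search(node, key):
    """Return True if key is in the subtree. One node is visited per level,
    which in an on-disk B-Tree corresponds to one disk read per level."""
    while node is not None:
        i = bisect_left(node.keys, key)    # binary search inside the node
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:              # reached a leaf without a match
            return False
        node = node.children[i]            # descend into the i-th child
    return False
```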

Applications: B-Trees are widely used in databases and file systems for indexing, allowing efficient retrieval of data stored on disk.

B+Trees

B+Trees are a common variation of B-Trees, particularly favored in database systems. They have slightly different properties:

  • All keys and data records are stored in the leaves. Internal nodes contain only keys used for navigation, not data.
  • Internal nodes only store keys and pointers to child nodes. These keys are essentially duplicates of the smallest keys in the subtrees they point to, or just separator keys.
  • The leaf nodes are linked together sequentially like a sorted linked list.

Advantages of B+Trees:

  • Efficient Range Queries: Because all data is in the leaves and the leaves are linked, finding all records in a range [low, high] is very efficient. You find the first key >= low in the leaves and then traverse the linked list until you find a key > high.
  • Optimized for Storage: Internal nodes are smaller as they don't store data, allowing a higher branching factor m and thus a shallower tree.
  • Uniform Leaf Structure: All searches for records terminate at the leaf level, simplifying implementation.
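
The range-query advantage can be illustrated by modeling just the linked leaf level (illustrative names; a real B+Tree would first descend from the root to the leaf containing low, rather than scanning from the head of the chain):

```python
class LeafNode:
    """B+Tree leaf: sorted keys plus a pointer to the next leaf."""
    def __init__(self, keys):
        self.keys = keys
        self.next = None

def link(leaves):
    """Chain the leaves together and return the first one."""
    for a, b in zip(leaves, leaves[1:]):
        a.next = b
    return leaves[0]

def range_query(first_leaf, low, high):
    """Collect all keys in [low, high] by walking the leaf chain."""
    result = []
    leaf = first_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return result          # keys are sorted: nothing more can match
            if k >= low:
                result.append(k)
        leaf = leaf.next               # follow the sequential leaf link
    return result
```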

Applications: B+Trees are the most common structure for primary and secondary indexes in relational databases. They are also used in file systems.

8. Conclusion

We've explored the evolution of search trees, starting with the fundamental Binary Search Tree (BST). While simple and efficient on average, the BST suffers from poor worst-case performance due to potential imbalance.

This led to the development of self-balancing BSTs like Red-Black Trees and AVL Trees. These structures use clever algorithms (involving rotations and other fix-ups) to maintain a logarithmic height guarantee, ensuring $O(\log N)$ performance for all basic operations, making them suitable for dynamic datasets where predictable performance is required. Red-Black trees are popular in standard libraries for their balance of performance and implementation complexity, while AVL trees offer a slightly stricter balance, potentially favoring read-heavy applications.

Finally, we looked at Multi-way Search Trees, like B-Trees and B+Trees, designed specifically for large datasets stored on external storage. By maximizing the branching factor and minimizing height, they drastically reduce slow disk I/O, making operations like searching and range queries efficient even for terabytes of data. B+Trees, with their leaf-node data storage and linked leaves, are particularly well-suited for database indexing.

Understanding these tree structures provides valuable insight into how large amounts of ordered data are managed efficiently in practice, forming the backbone of many computing systems you interact with daily.
