UnionFind

Jorge Londoño edited this page May 28, 2025 · 1 revision

Union-Find / Disjoint Set Data Structure

The Union-Find data structure, also known as the Disjoint Set Union (DSU) data structure, is a powerful tool for managing a collection of disjoint sets. It provides an efficient way to keep track of partitions of a set into non-overlapping subsets.

At its core, Union-Find solves the problem of maintaining dynamic sets and supporting two primary operations on these sets:

  1. Find: Determine which set a particular element belongs to.
  2. Union: Merge two sets into a single set.

This data structure is particularly useful for problems involving connectivity. Consider a collection of elements, where relationships (like "connected to") can be established between pairs of elements over time. We are interested in determining if any two elements are connected (possibly indirectly through a path of relationships) or in grouping elements into connected components.

The "connected to" relationship typically has the following properties:

  • Reflexivity: An element is connected to itself.
  • Symmetry: If A is connected to B, then B is connected to A.
  • Transitivity: If A is connected to B, and B is connected to C, then A is connected to C.

These properties define an equivalence relation. An equivalence relation partitions a set into disjoint subsets called equivalence classes. In the context of connectivity, these equivalence classes are the connected components. The Union-Find data structure naturally models these disjoint connected components.

Practical Examples where Union-Find is applicable:

  • Kruskal's Algorithm for Minimum Spanning Tree (MST): Used to detect cycles efficiently. When considering an edge (u, v), if u and v are already in the same connected component (same set), adding the edge would create a cycle. Otherwise, they are in different sets, and the edge can be added, merging the two components.
  • Finding Connected Components in a Graph: Given a graph, Union-Find can quickly determine which nodes belong to the same connected component.
  • Image Processing: Analyzing connected regions of pixels.
  • Network Connectivity: Determining if two nodes in a network are connected.
  • Percolation: Studying the properties of random graphs and connectivity.
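As an illustration of the Kruskal use case above, here is a minimal sketch. The function name and edge format are illustrative; it inlines a tiny union-find (with path halving), while the sections below develop the data structure properly.

```python
def kruskal_mst(n, edges):
    """Kruskal's MST sketch using union-find for cycle detection.

    n: number of vertices, labeled 0..n-1
    edges: list of (weight, u, v) tuples
    Returns the edges of a minimum spanning forest.
    """
    parent = list(range(n))  # each vertex starts in its own component

    def find(p):
        while p != parent[p]:
            parent[p] = parent[parent[p]]  # path halving (explained below)
            p = parent[p]
        return p

    mst = []
    for w, u, v in sorted(edges):       # consider edges by increasing weight
        ru, rv = find(u), find(v)
        if ru != rv:                    # different components: no cycle
            parent[ru] = rv             # merge the two components
            mst.append((w, u, v))
    return mst
```

For example, `kruskal_mst(4, [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)])` skips the weight-3 edge (it would close a cycle on vertices 0, 1, 2) and returns the three remaining edges.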

The Union-Find Abstract Data Type (ADT)

The Union-Find data structure operates on a fixed number of elements, let's say N, which are often represented by integers from 0 to N-1. The core operations defining the Union-Find ADT are:

  • UnionFind(N):
    • Purpose: Creates a Union-Find structure for N elements.
    • Initial State: Initially, each element is in its own separate set. There are N sets.
  • find(p):
    • Purpose: Returns the representative (or root) of the set containing element p.
    • Property: find(p) and find(q) return the same value if and only if p and q are in the same set.
  • union(p, q):
    • Purpose: Merges the sets containing elements p and q into a single set.
    • Effect: After union(p, q), p and q will be in the same set (i.e., find(p) == find(q)). If they were already in the same set, this operation does nothing.
  • connected(p, q):
    • Purpose: Checks if elements p and q are in the same set.
    • Implementation: Typically implemented as return find(p) == find(q).
  • count():
    • Purpose: Returns the total number of disjoint sets currently managed by the data structure.
    • Initial State: Starts at N. Decreases by 1 for each successful union operation that merges two distinct sets.
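The contract above can be sketched as a small class plus a usage example. The internals here are a naive placeholder purely to make the interface concrete; the sections below develop progressively better implementations of the same operations.

```python
class UnionFind:
    """Minimal illustration of the Union-Find ADT contract."""

    def __init__(self, n):
        self.id = list(range(n))   # each element starts in its own set
        self._count = n            # number of disjoint sets

    def find(self, p):
        while p != self.id[p]:     # follow parent pointers to the root
            p = self.id[p]
        return p

    def union(self, p, q):
        root_p, root_q = self.find(p), self.find(q)
        if root_p != root_q:       # only merging distinct sets changes anything
            self.id[root_p] = root_q
            self._count -= 1

    def connected(self, p, q):
        return self.find(p) == self.find(q)

    def count(self):
        return self._count
```

Usage: for `uf = UnionFind(5)`, after `uf.union(0, 1)` and `uf.union(1, 2)`, `uf.connected(0, 2)` is `True` and `uf.count()` has dropped from 5 to 3.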

Implementation Alternatives for Union-Find

The efficiency of Union-Find operations (find and union) depends heavily on the underlying data structure used to represent the sets. We will explore several common implementations, evolving from simple but inefficient approaches to highly optimized ones.

Implementation 1: Quick-Find (Using Arrays)

This implementation prioritizes fast find and connected operations.

  • Data Structure: An integer array id of size N. id[i] stores the component identifier (an integer) for element i.
    • Representation: All elements belonging to the same set have the same value in the id array.
    • Initial State: id[i] = i for all i from 0 to N-1. Initially, each element is in its own set, and its identifier is itself.
  • Operations:
    • find(p): Returns id[p]. This is very fast.
      def find(self, p):
          return self.id[p]
    • connected(p, q): Checks if id[p] == id[q]. Also very fast.
      def connected(self, p, q):
          return self.find(p) == self.find(q)
    • union(p, q): This is the costly operation. To merge the set containing p and the set containing q:
      1. Find the component identifiers for p and q (pid = self.find(p), qid = self.find(q)).
      2. If pid == qid, they are already in the same set; do nothing.
      3. If pid != qid, iterate through the entire id array. For every element i, if id[i] is equal to pid, change it to qid. This effectively re-labels all elements in p's set to have q's set identifier.
      def union(self, p, q):
          pid = self.find(p)
          qid = self.find(q)
      
          if pid == qid:
              return # Already in the same set
      
          # Iterate through all elements and change component IDs
          for i in range(len(self.id)):
              if self.id[i] == pid:
                  self.id[i] = qid
          self._count -= 1 # Decrement the number of sets
  • Complexity Analysis:
    • find, connected: O(1). Accessing an array element is constant time.
    • union: O(N). In the worst case, we might iterate through the entire array of size N.
  • Drawback: While find is fast, the union operation is very slow. Performing N union operations could take O(N^2) time in the worst case (e.g., merging sets one by one, each requiring a full array scan). This is inefficient for large N.
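Putting the snippets above together, a complete Quick-Find class might look like the following (the constructor and the count method are filled in here following the ADT described earlier; they are not shown in the snippets above):

```python
class QuickFindUF:
    """Quick-Find: id[i] holds the component identifier of element i."""

    def __init__(self, n):
        self.id = list(range(n))   # each element is its own component
        self._count = n

    def count(self):
        return self._count

    def find(self, p):
        return self.id[p]          # O(1): direct array access

    def connected(self, p, q):
        return self.find(p) == self.find(q)

    def union(self, p, q):
        pid, qid = self.find(p), self.find(q)
        if pid == qid:
            return                 # already in the same set
        # O(N): relabel every element of p's component with q's identifier
        for i in range(len(self.id)):
            if self.id[i] == pid:
                self.id[i] = qid
        self._count -= 1
```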

Implementation 2: Quick-Union (Using Trees)

This implementation uses a tree structure to represent each set. The root of the tree serves as the representative for the set.

  • Data Structure: An integer array id of size N. id[i] stores the parent of element i.
    • Representation: A root element r is its own parent, meaning id[r] == r. Following parent pointers from any element i eventually leads to the root of its tree (and thus, its set representative).
    • Initial State: id[i] = i for all i from 0 to N-1. Each element is initially a root of its own single-node tree.
  • Operations:
    • find(p): To find the representative of p, traverse up the tree from p by following parent pointers (id[p], id[id[p]], ...) until a root is reached (an element r where id[r] == r).
      def find(self, p):
          # Traverse up until root is found
          while p != self.id[p]:
              p = self.id[p]
          return p
    • connected(p, q): Check if find(p) == find(q).
      def connected(self, p, q):
          return self.find(p) == self.find(q)
    • union(p, q): To merge the sets containing p and q:
      1. Find the roots of p and q (rootP = self.find(p), rootQ = self.find(q)).
      2. If rootP == rootQ, they are already in the same set; do nothing.
      3. If rootP != rootQ, merge the two trees by setting the parent of one root to the other root. For example, set id[rootP] = rootQ. This connects the tree rooted at rootP to the tree rooted at rootQ.
      def union(self, p, q):
          rootP = self.find(p)
          rootQ = self.find(q)
      
          if rootP == rootQ:
              return # Already in the same set
      
          # Attach rootP's tree to rootQ's tree
          self.id[rootP] = rootQ
          self._count -= 1 # Decrement the number of sets
  • Complexity Analysis:
    • find, connected, union: O(depth of tree). The time taken is proportional to the height of the tree containing the elements. In the worst case (e.g., union operations always creating a long, unbalanced tree resembling a linked list), the depth can be N. So, worst-case complexity is O(N).
  • Drawback: Although union no longer scans the entire array (it only finds two roots and updates one parent pointer), every operation is still O(N) in the worst case, because nothing prevents the trees from growing tall and degenerate.
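Assembled into a class (the constructor again follows the ADT and is an assumption, not shown in the snippets above):

```python
class QuickUnionUF:
    """Quick-Union: id[i] is the parent of i; a root r satisfies id[r] == r."""

    def __init__(self, n):
        self.id = list(range(n))   # each element is its own root
        self._count = n

    def count(self):
        return self._count

    def find(self, p):
        while p != self.id[p]:     # climb parent pointers: O(depth)
            p = self.id[p]
        return p

    def connected(self, p, q):
        return self.find(p) == self.find(q)

    def union(self, p, q):
        root_p, root_q = self.find(p), self.find(q)
        if root_p == root_q:
            return
        self.id[root_p] = root_q   # attach one root under the other
        self._count -= 1
```

The degenerate case is easy to reproduce: calling `union(i, i + 1)` for i = 0, 1, ..., n-2 builds a linked-list-shaped tree of depth n-1, so a subsequent `find(0)` walks the whole chain.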

Implementation 3: Weighted Quick-Union (Balancing Trees by Size or Rank)

To improve Quick-Union's worst-case performance, we can avoid creating tall trees by always attaching the smaller tree to the larger tree during the union operation. This strategy is called "weighting" or "balancing". We can balance by the number of nodes (size) or by the height/rank of the trees. Balancing by size is slightly easier to implement.

  • Data Structure:
    • id array: Same as Quick-Union (id[i] is parent of i).
    • sz array: An integer array of size N, where sz[i] stores the number of nodes in the tree rooted at i. This value is only relevant if i is a root (id[i] == i).
    • Initial State: id[i] = i and sz[i] = 1 for all i from 0 to N-1. Each element is initially a root of a tree of size 1.
  • Operations:
    • find(p): Same as Quick-Union. O(depth).
      def find(self, p):
          while p != self.id[p]:
              p = self.id[p]
          return p
    • connected(p, q): Same as Quick-Union. O(depth).
      def connected(self, p, q):
          return self.find(p) == self.find(q)
    • union(p, q):
      1. Find the roots of p and q (rootP = self.find(p), rootQ = self.find(q)).
      2. If rootP == rootQ, return.
      3. If rootP != rootQ, compare the sizes of the trees rooted at rootP and rootQ (self.sz[rootP] vs self.sz[rootQ]).
      4. Attach the root of the smaller tree to the root of the larger tree.
      5. Update the size of the new root (add the size of the smaller tree to the size of the larger tree).
      def union(self, p, q):
          rootP = self.find(p)
          rootQ = self.find(q)
      
          if rootP == rootQ:
              return # Already in the same set
      
          # Attach smaller tree to larger tree's root
          if self.sz[rootP] < self.sz[rootQ]:
              self.id[rootP] = rootQ
              self.sz[rootQ] += self.sz[rootP]
          else:
              self.id[rootQ] = rootP
              self.sz[rootP] += self.sz[rootQ]
      
          self._count -= 1 # Decrement the number of sets
  • Complexity Analysis:
    • find, connected, union: O(log N). By always attaching the smaller tree to the larger, the maximum depth of any node is guaranteed to be logarithmic. The key observation: a node's depth increases by 1 only when the tree containing it is attached to a tree at least as large, and in that case the size of the tree containing the node at least doubles. A tree's size cannot double more than log₂N times before reaching N, so no node's depth ever exceeds log₂N.
  • Benefit: This is a significant improvement over basic Quick-Union, providing logarithmic time complexity for all operations.
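Assembled into a class (constructor assumed as before; the swap trick below is equivalent to the if/else in the union snippet above, just keeping root_p as the larger root):

```python
class WeightedQuickUnionUF:
    """Quick-Union with union by size: tree depth stays O(log N)."""

    def __init__(self, n):
        self.id = list(range(n))   # parent pointers
        self.sz = [1] * n          # sz[r] = tree size, meaningful only at roots
        self._count = n

    def count(self):
        return self._count

    def find(self, p):
        while p != self.id[p]:     # O(log N) thanks to weighting
            p = self.id[p]
        return p

    def connected(self, p, q):
        return self.find(p) == self.find(q)

    def union(self, p, q):
        root_p, root_q = self.find(p), self.find(q)
        if root_p == root_q:
            return
        if self.sz[root_p] < self.sz[root_q]:
            root_p, root_q = root_q, root_p   # ensure root_p is the larger
        self.id[root_q] = root_p   # attach smaller tree under larger root
        self.sz[root_p] += self.sz[root_q]
        self._count -= 1
```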

Optimization: Path Compression

Path compression is an optimization that can be applied to the find operation in tree-based implementations (Quick-Union or Weighted Quick-Union). Its goal is to flatten the tree structure by making nodes directly point to the root.

  • Technique: During a find(p) operation, as we traverse up the tree from p to find its root, we can update the parent pointer of each node visited along that path to point directly to the root.
  • Illustration:
    • Imagine finding the root of element p. You traverse p -> parent(p) -> parent(parent(p)) -> ... -> root.
    • Once the root is found, go back down the path you just traversed. For each node x on the path (excluding the root), set id[x] = root.
  • Implementation (Recursive approach is often cleanest):
    def find(self, p):
        # Base case: p is the root
        if p == self.id[p]:
            return p
        # Recursive step: Find the root of the parent
        root = self.find(self.id[p])
        # Path compression: Set parent of p directly to the root
        self.id[p] = root
        return root
    An iterative two-pass approach is also possible: one pass to find the root, then a second pass to update the parent pointers along the path. A simpler one-pass iterative variant, known as path halving, sets id[p] = id[id[p]] at each step of the traversal; this is not full compression, but it yields the same asymptotic guarantees. The recursive version shown achieves full path compression.
  • Effect: Path compression doesn't change the worst-case time of a single find operation (it could still traverse a long path), but it significantly improves the performance of future find operations involving any node on the compressed path or its descendants. It effectively flattens the tree over time.
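The one-pass path-halving variant mentioned above can be written out as follows. It is presented here as a standalone function over a parent-pointer list (the function name is illustrative), rather than as a method, so it can be tried in isolation:

```python
def find_with_path_halving(parent, p):
    """Iterative find with path halving on a parent-pointer list.

    Every node visited is re-pointed to its grandparent, so each
    traversal roughly halves the length of the path it walks.
    """
    while p != parent[p]:
        parent[p] = parent[parent[p]]  # skip one level up the tree
        p = parent[p]
    return p
```

On the chain `parent = [1, 2, 3, 4, 4]` (0 → 1 → 2 → 3 → 4), `find_with_path_halving(parent, 0)` returns 4 and leaves `parent == [2, 2, 4, 4, 4]`: the path from 0 has shrunk from length 4 to length 2.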

Implementation 4: Weighted Quick-Union with Path Compression

Combining Weighted Quick-Union (using size or rank) with Path Compression yields the most efficient and standard implementation of the Union-Find data structure.

  • Algorithms:
    • id array: Stores parent pointers.
    • sz array: Stores sizes of trees rooted at i (if i is a root).
    • find(p): Use the path compression algorithm (recursive or iterative).
    • union(p, q): Use the weighted union algorithm (attach smaller tree's root to larger tree's root, update size), but call the path-compressed find within it.
  • Example find (recursive path compression):
    def find(self, p):
        if p == self.id[p]:
            return p
        self.id[p] = self.find(self.id[p]) # Path compression
        return self.id[p]
  • Example union (calls path-compressed find):
    def union(self, p, q):
        rootP = self.find(p) # Uses path-compressed find
        rootQ = self.find(q) # Uses path-compressed find
    
        if rootP == rootQ:
            return # Already in the same set
    
        # Weighted union
        if self.sz[rootP] < self.sz[rootQ]:
            self.id[rootP] = rootQ
            self.sz[rootQ] += self.sz[rootP]
        else:
            self.id[rootQ] = rootP
            self.sz[rootP] += self.sz[rootQ]
    
        self._count -= 1
  • Amortized Complexity Analysis:
    • The combination of weighted union and path compression is remarkably efficient. The amortized time complexity for find, connected, and union operations on a structure with N elements is O(α(N)), where α is the inverse Ackermann function.
    • The inverse Ackermann function grows extremely slowly. For any practically obtainable number of elements N, α(N) is less than 5. This means that, on average over a sequence of operations, each operation takes nearly constant time.
    • Amortized Analysis: It's important to understand that this is an amortized bound. A single operation might still take slightly longer (e.g., traversing a path before compression), but over a sequence of many operations, the total time divided by the number of operations approaches this nearly constant value. Path compression makes subsequent operations faster, compensating for the cost of earlier ones.
  • Conclusion: This is the standard and most efficient implementation of the Union-Find data structure, widely used in practice.
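One way to assemble the pieces above into a full class (reusing the ADT name; the constructor and count method again follow the ADT described earlier):

```python
class UnionFind:
    """Weighted quick-union with path compression: O(alpha(N)) amortized."""

    def __init__(self, n):
        self.id = list(range(n))   # parent pointers
        self.sz = [1] * n          # subtree sizes, valid at roots
        self._count = n

    def count(self):
        return self._count

    def find(self, p):
        if p != self.id[p]:
            self.id[p] = self.find(self.id[p])  # path compression
        return self.id[p]

    def connected(self, p, q):
        return self.find(p) == self.find(q)

    def union(self, p, q):
        root_p, root_q = self.find(p), self.find(q)
        if root_p == root_q:
            return
        if self.sz[root_p] < self.sz[root_q]:
            root_p, root_q = root_q, root_p   # keep root_p as the larger root
        self.id[root_q] = root_p              # smaller tree under larger
        self.sz[root_p] += self.sz[root_q]
        self._count -= 1
```

Note that the recursive find can hit Python's default recursion limit on very deep trees; with weighted union in place the depth stays logarithmic, so this is rarely an issue in practice.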

Summary and Further Considerations

The Union-Find data structure is a classic example of how combining simple ideas (array representation, tree representation) with clever optimizations (weighting, path compression) can lead to extremely efficient algorithms.

We started with Quick-Find (O(1) find, O(N) union), moved to Quick-Union (O(N) worst-case for all), improved to Weighted Quick-Union (O(log N) for all), and finally, by adding Path Compression, achieved the standard implementation with O(α(N)) amortized time per operation, effectively making operations constant time for practical purposes.

The Union-Find data structure with weighted union and path compression is a cornerstone in algorithms dealing with connectivity, set partitioning, and equivalence relations, proving indispensable in areas like graph algorithms (especially MST) and network problems.

While we focused on weighted union by size, weighted union by rank (an upper bound on tree height) provides the same O(α(N)) complexity. Sometimes the structure needs to store additional data associated with each set (e.g., the sum of the values in the set, or its minimum element). Such data can be stored at the root node and combined during the union operation.
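As a sketch of that augmentation idea, here is a union-find whose roots additionally track the sum of the values in their set (the class and method names, and the choice of sum as the aggregate, are illustrative):

```python
class SumUnionFind:
    """Union-find where each root also stores the sum of its set's values."""

    def __init__(self, values):
        n = len(values)
        self.id = list(range(n))   # parent pointers
        self.sz = [1] * n          # subtree sizes, valid at roots
        self.total = list(values)  # total[r] = sum of r's set, valid at roots

    def find(self, p):
        if p != self.id[p]:
            self.id[p] = self.find(self.id[p])  # path compression
        return self.id[p]

    def union(self, p, q):
        root_p, root_q = self.find(p), self.find(q)
        if root_p == root_q:
            return
        if self.sz[root_p] < self.sz[root_q]:
            root_p, root_q = root_q, root_p
        self.id[root_q] = root_p
        self.sz[root_p] += self.sz[root_q]
        self.total[root_p] += self.total[root_q]  # combine the aggregates

    def set_sum(self, p):
        return self.total[self.find(p)]  # read the aggregate at p's root
```

The same pattern works for any aggregate that can be combined from two sets' values in O(1), such as min, max, or count.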

Understanding Union-Find provides insight into efficient data structure design and the power of amortized analysis.
