RFC: Distinguish path equivalence from equality #532

@kevinjacobs-delfi

Summary

The current UPath.__eq__ definition includes storage_options in equality checks. While sensible for Python object semantics, this creates unintuitive behavior for path operations. This RFC proposes distinguishing "equality" (same Python object configuration) from "equivalence" (same filesystem resource).

Note: I use S3Paths as examples throughout this issue because that's my specific use case, but the concepts should generalize to other filesystem implementations.

Background: pathlib.Path vs UPath Equality

Users migrating from pathlib.Path to UPath may expect similar equality semantics. However, there's a significant difference:

from pathlib import Path
from upath import UPath

# pathlib.Path: equality based on the path's segments only (no filesystem access, no resolution)
Path('/tmp/file.txt') == Path('/tmp/file.txt')  # True (always)

# UPath: equality based on protocol + path + storage_options
UPath('s3://bucket/file.txt') == UPath('s3://bucket/file.txt')              # True
UPath('s3://bucket/file.txt') != UPath('s3://bucket/file.txt', anon=False)  # True - storage_options differ
UPath('s3://bucket/file.txt') != UPath('s3://bucket/file.txt', anon=True)   # True - storage_options differ

This difference has caused subtle bugs in my codebase when code assumes paths referring to the same filesystem object will compare equal. The current behavior is only documented in the __eq__ docstring and isn't mentioned in the migration guide or concept docs.

Current Behavior

def __eq__(self, other: object) -> bool:
    """UPaths are considered equal if their protocol, path and storage_options are equal."""

All of these paths refer to the same S3 object, but are not equal:

url = 's3://mybucket/object.txt'
UPath(url) == UPath(url, anon=True)                # False - different auth
UPath(url) == UPath(url, default_block_size=1024)  # False - different performance option
UPath(url) == UPath(url, profile='test')           # False - different credentials

Problem

This equality definition causes unintuitive behavior in path operations:

p1 = UPath('s3://mybucket/a/object.txt', default_block_size=1024)
p2 = UPath('s3://mybucket/a')

p1.is_relative_to(p2)  # False - unexpected
p2 in p1.parents       # False - unexpected
p1.relative_to(p2)     # Raises ValueError: incompatible storage_options
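
Until the semantics change, one possible workaround is to compare only the protocol and the path string and leave storage_options out. The sketch below assumes the objects expose .protocol and .path attributes the way UPath does; is_relative_ignoring_options is a hypothetical helper name, and stand-in objects are used so the snippet runs without s3fs installed. It also ignores the endpoint, so it would treat a MinIO path and an AWS path as related.

```python
from pathlib import PurePosixPath
from types import SimpleNamespace


def is_relative_ignoring_options(child, parent) -> bool:
    """Hypothetical helper: compare protocol and path only, not storage_options."""
    if child.protocol != parent.protocol:
        return False
    # PurePosixPath compares path segments without touching any filesystem.
    return PurePosixPath(child.path).is_relative_to(PurePosixPath(parent.path))


# Stand-ins for UPath objects (assumed attributes: .protocol, .path, .storage_options).
p1 = SimpleNamespace(protocol="s3", path="mybucket/a/object.txt",
                     storage_options={"default_block_size": 1024})
p2 = SimpleNamespace(protocol="s3", path="mybucket/a", storage_options={})

print(is_relative_ignoring_options(p1, p2))  # True
```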

Notably, __hash__ already ignores storage_options:

hash(UPath(url)) == hash(UPath(url, anon=True))  # True

This asymmetry suggests the current design already acknowledges that storage_options shouldn't always affect path identity.
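
For reference, Python's data model only requires that equal objects hash equally; the reverse is not required, so a __hash__ that ignores a field used by __eq__ is legal. The stand-in class below (hypothetical, not UPath's actual implementation) mimics that arrangement and shows the practical effect: the two objects share a hash but remain distinct set members.

```python
class FakePath:
    """Stand-in that mimics the behavior described above; not the real UPath."""

    def __init__(self, url: str, **storage_options):
        self.url = url
        self.storage_options = storage_options

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, FakePath):
            return NotImplemented
        # Equality includes storage_options...
        return (self.url, self.storage_options) == (other.url, other.storage_options)

    def __hash__(self) -> int:
        # ...but the hash ignores them.
        return hash(self.url)


a = FakePath("s3://mybucket/object.txt")
b = FakePath("s3://mybucket/object.txt", anon=True)

print(hash(a) == hash(b))  # True
print(a == b)              # False
print(len({a, b}))         # 2
```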

Minimum Ask: Better Documentation

At a minimum, the equality semantics should be documented more prominently:

  • In the migration guide (especially for users coming from pathlib)
  • In the concepts documentation
  • Potentially with examples of the gotchas shown above

Currently, this behavior is only documented in the __eq__ docstring itself.

Proposed Concept: Path Equivalence

Define equivalence as paths that refer to the same filesystem resource, ignoring options that don't affect which resource is addressed:

Should NOT affect equivalence:

  • Authentication: anon, key, secret, token, profile, session (an aiobotocore AioSession for custom initialization), etc.
  • Performance: default_block_size, default_cache_type, max_concurrency, etc.
  • Behavior: default_acl, requester_pays, etc.

SHOULD affect equivalence:

  • Different endpoints/hosts (different services entirely)

S3 Example

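from typing import Any

from upath import UPath


def endpoint_url(url: str) -> dict[str, Any]:
    # Wrap an endpoint URL in the storage_options shape that s3fs accepts
    return {'client_kwargs': {'endpoint_url': url}}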

url = 's3://mybucket/object.txt'

# These are equivalent (same AWS S3 resource):
UPath(url)
UPath(url, **endpoint_url('https://s3.amazonaws.com'))
UPath(url, **endpoint_url('https://s3.us-east-1.amazonaws.com'))

# NOT equivalent (different service):
UPath(url, **endpoint_url('http://localhost:9000'))  # MinIO, LocalStack, Moto, etc.

Discussion Questions

1. Should equality change, or should we add an equivalence method?

Option A: Add is_equivalent_to() method

  • Backward compatible
  • Explicit distinction between concepts
  • Methods like relative_to() could optionally use equivalence
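
A rough sketch of how Option A might look, using a stand-in class rather than UPath itself. is_equivalent_to follows this proposal and does not exist in the library today; SketchPath and its _endpoint helper are hypothetical, and the rule that only the endpoint selects a different service is an assumption taken from the S3 example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class SketchPath:
    """Hypothetical stand-in for UPath; only the fields needed here."""
    protocol: str
    path: str
    storage_options: dict = field(default_factory=dict)

    def __eq__(self, other: object) -> bool:
        # Current semantics: storage_options take part in equality.
        if not isinstance(other, SketchPath):
            return NotImplemented
        return (self.protocol, self.path, self.storage_options) == (
            other.protocol, other.path, other.storage_options)

    def __hash__(self) -> int:
        return hash((self.protocol, self.path))

    def _endpoint(self):
        # Assumption: the endpoint is the only option that selects a different service.
        return self.storage_options.get("client_kwargs", {}).get("endpoint_url")

    def is_equivalent_to(self, other: "SketchPath") -> bool:
        """Proposed: same resource, regardless of auth or performance options."""
        return (self.protocol, self.path, self._endpoint()) == (
            other.protocol, other.path, other._endpoint())


a = SketchPath("s3", "mybucket/object.txt")
b = SketchPath("s3", "mybucket/object.txt", {"anon": True})
c = SketchPath("s3", "mybucket/object.txt",
               {"client_kwargs": {"endpoint_url": "http://localhost:9000"}})

print(a == b)                 # False: __eq__ is unchanged
print(a.is_equivalent_to(b))  # True: only auth differs
print(a.is_equivalent_to(c))  # False: different endpoint
```

This sketch does not normalize endpoints, so the default endpoint and an explicit https://s3.amazonaws.com would compare as different.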

Option B: Change __eq__ to use equivalence semantics

  • More intuitive default behavior
  • Breaking change for code relying on current behavior
  • Would require checking that __hash__ stays consistent with the new __eq__ (equal paths must have equal hashes)

2. How should equivalence be defined per-filesystem?

A possible approach for S3:

def _equivalence_key(self) -> tuple:
    """Return a tuple that identifies the filesystem resource."""
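    # Get the actual endpoint_url from the S3 client rather than storage_options
    endpoint = self.fs.s3.meta.endpoint_url
    # normalize_s3_endpoint is a hypothetical helper (not defined here) that
    # maps endpoints of the same service to a single canonical value
    normalized_endpoint = normalize_s3_endpoint(endpoint)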
    return (self.protocol, self.__vfspath__(), normalized_endpoint)

The base UPath could default to (protocol, __vfspath__(), storage_options) to preserve current semantics until filesystems define more precise equivalence.
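
normalize_s3_endpoint is left undefined above; one way it could work is sketched below. The rule is an assumption for illustration: a missing endpoint or any host under amazonaws.com collapses to a single marker value, and everything else is kept as a lowercase scheme://host:port string. Separate AWS partitions (GovCloud, for example) are not handled and would need more careful treatment.

```python
from typing import Optional
from urllib.parse import urlsplit

AWS_MARKER = "aws"  # hypothetical canonical value for the public AWS S3 service


def normalize_s3_endpoint(endpoint: Optional[str]) -> str:
    """Map an endpoint URL to a value that is equal for the same service."""
    if not endpoint:
        return AWS_MARKER  # no explicit endpoint: assume the AWS default
    parts = urlsplit(endpoint)
    host = (parts.hostname or "").lower()
    if host == "amazonaws.com" or host.endswith(".amazonaws.com"):
        return AWS_MARKER  # assumption: global and regional AWS endpoints are one service
    port = f":{parts.port}" if parts.port else ""
    return f"{parts.scheme.lower()}://{host}{port}"


print(normalize_s3_endpoint(None))                                  # aws
print(normalize_s3_endpoint("https://s3.us-east-1.amazonaws.com"))  # aws
print(normalize_s3_endpoint("http://localhost:9000"))               # http://localhost:9000
```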

3. Should path methods use equivalence?

If equivalence is defined, should these methods use it by default?

  • is_relative_to()
  • relative_to()
  • parents containment check

Related

  • Current __hash__ already ignores storage_options (design precedent)
  • This affects any code doing path comparisons across UPath instances with different configurations

Co-Authored-By: Claude noreply@anthropic.com
