Summary
The current UPath.__eq__ definition includes storage_options in equality checks. While sensible for Python object semantics, this creates unintuitive behavior for path operations. This RFC proposes distinguishing "equality" (same Python object configuration) from "equivalence" (same filesystem resource).
Note: I use S3Paths as examples throughout this issue because that's my specific use case, but the concepts should generalize to other filesystem implementations.
Background: pathlib.Path vs UPath Equality
Users migrating from pathlib.Path to UPath may expect similar equality semantics. However, there's a significant difference:
from pathlib import Path
from upath import UPath
# pathlib.Path: equality depends only on the path itself
Path('/tmp/file.txt') == Path('/tmp/file.txt') # True (always)
# UPath: equality based on protocol + path + storage_options
UPath('s3://bucket/file.txt') == UPath('s3://bucket/file.txt')
UPath('s3://bucket/file.txt') != UPath('s3://bucket/file.txt', anon=False)
UPath('s3://bucket/file.txt') != UPath('s3://bucket/file.txt', anon=True)
This difference has caused subtle bugs in my codebase when code assumes that paths referring to the same filesystem object will compare equal. The current behavior is documented only in the __eq__ docstring and isn't mentioned in the migration guide or concept docs.
Current Behavior
def __eq__(self, other: object) -> bool:
    """UPaths are considered equal if their protocol, path and storage_options are equal."""
All of these paths refer to the same S3 object, but are not equal:
url = 's3://mybucket/object.txt'
UPath(url) == UPath(url, anon=True) # False - different auth
UPath(url) == UPath(url, default_block_size=1024) # False - different performance option
UPath(url) == UPath(url, profile='test') # False - different credentials
Problem
This equality definition causes unintuitive behavior in path operations:
p1 = UPath('s3://mybucket/a/object.txt', default_block_size=1024)
p2 = UPath('s3://mybucket/a')
p1.is_relative_to(p2) # False - unexpected
p2 in p1.parents # False - unexpected
p1.relative_to(p2) # Raises ValueError: incompatible storage_options
Notably, __hash__ already ignores storage_options:
hash(UPath(url)) == hash(UPath(url, anon=True)) # True
This asymmetry suggests the current design already acknowledges that storage_options shouldn't always affect path identity.
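The asymmetry is legal under Python's data model, which only requires that equal objects hash equally, not the reverse. A minimal sketch of the consequence, using a toy `Handle` class that is a hypothetical stand-in and not the real UPath implementation: two objects can share a hash and still be distinct dict and set keys.

```python
class Handle:
    """Toy stand-in for a path object. NOT the real UPath implementation."""

    def __init__(self, url, **options):
        self.url = url
        self.options = options

    def __eq__(self, other):
        # Equality includes the options, like the __eq__ described above.
        if not isinstance(other, Handle):
            return NotImplemented
        return (self.url, self.options) == (other.url, other.options)

    def __hash__(self):
        # The hash looks at the URL only, like a hash that ignores options.
        return hash(self.url)


a = Handle("s3://mybucket/object.txt")
b = Handle("s3://mybucket/object.txt", anon=True)

same_hash = hash(a) == hash(b)    # both land in the same hash bucket
equal = a == b                    # but they do not compare equal
distinct_in_set = len({a, b})     # so a set keeps both
lookup_hit = b in {a: "cached"}   # and a dict keyed by `a` misses for `b`
```

So a shared hash alone does not make a cache or set treat two differently configured paths as one entry; only `__eq__` decides that.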
Minimum Ask: Better Documentation
At a minimum, the equality semantics should be documented more prominently:
- In the migration guide (especially for users coming from pathlib)
- In the concepts documentation
- Potentially with examples of the gotchas shown above
Currently, this behavior is only documented in the __eq__ docstring itself.
Proposed Concept: Path Equivalence
Define equivalence as paths that refer to the same filesystem resource, ignoring options that don't affect which resource is addressed:
Should NOT affect equivalence:
- Authentication: anon, key, secret, token, profile, session (aiobotocore.Session for custom initialization), etc.
- Performance: default_block_size, default_cache_type, max_concurrency, etc.
- Behavior: default_acl, requester_pays, etc.
SHOULD affect equivalence:
- Different endpoints/hosts (different services entirely)
S3 Example
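from typing import Any

def endpoint_url(url: str) -> dict[str, Any]:
    # Build the storage_options fragment that points the S3 client at a custom endpoint
    return {'client_kwargs': {'endpoint_url': url}}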
url = 's3://mybucket/object.txt'
# These are equivalent (same AWS S3 resource):
UPath(url)
UPath(url, **endpoint_url('https://s3.amazonaws.com'))
UPath(url, **endpoint_url('https://s3.us-east-1.amazonaws.com'))
# NOT equivalent (different service):
UPath(url, **endpoint_url('http://localhost:9000')) # MinIO, LocalStack, Moto, etc.
Discussion Questions
1. Should equality change, or should we add an equivalence method?
Option A: Add is_equivalent_to() method
- Backward compatible
- Explicit distinction between concepts
- Methods like relative_to() could optionally use equivalence
Option B: Change __eq__ to use equivalence semantics
- More intuitive default behavior
- Breaking change for code relying on current behavior
- Would require __hash__ changes for consistency
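To make Option A concrete, here is a rough sketch on a toy class. `ToyPath` and its attributes are hypothetical stand-ins rather than UPath itself, and this version ignores storage_options entirely, leaving endpoint handling out. Strict `==` stays as it is today, and the new method is an explicit opt-in.

```python
class ToyPath:
    """Toy stand-in exposing protocol, path and storage_options. Not the real UPath."""

    def __init__(self, protocol, path, **storage_options):
        self.protocol = protocol
        self.path = path
        self.storage_options = storage_options

    def __eq__(self, other):
        # Unchanged strict equality: storage_options still participate.
        if not isinstance(other, ToyPath):
            return NotImplemented
        return (self.protocol, self.path, self.storage_options) == (
            other.protocol,
            other.path,
            other.storage_options,
        )

    def __hash__(self):
        return hash((self.protocol, self.path))

    def is_equivalent_to(self, other):
        # Hypothetical Option A method: compares protocol and path only.
        return (self.protocol, self.path) == (other.protocol, other.path)


plain = ToyPath("s3", "mybucket/object.txt")
anonymous = ToyPath("s3", "mybucket/object.txt", anon=True)
elsewhere = ToyPath("s3", "mybucket/other.txt")

strictly_equal = plain == anonymous
equivalent = plain.is_equivalent_to(anonymous)
different_resource = plain.is_equivalent_to(elsewhere)
```

Existing callers that rely on `==` see no change, which is the backward-compatibility argument for this option.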
2. How should equivalence be defined per-filesystem?
A possible approach for S3:
def _equivalence_key(self) -> tuple:
    """Return a tuple that identifies the filesystem resource."""
    # Get the actual endpoint_url from the S3 client rather than storage_options
    endpoint = self.fs.s3.meta.endpoint_url
    normalized_endpoint = normalize_s3_endpoint(endpoint)
    return (self.protocol, self.__vfspath__(), normalized_endpoint)
The base UPath could default to (protocol, __vfspath__(), storage_options) to preserve current semantics until filesystems define more precise equivalence.
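The `normalize_s3_endpoint` helper above is not an existing function. One possible shape for it, as a sketch only: it assumes that every standard AWS endpoint (global, regional, or legacy dash-style) should collapse to a single `"aws"` marker, following the S3 example in this issue, and that any other host is identified by hostname and port. Dual-stack, FIPS, and non-standard partition hostnames are deliberately not handled here.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Matches s3.amazonaws.com, s3.<region>.amazonaws.com and s3-<region>.amazonaws.com.
_AWS_S3_HOST = re.compile(r"^s3([.-][a-z0-9-]+)?\.amazonaws\.com$")


def normalize_s3_endpoint(endpoint: Optional[str]) -> str:
    """Hypothetical helper: map an endpoint URL to a comparable service key."""
    if endpoint is None:
        # Assumption: no explicit endpoint means the default AWS service.
        return "aws"
    parts = urlsplit(endpoint)
    host = (parts.hostname or "").lower()
    if _AWS_S3_HOST.match(host):
        return "aws"
    # Any other host (MinIO, LocalStack, ...) is keyed by host and port.
    port = f":{parts.port}" if parts.port else ""
    return f"{host}{port}"


default_key = normalize_s3_endpoint(None)
global_key = normalize_s3_endpoint("https://s3.amazonaws.com")
regional_key = normalize_s3_endpoint("https://s3.us-east-1.amazonaws.com")
local_key = normalize_s3_endpoint("http://localhost:9000")
```

Whether regional endpoints should really collapse into one key is itself a design question, since it depends on bucket names being unique across the regions being compared.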
3. Should path methods use equivalence?
If equivalence is defined, should these methods use it by default?
- is_relative_to()
- relative_to()
- parents containment check
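As an illustration of what an equivalence-based check might look like, the sketch below works on a hypothetical `ResourceKey` tuple of protocol, path and normalized endpoint instead of real UPath objects. It first requires the two keys to address the same service, then falls back to ordinary pure-path containment.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ResourceKey(NamedTuple):
    """Hypothetical equivalence key: which service, and which path on it."""

    protocol: str
    path: str
    endpoint: str


def is_relative_to(child: ResourceKey, parent: ResourceKey) -> bool:
    # Paths on different services are never relative to each other.
    if (child.protocol, child.endpoint) != (parent.protocol, parent.endpoint):
        return False
    child_path = PurePosixPath(child.path)
    parent_path = PurePosixPath(parent.path)
    # Same path, or the parent is one of the child's ancestors.
    return parent_path == child_path or parent_path in child_path.parents


obj = ResourceKey("s3", "mybucket/a/object.txt", "aws")
folder = ResourceKey("s3", "mybucket/a", "aws")
local_folder = ResourceKey("s3", "mybucket/a", "localhost:9000")

same_service = is_relative_to(obj, folder)
other_service = is_relative_to(obj, local_folder)
```

Options such as default_block_size never enter the key, so they cannot make the check fail the way the p1/p2 example in the Problem section does.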
Related
- Current __hash__ already ignores storage_options (design precedent)
- This affects any code doing path comparisons across UPath instances with different configurations
Co-Authored-By: Claude noreply@anthropic.com