Description
Is your feature request related to a problem? Please describe.
The DEFAULT_LOCATION_OBJECT_STORAGE_PREFIX_ENABLED feature configuration could benefit from
improved documentation to clarify its purpose, limitations, and relationship with Iceberg's
write.object-storage.enabled feature.
Background
Polaris has a feature config called DEFAULT_LOCATION_OBJECT_STORAGE_PREFIX_ENABLED, introduced in
commit bd83252.
The feature name and current description may lead users to believe it provides similar functionality
to Iceberg's Object Store File Layout, but the two features work at different levels and are designed to be complementary.
How the features differ
- Iceberg's write.object-storage.enabled applies entropy (a hash-based prefix) on a per-file basis. Each file gets its own hash prefix.
- Polaris's DEFAULT_LOCATION_OBJECT_STORAGE_PREFIX_ENABLED applies entropy once per table, based on the table identifier. All files in the same table share the same hash prefix.
Example
Consider two data files in a table newdb.newtable:
Standard layout (no entropy):
s3://bucket/warehouse/newdb/newtable/data/file1.parquet
s3://bucket/warehouse/newdb/newtable/data/file2.parquet
With Iceberg's object store layout only (per-file entropy):
s3://bucket/warehouse/newdb/newtable/data/0011/0100/1011/11101010/file1.parquet
s3://bucket/warehouse/newdb/newtable/data/0011/0001/0001/00000001/file2.parquet
With Polaris's object storage prefix only (per-table entropy):
s3://bucket/warehouse/1111/1111/0100/01010000/newdb/newtable/data/file1.parquet
s3://bucket/warehouse/1111/1111/0100/01010000/newdb/newtable/data/file2.parquet
With both features combined:
s3://bucket/warehouse/1111/1111/0100/01010000/newdb/newtable/data/0011/0100/1011/11101010/file1.parquet
s3://bucket/warehouse/1111/1111/0100/01010000/newdb/newtable/data/0011/0001/0001/00000001/file2.parquet
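The four layouts above can be reproduced with a small sketch. This is only an illustration of where each prefix lands in the path, not the actual Polaris or Iceberg implementation: `entropy_prefix` and `file_location` are hypothetical helpers, and SHA-256 is a stand-in for whatever hash the real code uses, so the bit patterns will not match the examples above. The 4/4/4/8 bit grouping is taken from those example paths.

```python
import hashlib


def entropy_prefix(key: str) -> str:
    """Hypothetical stand-in: derive 20 bits from `key`, rendered as 4/4/4/8 binary groups."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Take the first 24 bits of the digest and drop the lowest 4 to keep 20 bits.
    bits = format(int.from_bytes(digest[:3], "big") >> 4, "020b")
    return "/".join([bits[0:4], bits[4:8], bits[8:12], bits[12:20]])


def file_location(warehouse: str, namespace: str, table: str, file_name: str,
                  table_prefix: bool = False, file_prefix: bool = False) -> str:
    """Build a data file path with optional per-table and per-file entropy."""
    parts = [warehouse]
    if table_prefix:
        # Polaris-style: hashed once from the table identifier, shared by every file.
        parts.append(entropy_prefix(f"{namespace}.{table}"))
    parts += [namespace, table, "data"]
    if file_prefix:
        # Iceberg-style: hashed per file, so each file gets its own prefix.
        parts.append(entropy_prefix(file_name))
    parts.append(file_name)
    return "/".join(parts)


if __name__ == "__main__":
    for table_prefix, file_prefix in [(False, False), (False, True), (True, False), (True, True)]:
        for name in ("file1.parquet", "file2.parquet"):
            print(file_location("s3://bucket/warehouse", "newdb", "newtable", name,
                                table_prefix=table_prefix, file_prefix=file_prefix))
```

With only `table_prefix=True`, both files end up under the same hashed directory, which is the limitation this issue asks the documentation to call out.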
Describe the solution you'd like
The documentation should clarify:
- Purpose: Polaris's layout distributes different tables across the key space, preventing hotspots when multiple tables in the same namespace are accessed concurrently. It does not distribute files within a single table.
- Limitations: Since all files in a table share the same prefix, this layout alone does not prevent hotspots when a single table receives heavy write traffic.
- Complementary usage: As stated in the original commit, "The two features can and should be combined to achieve the best distribution of data files throughout the key space."
Describe alternatives you've considered
No response
Additional context
No response