Add documentation for DEFAULT_LOCATION_OBJECT_STORAGE_PREFIX_ENABLED #3621

@adutra

Description

Is your feature request related to a problem? Please describe.

The DEFAULT_LOCATION_OBJECT_STORAGE_PREFIX_ENABLED feature configuration could benefit from
improved documentation to clarify its purpose, limitations, and relationship with Iceberg's
write.object-storage.enabled feature.

Background

Polaris has a feature configuration named DEFAULT_LOCATION_OBJECT_STORAGE_PREFIX_ENABLED, introduced in
commit bd83252.

The feature name and current description may lead users to believe it provides similar functionality
to Iceberg's Object Store File Layout, but the two features work at different levels and are designed to be complementary.

How the features differ

Iceberg's write.object-storage.enabled applies entropy (a hash-based prefix) on a per-file
basis: each file gets its own hash prefix.

Polaris's DEFAULT_LOCATION_OBJECT_STORAGE_PREFIX_ENABLED applies entropy once per table, based on the table identifier. All files in the same table share the same hash prefix.

Example

Consider two data files in a table newdb.newtable:

Standard layout (no entropy):

s3://bucket/warehouse/newdb/newtable/data/file1.parquet
s3://bucket/warehouse/newdb/newtable/data/file2.parquet

With Iceberg's object store layout only (per-file entropy):

s3://bucket/warehouse/newdb/newtable/data/0011/0100/1011/11101010/file1.parquet
s3://bucket/warehouse/newdb/newtable/data/0011/0001/0001/00000001/file2.parquet

With Polaris's object storage prefix only (per-table entropy):

s3://bucket/warehouse/1111/1111/0100/01010000/newdb/newtable/data/file1.parquet
s3://bucket/warehouse/1111/1111/0100/01010000/newdb/newtable/data/file2.parquet

With both features combined:

s3://bucket/warehouse/1111/1111/0100/01010000/newdb/newtable/data/0011/0100/1011/11101010/file1.parquet
s3://bucket/warehouse/1111/1111/0100/01010000/newdb/newtable/data/0011/0001/0001/00000001/file2.parquet
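
The four layouts above can be reproduced with a small sketch. This is only an illustration of where each prefix lands in the path, not the code either project runs: `entropy_dirs` and `data_file_path` are hypothetical helpers, SHA-256 is a stand-in hash chosen for convenience, and the 4/4/4/8-bit directory split is simply copied from the example paths, so the actual bit values will differ from those shown above.

```python
import hashlib


def entropy_dirs(key: str) -> str:
    """Render 20 hash bits of `key` as 'bbbb/bbbb/bbbb/bbbbbbbb'.

    Assumption: SHA-256 is a stand-in, not the hash Polaris or Iceberg uses.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Take the first 24 bits and drop the lowest 4 to keep 20 bits.
    bits = format(int.from_bytes(digest[:3], "big") >> 4, "020b")
    return "/".join([bits[0:4], bits[4:8], bits[8:12], bits[12:20]])


def data_file_path(warehouse: str, namespace: str, table: str, file_name: str,
                   table_prefix: bool = False, file_prefix: bool = False) -> str:
    """Build a data file path with optional per-table and per-file entropy."""
    parts = [warehouse]
    if table_prefix:
        # Per-table entropy: derived from the table identifier only,
        # so every file in the table gets the same prefix.
        parts.append(entropy_dirs(f"{namespace}.{table}"))
    parts += [namespace, table, "data"]
    if file_prefix:
        # Per-file entropy: derived from the file name, so it varies by file.
        parts.append(entropy_dirs(file_name))
    parts.append(file_name)
    return "/".join(parts)


for table_prefix, file_prefix in [(False, False), (False, True),
                                  (True, False), (True, True)]:
    for name in ("file1.parquet", "file2.parquet"):
        print(data_file_path("s3://bucket/warehouse", "newdb", "newtable", name,
                             table_prefix=table_prefix, file_prefix=file_prefix))
```

With `table_prefix=True`, both files print the same leading hash directories; with `file_prefix=True`, the directories under `data/` differ per file.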

Describe the solution you'd like

The documentation should clarify:

  1. Purpose: Polaris's layout distributes different tables across the key space, preventing hotspots when multiple tables in the same namespace are accessed concurrently. It does not distribute files within a single table.
  2. Limitations: Since all files in a table share the same prefix, this layout alone does not prevent hotspots when a single table receives heavy write traffic.
  3. Complementary usage: As stated in the original commit, "The two features can and should be combined to achieve the best distribution of data files throughout the key space."
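
To make points 1 and 2 concrete, the following sketch counts how many distinct hash prefixes each approach yields. It is a toy model under stated assumptions: `hash_bits` is a hypothetical helper, SHA-256 stands in for whatever hash each project actually uses, and the table and file names are invented for the illustration.

```python
import hashlib


def hash_bits(key: str, n_bits: int = 20) -> str:
    """Return the top `n_bits` of a SHA-256 digest as a bit string (stand-in hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return format(int.from_bytes(digest[:4], "big") >> (32 - n_bits), f"0{n_bits}b")


# Hypothetical workload: 50 tables in one namespace, 100 files per table.
tables = [f"newdb.table{i}" for i in range(50)]
files = [f"file{j}.parquet" for j in range(100)]

# Per-table entropy: the prefix depends only on the table identifier.
prefixes_across_tables = {hash_bits(t) for t in tables}
prefixes_within_one_table = {hash_bits(tables[0]) for _ in files}

# Per-file entropy: the prefix depends on the file name.
prefixes_per_file = {hash_bits(f) for f in files}

print("per-table entropy, across tables:   ", len(prefixes_across_tables))
print("per-table entropy, within one table:", len(prefixes_within_one_table))
print("per-file entropy, within one table: ", len(prefixes_per_file))
```

Within a single table, per-table entropy always collapses to one prefix, which is the limitation described in point 2; only per-file entropy spreads that table's files across the key space.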


Labels: documentation, enhancement
