Conversation

jerryshao (Contributor) commented Dec 31, 2025

This commit implements a built-in job template for rewriting Iceberg table data files, supporting the binpack and sort strategies for table optimization.

Key Features:

  • Named argument parser supporting flexible parameter combinations
  • Calls Iceberg's native rewrite_data_files stored procedure
  • Supports the binpack and sort rewrite strategies (z-ordering via a zorder(...) sort order)
  • Configurable options for file sizes, thresholds, and behavior
  • Template-based configuration for Spark and Iceberg catalogs
  • Handles both Iceberg 1.6.1 (4 columns) and newer versions (5 columns)

Implementation:

  • IcebergRewriteDataFilesJob.java (335 lines)
    • Template name: builtin-iceberg-rewrite-data-files
    • Version: v1
    • Arguments: --catalog, --table, --strategy, --sort-order, --where, --options
    • Spark configs for runtime and Iceberg catalog setup
  • BuiltInJobTemplateProvider.java (modified)
    • Registered new IcebergRewriteDataFilesJob
  • build.gradle.kts (modified)
    • Added Iceberg Spark runtime dependency (1.6.1)
    • Added Spark, Scala, and Hadoop test dependencies
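
The order-independent named-argument parser described above could be sketched roughly as follows. This is an illustrative assumption, not the actual IcebergRewriteDataFilesJob code; the class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of an order-independent "--name value" parser; the real
// IcebergRewriteDataFilesJob implementation may differ in structure and naming.
public class RewriteArgsSketch {

  /** Parses pairs like ["--catalog", "c", "--table", "db.t"] into a map. */
  public static Map<String, String> parse(String[] args) {
    if (args.length % 2 != 0) {
      throw new IllegalArgumentException("Arguments must come in --name value pairs");
    }
    Map<String, String> parsed = new HashMap<>();
    for (int i = 0; i < args.length; i += 2) {
      String name = args[i];
      if (!name.startsWith("--")) {
        throw new IllegalArgumentException("Expected a --name argument, got: " + name);
      }
      parsed.put(name.substring(2), args[i + 1]);
    }
    // Per the template's argument list, --catalog and --table are required.
    if (!parsed.containsKey("catalog") || !parsed.containsKey("table")) {
      throw new IllegalArgumentException("--catalog and --table are required");
    }
    return parsed;
  }
}
```

Because the pairs are collected into a map, argument order does not matter, which matches the "order-independent" parsing covered by the tests below.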

Tests (41 tests, all passing):

  • TestIcebergRewriteDataFilesJob.java (33 tests, 429 lines)
    • Template structure validation
    • Argument parsing (required, optional, empty values, order-independent)
    • JSON options parsing (single, multiple, boolean, empty)
    • SQL generation (minimal, with strategy, sort, where, options, all params)
  • TestIcebergRewriteDataFilesJobWithSpark.java (8 tests, 229 lines)
    • Real Spark session integration tests
    • Executes actual Iceberg rewrite_data_files procedures
    • Validates data integrity after rewrite operations
    • Tests all parameter combinations with live Iceberg catalog

Usage Examples:

--catalog iceberg_prod --table db.sample

--catalog iceberg_prod --table db.sample --strategy sort \
  --sort-order 'id DESC NULLS LAST'

--catalog iceberg_prod --table db.sample --strategy sort \
  --sort-order 'zorder(user_id, event_type, timestamp)'

--catalog iceberg_prod --table db.sample --where 'year = 2024' \
  --options '{"min-input-files":"2","remove-dangling-deletes":"true"}'
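
Arguments like those above ultimately map onto Iceberg's rewrite_data_files stored procedure, invoked via Spark SQL as `CALL <catalog>.system.rewrite_data_files(...)`. A minimal sketch of how the SQL could be assembled, including the JSON options rendered as a map(...) argument; the class and method names are hypothetical and this is not the actual template code:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch of SQL assembly for Iceberg's rewrite_data_files
// procedure; the actual IcebergRewriteDataFilesJob may build this differently.
public class RewriteSqlSketch {

  // Builds e.g.:
  // CALL iceberg_prod.system.rewrite_data_files(table => 'db.sample',
  //   strategy => 'sort', sort_order => 'id DESC NULLS LAST',
  //   where => 'year = 2024', options => map('min-input-files', '2'))
  public static String buildCall(
      String catalog, String table, String strategy, String sortOrder,
      String where, Map<String, String> options) {
    StringJoiner args = new StringJoiner(", ");
    args.add("table => '" + table + "'");
    if (strategy != null) args.add("strategy => '" + strategy + "'");
    if (sortOrder != null) args.add("sort_order => '" + sortOrder + "'");
    if (where != null) args.add("where => '" + where + "'");
    if (options != null && !options.isEmpty()) {
      StringJoiner kv = new StringJoiner(", ");
      options.forEach((k, v) -> kv.add("'" + k + "', '" + v + "'"));
      args.add("options => map(" + kv + ")");
    }
    return "CALL " + catalog + ".system.rewrite_data_files(" + args + ")";
  }
}
```

Note that optional arguments are simply omitted from the CALL when absent, which is why a bare `--catalog ... --table ...` invocation works as the minimal case.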

Fix: #9543

jerryshao self-assigned this Dec 31, 2025
jerryshao requested a review from FANNG1, Dec 31, 2025 11:52
Copilot AI (Contributor) left a comment

Pull request overview

This PR implements a new built-in job template for optimizing Iceberg table data files through rewriting operations. The implementation provides a Spark-based job that calls Iceberg's native rewrite_data_files stored procedure with support for binpack, sort, and z-order optimization strategies.

Key Changes:

  • New IcebergRewriteDataFilesJob class providing template-based configuration and execution logic for Iceberg table optimization
  • Comprehensive test suite with 41 tests covering argument parsing, SQL generation, and end-to-end Spark integration
  • Build configuration updates to include Iceberg Spark runtime dependencies for both compilation and testing

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

Summary per file:

  • maintenance/jobs/src/main/java/org/apache/gravitino/maintenance/jobs/iceberg/IcebergRewriteDataFilesJob.java
    Core implementation of the Iceberg rewrite data files job with argument parsing, SQL generation, and Spark execution logic
  • maintenance/jobs/src/main/java/org/apache/gravitino/maintenance/jobs/BuiltInJobTemplateProvider.java
    Registers the new IcebergRewriteDataFilesJob in the built-in job template provider
  • maintenance/jobs/src/test/java/org/apache/gravitino/maintenance/jobs/iceberg/TestIcebergRewriteDataFilesJob.java
    Unit tests for template structure, argument parsing, JSON options parsing, and SQL generation
  • maintenance/jobs/src/test/java/org/apache/gravitino/maintenance/jobs/iceberg/TestIcebergRewriteDataFilesJobWithSpark.java
    Integration tests using real Spark session to validate generated SQL and procedure execution
  • maintenance/jobs/build.gradle.kts
    Adds Iceberg Spark runtime, Spark SQL, and Hadoop dependencies for compilation and testing

Referenced snippet (IcebergRewriteDataFilesJob.java javadoc):

 * specification --where <where_clause> Optional. Filter predicate --options
 * <options_json> Optional. JSON map of options
 *
 * <p>Example: --catalog iceberg_catalog --table db.sample --strategy binpack --options
Contributor:

could you add --where to the example?

jerryshao (Author):

You can check the usage to see how to use it.

Referenced snippet (usage help text):

+ " For columns: 'id DESC NULLS LAST, name ASC'\n"
+ " For Z-Order: 'zorder(c1,c2,c3)'\n"
+ " --where <predicate> Filter predicate to select files\n"
+ " Example: 'year = 2024 and month = 1'\n"
Contributor:

Could you add a str column to the where example?

jerryshao (Author):

Can you give me an example?

FANNG1 (Contributor) commented Jan 6, 2026

Besides providing the argument options in the main method, could you add documentation on how to pass the arguments (especially --where and --options) when submitting a rewrite job, since quote characters (' and ") may be transformed internally by the job system?

FANNG1 (Contributor) commented Jan 6, 2026

Another question is, do you plan to allow users to inject custom Spark configurations?

jerryshao (Author):

> Another question is, do you plan to allow users to inject custom Spark configurations?

I think this is valid, let me think of how to support it.

jerryshao requested a review from FANNG1, Jan 8, 2026 04:19
jerryshao (Author):

@FANNG1 please help to review again.

Merging this pull request may close: [FEATURE] Add an Iceberg rewrite datafile built-in job in Gravitino