[#9543] feat(jobs): Add the built-in Iceberg rewrite data files job template to Gravitino #9588
This commit implements a built-in job template for rewriting Iceberg table
data files, which supports binpack, sort, and z-order strategies for table
optimization.
Key Features:
- Named argument parser supporting flexible parameter combinations
- Calls Iceberg's native rewrite_data_files stored procedure
- Supports all rewrite strategies: binpack, sort, z-order
- Configurable options for file sizes, thresholds, and behavior
- Template-based configuration for Spark and Iceberg catalogs
- Handles both Iceberg 1.6.1 (4 columns) and newer versions (5 columns)
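The version-tolerant result handling mentioned above could be keyed purely on row width. The following is a minimal Java sketch under that assumption; the class name, column meanings, and the label of the fifth column are illustrative, not taken from the PR:

```java
// Minimal sketch (not the PR's code) of tolerating both the 4-column result
// rows returned by rewrite_data_files in Iceberg 1.6.1 and the 5-column rows
// returned by newer versions, dispatching purely on row length.
public class RewriteResultRow {
  public static String summarize(Object[] row) {
    StringBuilder sb = new StringBuilder();
    // The first columns are assumed stable across both layouts in this sketch.
    sb.append("rewrittenFiles=").append(row[0]);
    sb.append(", addedFiles=").append(row[1]);
    if (row.length >= 5) {
      // A fifth column is assumed to exist only on newer Iceberg versions.
      sb.append(", extra=").append(row[4]);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(summarize(new Object[] {3, 1, 1024L, 0}));
    System.out.println(summarize(new Object[] {3, 1, 1024L, 0, 2}));
  }
}
```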
Implementation:
- IcebergRewriteDataFilesJob.java (335 lines)
  - Template name: builtin-iceberg-rewrite-data-files
  - Version: v1
  - Arguments: --catalog, --table, --strategy, --sort-order, --where, --options
  - Spark configs for runtime and Iceberg catalog setup
- BuiltInJobTemplateProvider.java (modified)
  - Registered the new IcebergRewriteDataFilesJob
- build.gradle.kts (modified)
  - Added Iceberg Spark runtime dependency (1.6.1)
  - Added Spark, Scala, and Hadoop test dependencies
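The order-independent named-argument parsing described above could look roughly like this minimal Java sketch; the class name and the empty-value convention are assumptions, not the PR's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of an order-independent "--key value" parser; names and
// conventions here are illustrative, not copied from the PR.
public class NamedArgs {
  public static Map<String, String> parse(String[] args) {
    Map<String, String> parsed = new HashMap<>();
    for (int i = 0; i < args.length; i++) {
      if (!args[i].startsWith("--")) {
        throw new IllegalArgumentException("Expected a named argument, got: " + args[i]);
      }
      String key = args[i].substring(2);
      // A flag followed by another flag (or by nothing) carries an empty value.
      String value = (i + 1 < args.length && !args[i + 1].startsWith("--")) ? args[++i] : "";
      parsed.put(key, value);
    }
    return parsed;
  }

  public static void main(String[] argv) {
    // Arguments may appear in any order; lookup happens by name.
    Map<String, String> a = parse(new String[] {"--table", "db.sample", "--catalog", "iceberg_prod"});
    System.out.println(a.get("catalog") + "/" + a.get("table"));
  }
}
```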
Tests (41 tests, all passing):
- TestIcebergRewriteDataFilesJob.java (33 tests, 429 lines)
  - Template structure validation
  - Argument parsing (required, optional, empty values, order-independent)
  - JSON options parsing (single, multiple, boolean, empty)
  - SQL generation (minimal, with strategy, sort, where, options, all params)
- TestIcebergRewriteDataFilesJobWithSpark.java (8 tests, 229 lines)
  - Real Spark session integration tests
  - Executes actual Iceberg rewrite_data_files procedures
  - Validates data integrity after rewrite operations
  - Tests all parameter combinations with live Iceberg catalog
Usage Examples:
--catalog iceberg_prod --table db.sample
--catalog iceberg_prod --table db.sample --strategy sort \
--sort-order 'id DESC NULLS LAST'
--catalog iceberg_prod --table db.sample --strategy sort \
--sort-order 'zorder(user_id, event_type, timestamp)'
--catalog iceberg_prod --table db.sample --where 'year = 2024' \
--options '{"min-input-files":"2","remove-dangling-deletes":"true"}'
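From arguments like the examples above, the job ultimately issues a `CALL` to Iceberg's `rewrite_data_files` stored procedure. A minimal Java sketch of how that SQL might be assembled follows; the class name, method shape, and exact formatting are assumptions rather than the PR's code (the sketch also skips SQL escaping for brevity):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch of building the Spark SQL CALL statement for Iceberg's
// rewrite_data_files procedure from parsed named arguments.
public class RewriteSqlBuilder {
  public static String build(
      String catalog, String table, String strategy, String sortOrder,
      String where, Map<String, String> options) {
    StringJoiner args = new StringJoiner(", ");
    args.add("table => '" + table + "'");
    if (strategy != null) args.add("strategy => '" + strategy + "'");
    if (sortOrder != null) args.add("sort_order => '" + sortOrder + "'");
    if (where != null) args.add("where => '" + where + "'");
    if (options != null && !options.isEmpty()) {
      // Iceberg's procedure takes options as a map of string pairs.
      StringJoiner kv = new StringJoiner(", ");
      options.forEach((k, v) -> kv.add("'" + k + "', '" + v + "'"));
      args.add("options => map(" + kv + ")");
    }
    return "CALL " + catalog + ".system.rewrite_data_files(" + args + ")";
  }

  public static void main(String[] argv) {
    Map<String, String> opts = new LinkedHashMap<>();
    opts.put("min-input-files", "2");
    System.out.println(build("iceberg_prod", "db.sample", "binpack", null, "year = 2024", opts));
  }
}
```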
Issue: apache#9543
Pull request overview
This PR implements a new built-in job template for optimizing Iceberg table data files through rewriting operations. The implementation provides a Spark-based job that calls Iceberg's native rewrite_data_files stored procedure with support for binpack, sort, and z-order optimization strategies.
Key Changes:
- New IcebergRewriteDataFilesJob class providing template-based configuration and execution logic for Iceberg table optimization
- Comprehensive test suite with 41 tests covering argument parsing, SQL generation, and end-to-end Spark integration
- Build configuration updates to include Iceberg Spark runtime dependencies for both compilation and testing
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| maintenance/jobs/src/main/java/org/apache/gravitino/maintenance/jobs/iceberg/IcebergRewriteDataFilesJob.java | Core implementation of the Iceberg rewrite data files job with argument parsing, SQL generation, and Spark execution logic |
| maintenance/jobs/src/main/java/org/apache/gravitino/maintenance/jobs/BuiltInJobTemplateProvider.java | Registers the new IcebergRewriteDataFilesJob in the built-in job template provider |
| maintenance/jobs/src/test/java/org/apache/gravitino/maintenance/jobs/iceberg/TestIcebergRewriteDataFilesJob.java | Unit tests for template structure, argument parsing, JSON options parsing, and SQL generation |
| maintenance/jobs/src/test/java/org/apache/gravitino/maintenance/jobs/iceberg/TestIcebergRewriteDataFilesJobWithSpark.java | Integration tests using a real Spark session to validate generated SQL and procedure execution |
| maintenance/jobs/build.gradle.kts | Adds Iceberg Spark runtime, Spark SQL, and Hadoop dependencies for compilation and testing |
> * specification --where <where_clause> Optional. Filter predicate --options
> * <options_json> Optional. JSON map of options
> *
> * <p>Example: --catalog iceberg_catalog --table db.sample --strategy binpack --options
could you add --where to the example?
You can check the usage to see how to use it.
> + " For columns: 'id DESC NULLS LAST, name ASC'\n"
> + " For Z-Order: 'zorder(c1,c2,c3)'\n"
> + " --where <predicate> Filter predicate to select files\n"
> + " Example: 'year = 2024 and month = 1'\n"
Could you add a str column to the where example?
Can you give me an example?
Besides providing the argument options in the main method, could you provide documentation on how to pass the arguments (especially for where and options) when submitting a rewrite job, since there may be…

Another question is, do you plan to allow users to inject custom Spark configurations?

I think this is valid, let me think of how to support it.

@FANNG1 please help to review again.
Fix: #9543