[#9543] feat(jobs): Add the built-in Iceberg rewrite data files job template to Gravitino #9588
This commit implements a built-in job template for rewriting Iceberg table
data files, which supports binpack, sort, and z-order strategies for table
optimization.
Key Features:
- Named argument parser supporting flexible parameter combinations
- Calls Iceberg's native rewrite_data_files stored procedure
- Supports all rewrite strategies: binpack, sort, z-order
- Configurable options for file sizes, thresholds, and behavior
- Template-based configuration for Spark and Iceberg catalogs
- Handles both Iceberg 1.6.1 (4 columns) and newer versions (5 columns)
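The version-tolerant result handling mentioned above could be keyed purely on row width. The following is a minimal Java sketch under that assumption; the class name, column meanings, and the label of the fifth column are illustrative, not taken from the PR:

```java
// Minimal sketch (not the PR's code) of tolerating both the 4-column result
// rows returned by rewrite_data_files in Iceberg 1.6.1 and the 5-column rows
// returned by newer versions, dispatching purely on row length.
public class RewriteResultRow {
  public static String summarize(Object[] row) {
    StringBuilder sb = new StringBuilder();
    // The first columns are assumed stable across both layouts in this sketch.
    sb.append("rewrittenFiles=").append(row[0]);
    sb.append(", addedFiles=").append(row[1]);
    if (row.length >= 5) {
      // A fifth column is assumed to exist only on newer Iceberg versions.
      sb.append(", extra=").append(row[4]);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(summarize(new Object[] {3, 1, 1024L, 0}));
    System.out.println(summarize(new Object[] {3, 1, 1024L, 0, 2}));
  }
}
```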
Implementation:
- IcebergRewriteDataFilesJob.java (335 lines)
  - Template name: builtin-iceberg-rewrite-data-files
  - Version: v1
  - Arguments: --catalog, --table, --strategy, --sort-order, --where, --options
  - Spark configs for runtime and Iceberg catalog setup
- BuiltInJobTemplateProvider.java (modified)
  - Registered the new IcebergRewriteDataFilesJob
- build.gradle.kts (modified)
  - Added Iceberg Spark runtime dependency (1.6.1)
  - Added Spark, Scala, and Hadoop test dependencies
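The order-independent named-argument parsing described above could look roughly like this minimal Java sketch; the class name and the empty-value convention are assumptions, not the PR's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of an order-independent "--key value" parser; names and
// conventions here are illustrative, not copied from the PR.
public class NamedArgs {
  public static Map<String, String> parse(String[] args) {
    Map<String, String> parsed = new HashMap<>();
    for (int i = 0; i < args.length; i++) {
      if (!args[i].startsWith("--")) {
        throw new IllegalArgumentException("Expected a named argument, got: " + args[i]);
      }
      String key = args[i].substring(2);
      // A flag followed by another flag (or by nothing) carries an empty value.
      String value = (i + 1 < args.length && !args[i + 1].startsWith("--")) ? args[++i] : "";
      parsed.put(key, value);
    }
    return parsed;
  }

  public static void main(String[] argv) {
    // Arguments may appear in any order; lookup happens by name.
    Map<String, String> a = parse(new String[] {"--table", "db.sample", "--catalog", "iceberg_prod"});
    System.out.println(a.get("catalog") + "/" + a.get("table"));
  }
}
```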
Tests (41 tests, all passing):
- TestIcebergRewriteDataFilesJob.java (33 tests, 429 lines)
  - Template structure validation
  - Argument parsing (required, optional, empty values, order-independent)
  - JSON options parsing (single, multiple, boolean, empty)
  - SQL generation (minimal, with strategy, sort, where, options, all params)
- TestIcebergRewriteDataFilesJobWithSpark.java (8 tests, 229 lines)
  - Real Spark session integration tests
  - Executes actual Iceberg rewrite_data_files procedures
  - Validates data integrity after rewrite operations
  - Tests all parameter combinations with live Iceberg catalog
Usage Examples:
--catalog iceberg_prod --table db.sample
--catalog iceberg_prod --table db.sample --strategy sort \
--sort-order 'id DESC NULLS LAST'
--catalog iceberg_prod --table db.sample --strategy sort \
--sort-order 'zorder(user_id, event_type, timestamp)'
--catalog iceberg_prod --table db.sample --where 'year = 2024' \
--options '{"min-input-files":"2","remove-dangling-deletes":"true"}'
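From arguments like the examples above, the job ultimately issues a `CALL` to Iceberg's `rewrite_data_files` stored procedure. A minimal Java sketch of how that SQL might be assembled follows; the class name, method shape, and exact formatting are assumptions rather than the PR's code (the sketch also skips SQL escaping for brevity):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch of building the Spark SQL CALL statement for Iceberg's
// rewrite_data_files procedure from parsed named arguments.
public class RewriteSqlBuilder {
  public static String build(
      String catalog, String table, String strategy, String sortOrder,
      String where, Map<String, String> options) {
    StringJoiner args = new StringJoiner(", ");
    args.add("table => '" + table + "'");
    if (strategy != null) args.add("strategy => '" + strategy + "'");
    if (sortOrder != null) args.add("sort_order => '" + sortOrder + "'");
    if (where != null) args.add("where => '" + where + "'");
    if (options != null && !options.isEmpty()) {
      // Iceberg's procedure takes options as a map of string pairs.
      StringJoiner kv = new StringJoiner(", ");
      options.forEach((k, v) -> kv.add("'" + k + "', '" + v + "'"));
      args.add("options => map(" + kv + ")");
    }
    return "CALL " + catalog + ".system.rewrite_data_files(" + args + ")";
  }

  public static void main(String[] argv) {
    Map<String, String> opts = new LinkedHashMap<>();
    opts.put("min-input-files", "2");
    System.out.println(build("iceberg_prod", "db.sample", "binpack", null, "year = 2024", opts));
  }
}
```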
Issue: apache#9543
Pull request overview
This PR implements a new built-in job template for optimizing Iceberg table data files through rewriting operations. The implementation provides a Spark-based job that calls Iceberg's native rewrite_data_files stored procedure with support for binpack, sort, and z-order optimization strategies.
Key Changes:
- New IcebergRewriteDataFilesJob class providing template-based configuration and execution logic for Iceberg table optimization
- Comprehensive test suite with 41 tests covering argument parsing, SQL generation, and end-to-end Spark integration
- Build configuration updates to include Iceberg Spark runtime dependencies for both compilation and testing
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| maintenance/jobs/src/main/java/org/apache/gravitino/maintenance/jobs/iceberg/IcebergRewriteDataFilesJob.java | Core implementation of the Iceberg rewrite data files job with argument parsing, SQL generation, and Spark execution logic |
| maintenance/jobs/src/main/java/org/apache/gravitino/maintenance/jobs/BuiltInJobTemplateProvider.java | Registers the new IcebergRewriteDataFilesJob in the built-in job template provider |
| maintenance/jobs/src/test/java/org/apache/gravitino/maintenance/jobs/iceberg/TestIcebergRewriteDataFilesJob.java | Unit tests for template structure, argument parsing, JSON options parsing, and SQL generation |
| maintenance/jobs/src/test/java/org/apache/gravitino/maintenance/jobs/iceberg/TestIcebergRewriteDataFilesJobWithSpark.java | Integration tests using a real Spark session to validate generated SQL and procedure execution |
| maintenance/jobs/build.gradle.kts | Adds Iceberg Spark runtime, Spark SQL, and Hadoop dependencies for compilation and testing |
> * specification --where <where_clause> Optional. Filter predicate --options
> * <options_json> Optional. JSON map of options
> *
> * <p>Example: --catalog iceberg_catalog --table db.sample --strategy binpack --options
could you add --where to the example?
You can check the usage to see how to use it.
> + " For columns: 'id DESC NULLS LAST, name ASC'\n"
> + " For Z-Order: 'zorder(c1,c2,c3)'\n"
> + " --where <predicate> Filter predicate to select files\n"
> + " Example: 'year = 2024 and month = 1'\n"
Could you add a str column to the where example?
Can you give me an example?
Besides providing the argument options in the main method, could you provide documentation on how to pass the arguments (especially for where and options) when submitting a rewrite job, since there may be…

Another question is, do you plan to allow users to inject custom Spark configurations?

I think this is valid, let me think of how to support it.

@FANNG1 please help to review again.
Fix: #9543