Spark 4.1: Optimize ExpireSnapshotsSparkAction with manifest-level filtering #15154
Conversation
Pull request overview
This PR optimizes ExpireSnapshotsSparkAction by replacing driver-side collection with distributed Spark operations for manifest filtering. Instead of reading content files from all manifests in expired snapshots, the implementation now filters at the manifest level first using join-based operations, then reads content files only from orphaned manifests.
Changes:
- Added early exit paths when no snapshots are expired or no orphaned manifests exist
- Implemented distributed join-based filtering to identify orphaned manifests before reading their content files (sketched below)
- Refactored helper methods in `BaseSparkAction` to support the new distributed approach
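A minimal sketch of the manifest-level filtering idea, to make the flow concrete. Identifier names such as `manifestDS()`, `originalMetadata`, `expiredSnapshotIds`, and `tableBroadcast` are assumptions for illustration, not the PR's actual code:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// Hypothetical: build manifest datasets for expired and live snapshots.
Dataset<ManifestFileBean> expiredManifestDS = manifestDS(originalMetadata, expiredSnapshotIds);
Dataset<ManifestFileBean> liveManifestDS = manifestDS(updatedMetadata);

// Anti-join on path: manifests referenced only by expired snapshots are orphaned.
Dataset<String> orphanedManifestPaths =
    expiredManifestDS.select("path").as(Encoders.STRING())
        .except(liveManifestDS.select("path").as(Encoders.STRING()));

// Read content files only from orphaned manifests, instead of from every
// manifest reachable from the expired snapshots.
Dataset<ManifestFileBean> orphanedManifestDS =
    expiredManifestDS.join(orphanedManifestPaths.toDF("path"), "path")
        .as(Encoders.bean(ManifestFileBean.class));
Dataset<FileInfo> deleteCandidateFileDS =
    orphanedManifestDS.flatMap(new ReadManifest(tableBroadcast), Encoders.bean(FileInfo.class));
```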
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| ExpireSnapshotsSparkAction.java | Replaced driver-side collection logic with distributed Spark operations for manifest-level filtering and added contentFilesFromManifestDF() method |
| BaseSparkAction.java | Added emptyFileInfoDS() helper method and changed ReadManifest visibility to protected |
| TestExpireSnapshotsAction.java | Updated expected job count in testUseLocalIterator() test from 4 to 12 |
```diff
       .as(
           "Expected total number of jobs with stream-results should match the expected number")
-      .isEqualTo(4L);
+      .isEqualTo(12L);
```
Copilot AI commented on Jan 27, 2026:
The expected job count increased from 4 to 12 due to the new distributed operations. Consider adding a comment explaining why this specific count is expected, or add a test case that validates the optimization logic (e.g., verifying early exits when no orphaned manifests exist).
joyhaldar (author) replied:
Added a comment explaining the job count.
```java
Dataset<FileInfo> liveStats = statisticsFileDS(updatedTable, null);
Dataset<FileInfo> orphanedStats = expiredStats.except(liveStats);

if (orphanedManifestPaths.isEmpty()) {
```
Copilot AI commented on Jan 27, 2026:
Using isEmpty() on a Dataset triggers a Spark action that collects data to the driver. Consider using first() wrapped in a try-catch or take(1).length == 0 to avoid potentially expensive operations when checking if a dataset is empty.
Suggested change:

```diff
-if (orphanedManifestPaths.isEmpty()) {
+boolean hasOrphanedManifestPaths = orphanedManifestPaths.limit(1).toLocalIterator().hasNext();
+if (!hasOrphanedManifestPaths) {
```
joyhaldar (author) replied:
Thanks for the review. Dataset.isEmpty() uses limit(1) and executeTake(1); it only fetches a single row to check emptiness, not the full dataset.
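For reference, a small illustration of that equivalence, assuming `orphanedManifestPaths` is a `Dataset<String>` (not code from the PR):

```java
// Dataset.isEmpty() limits the plan to one row before checking, so both of
// these fetch at most a single row to the driver.
boolean viaIsEmpty = orphanedManifestPaths.isEmpty();
boolean viaLimit = !orphanedManifestPaths.limit(1).toLocalIterator().hasNext();
```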
```diff
 @@ -1200,10 +1200,12 @@ public void testUseLocalIterator() {

   checkExpirationResults(1L, 0L, 0L, 1L, 2L, results);
```
Alex (reviewer) commented:
Is it possible for you to write another test for this functionality?
joyhaldar (author) replied:
Thank you for your review, Alex. Sorry about that, I have added two tests for the optimization: `testEarlyExitWhenNoOrphanedManifests` and `testManifestReusedAcrossSnapshots`. Let me know if I have misunderstood your comment and if you were looking for something different.
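For context, a hypothetical shape such an early-exit test might take; the scenario and assertions are illustrative, not the PR's actual test code:

```java
import static org.assertj.core.api.Assertions.assertThat;

import org.apache.iceberg.actions.ExpireSnapshots;
import org.apache.iceberg.spark.actions.SparkActions;

// Hypothetical: expire in a state where every manifest is still referenced by
// a live snapshot, so no manifests are orphaned; the action should exit early
// and delete no data files or manifests.
ExpireSnapshots.Result result =
    SparkActions.get()
        .expireSnapshots(table)
        .expireOlderThan(System.currentTimeMillis())
        .execute();

assertThat(result.deletedDataFilesCount()).isEqualTo(0L);
assertThat(result.deletedManifestsCount()).isEqualTo(0L);
```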
Alex replied:
Thanks a lot! I'll take a look at this first thing tomorrow. The tests help make it easier for myself and others to digest what exactly the code should be doing.
joyhaldar replied:
Thank you, Alex. I have also added a References section to the PR description linking to the patterns in ReachableFileCleanup that this is based on. Please let me know if it helps with the review.
rambleraptor left a comment:
I'm not an expert on this area of the codebase, but the rough idea seems reasonable:
- Find list of orphaned manifest lists / stats
- Get list of files from there
```diff
   }

-  private static class ReadManifest implements FlatMapFunction<ManifestFileBean, FileInfo> {
+  protected static class ReadManifest implements FlatMapFunction<ManifestFileBean, FileInfo> {
```
Just making this protected seems fine, but I'd love to get another opinion here.
```diff
 Dataset<FileInfo> validFileDS = fileDS(updatedMetadata);

-// fetch files referenced by expired snapshots
+// find IDs of expired snapshots
```
Can you add some comments to break up these code sections? I think it helps to understand the flow of the code
amogh-jahagirdar left a comment:
Thanks @joyhaldar, it's still a bit unclear to me why the new changes significantly improve execution. If we look at how fileDS works and how Spark would execute the anti-join, I think we'd be implicitly covered. Do we have any numbers before/after this change, or any particular cases which are egregiously inefficient at the moment?
```diff
-// fetch files referenced by expired snapshots
+// find IDs of expired snapshots
 Set<Long> deletedSnapshotIds = findExpiredSnapshotIds(originalMetadata, updatedMetadata);
 Dataset<FileInfo> deleteCandidateFileDS = fileDS(originalMetadata, deletedSnapshotIds);
```
Hm, a lot of the cases called out in the PR description should be implicitly handled when you look at how fileDS works? e.g. we're creating the set of files from the set of manifests, and if there are no manifests it's already an empty set that we're doing the anti-join against.
Do we have any particular cases that we see improve after this change (if there are numbers that would be helpful)?
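For readers following the thread, a rough sketch of the pre-existing flow being referenced here; the method names come from the diff above, but the wiring is paraphrased rather than exact:

```java
// files still referenced by live metadata after expiration
Dataset<FileInfo> validFileDS = fileDS(updatedMetadata);

// candidate files referenced by the snapshots that were expired
Set<Long> deletedSnapshotIds = findExpiredSnapshotIds(originalMetadata, updatedMetadata);
Dataset<FileInfo> deleteCandidateFileDS = fileDS(originalMetadata, deletedSnapshotIds);

// the distributed anti-join in question: anything not still reachable from
// live metadata is orphaned and safe to delete
Dataset<FileInfo> orphanedFileDS = deleteCandidateFileDS.except(validFileDS);
```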
PR description:

This PR optimizes `ExpireSnapshotsSparkAction` by filtering at the manifest level first, then reading content files only from orphaned manifests. The approach is similar to `ReachableFileCleanup` but uses distributed Spark operations.

Changes:
- Added `contentFilesFromManifestDF()` to read content files from a filtered manifest DataFrame (the existing `contentFileDS()` only accepts snapshot IDs, not a filtered DataFrame)
- Added an `emptyFileInfoDS()` helper
- Changed `ReadManifest` to protected in `BaseSparkAction`

Before
After
Tests:
- `testEarlyExitWhenNoOrphanedManifests`
- `testManifestReusedAcrossSnapshots`

References:
- `ReachableFileCleanup.cleanFiles()` (iceberg/core/src/main/java/org/apache/iceberg/ReachableFileCleanup.java, lines 76 to 82 in 83653ba)
- `ReachableFileCleanup.pruneReferencedManifests()` (iceberg/core/src/main/java/org/apache/iceberg/ReachableFileCleanup.java, lines 107 to 140 in 83653ba)
- `ReachableFileCleanup.findFilesToDelete()` (iceberg/core/src/main/java/org/apache/iceberg/ReachableFileCleanup.java, lines 169 to 188 in 83653ba)