Core: Move decision about remote scan planning from Spark to Core #15184
Conversation
protected CloseableIterable<ScanTask> doPlanFiles() {
  if (table() instanceof SupportsDistributedScanPlanning
      && !((SupportsDistributedScanPlanning) table()).allowDistributedPlanning()) {
    return table().newBatchScan().planFiles();
Actually, I don't think this is going to work like this, because the scan object here doesn't carry over any filter/projection/asOfTime/ref settings.
I agree, it's better to just have a marker to selectively disable distributed planning.
/** Marker interface to indicate whether a Table requires remote scan planning */
public interface RequiresRemoteScanPlanning {}
/** Marker interface to indicate whether a Table supports distributed scan planning */
minor: logically it's no longer a marker interface, in the sense that a Table implementing it isn't sufficient on its own; we also need to inspect the API below. How about explaining in the doc what distributed planning means (I believe it's easy to confuse distributed planning with remote planning), and then describing in the API below what true and false imply, mostly thinking from the POV of how
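To make the suggestion concrete, here is a hedged sketch of what such documentation might look like. The interface and method names come from the diff; the javadoc wording and the two demo classes are my assumptions, not code from the PR.

```java
/**
 * Sketch: mixin for tables that can opt out of distributed scan planning.
 *
 * <p>Distributed planning means the engine (e.g. Spark) spreads scan planning
 * work across executors; remote planning means planning is delegated to a
 * catalog server (e.g. a REST catalog). The two are easy to confuse.
 */
interface SupportsDistributedScanPlanning {
  /**
   * @return true if engines may plan this table's scans in a distributed
   *     fashion; false if planning must not be distributed (e.g. tables that
   *     are planned remotely by a REST catalog).
   */
  default boolean allowDistributedPlanning() {
    return true; // per the PR description, tables support it by default
  }
}

/** A table that keeps the default: distributed planning allowed. */
class DemoTable implements SupportsDistributedScanPlanning {}

/** A REST-style table that opts out, forcing non-distributed planning. */
class RestDemoTable implements SupportsDistributedScanPlanning {
  @Override
  public boolean allowDistributedPlanning() {
    return false;
  }
}
```

The default method captures the "opt-out" semantics: only tables that must not be distributed-planned need to override anything.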
} else if (table instanceof BaseTable && readConf.distributedPlanningEnabled()) {
  return new SparkDistributedDataScan(spark, table, readConf);
We would need to additionally check this as well:

((SupportsDistributedScanPlanning) table).allowDistributedPlanning()

Maybe we can restructure this whole if/else logic ^^

We might also need to update the docs for readConf.distributedPlanningEnabled(), in the sense that it is no longer a sufficient condition to enforce distributed planning? I'm mostly thinking from a custom BaseTable POV.
steveloughran
left a comment
Minor Java 17 language comments. These would seem a good place to use guarded switch statements.
private BatchScan newBatchScan() {
  if (table instanceof RequiresRemoteScanPlanning) {
    if (table instanceof SupportsDistributedScanPlanning
        && !((SupportsDistributedScanPlanning) table).allowDistributedPlanning()) {
You should be able to do this in one go with Java 14 pattern matching:
https://docs.oracle.com/en/java/javase/21/language/pattern-matching-instanceof.html#GUID-E8F57F2F-C14C-4822-9C70-7C76033D4331

if (table instanceof SupportsDistributedScanPlanning distributed
    && distributed.allowDistributedPlanning()) {
  ...
}

If you are really ambitious you could try guarded switch statements, which would be a lot more elegant.
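For illustration, here are both variants side by side: pattern-matching `instanceof` (standardized in Java 16) and the guarded `switch` the reviewer alludes to (pattern matching for switch, Java 21). The interface is a local stand-in, and `distributedDisallowed`/`planningMode` are hypothetical helper names, not methods from the PR.

```java
interface SupportsDistributedScanPlanning {
  boolean allowDistributedPlanning();
}

/** A table that vetoes distributed planning, for demonstration. */
class RemoteOnlyTable implements SupportsDistributedScanPlanning {
  @Override
  public boolean allowDistributedPlanning() {
    return false;
  }
}

final class PlanningCheck {
  /** Pattern-matching instanceof: test the type and bind the variable in one go. */
  static boolean distributedDisallowed(Object table) {
    return table instanceof SupportsDistributedScanPlanning distributed
        && !distributed.allowDistributedPlanning();
  }

  /** Guarded switch (Java 21): the `when` clause carries the boolean guard. */
  static String planningMode(Object table) {
    return switch (table) {
      case SupportsDistributedScanPlanning d when !d.allowDistributedPlanning() ->
          "remote";
      default -> "distributed";
    };
  }
}
```

The switch form scales better if more table capabilities get added later, since each case stays a single pattern-plus-guard line.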
Previously we introduced the RequiresRemoteScanPlanning marker interface for Spark to properly detect whether a table requires remote planning and thus skip all of the distributed planning in SparkDistributedDataScan. After talking to a few folks, it's probably better to move this decision out of Spark and into Core, hence I've renamed the marker interface to SupportsDistributedScanPlanning. By default, tables support distributed planning, which is then only overridden for RESTTable.