Contributor

@nastra nastra commented Jan 30, 2026

Previously we introduced the RequiresRemoteScanPlanning marker interface so that Spark could detect whether a table requires remote planning and, if so, skip all of the distributed planning in SparkDistributedDataScan.
After talking to a few folks, it's probably better to move this decision out of Spark and into Core, so I've renamed the marker interface to SupportsDistributedScanPlanning. By default, tables support distributed planning; this is overridden only for RESTTable.

@nastra nastra force-pushed the requires-remote-planning-detection branch from 1bc1dd7 to a5076de Compare January 30, 2026 08:45
protected CloseableIterable<ScanTask> doPlanFiles() {
if (table() instanceof SupportsDistributedScanPlanning
&& !((SupportsDistributedScanPlanning) table()).allowDistributedPlanning()) {
return table().newBatchScan().planFiles();
Contributor Author

actually I don't think this is going to work like this, because the scan object here doesn't carry over any filter/projection/asOfTime/ref settings
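
To illustrate the concern, here is a minimal sketch. It assumes scans are refined immutably (each refinement returns a new scan object), so a scan created fresh from the table knows nothing about the refinements on the current one. `DemoScan`, `DemoTable` and `LostRefinementsDemo` are hypothetical stand-ins, not Iceberg classes:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a scan; each refinement returns a new immutable copy. */
final class DemoScan {
  private final List<String> filters;

  DemoScan(List<String> filters) {
    this.filters = List.copyOf(filters);
  }

  /** Returns a new scan with the extra filter; the receiver is left unchanged. */
  DemoScan filter(String expr) {
    List<String> copy = new ArrayList<>(filters);
    copy.add(expr);
    return new DemoScan(copy);
  }

  List<String> filters() {
    return filters;
  }
}

/** Hypothetical stand-in for a table that hands out unrefined scans. */
final class DemoTable {
  DemoScan newBatchScan() {
    return new DemoScan(List.of());
  }
}

final class LostRefinementsDemo {
  public static void main(String[] args) {
    DemoTable table = new DemoTable();
    DemoScan refined = table.newBatchScan().filter("id > 5");

    // Delegating to a brand-new scan silently drops the filter set on `refined`.
    DemoScan fresh = table.newBatchScan();

    System.out.println(refined.filters()); // prints [id > 5]
    System.out.println(fresh.filters());   // prints []
  }
}
```

The same would apply to projection, asOfTime and ref settings: anything configured on the current scan would have to be copied over explicitly before delegating.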

Contributor

I agree, it's better to just have a marker to selectively disable distributed planning.


/** Marker interface to indicate whether a Table requires remote scan planning */
public interface RequiresRemoteScanPlanning {}
/** Marker interface to indicate whether a Table supports distributed scan planning */
Contributor

minor: logically this is no longer a marker interface, in the sense that a Table implementing it is not sufficient; we also need to inspect the API below. How about we explain in the doc what distributed planning means (I believe it's easy to confuse distributed planning with remote planning), and then describe in the API below what true and false imply? I'm mostly thinking of how

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsPushDownLimit.java is modeled.
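
A rough sketch of what such documentation could look like. Only the interface name and `allowDistributedPlanning()` come from this PR; the default method, the wording of the Javadoc, and the `RegularTable` / `RemotePlannedTable` classes are assumptions for illustration:

```java
/**
 * Sketch only. "Distributed planning" is taken here to mean that the engine spreads
 * scan planning across its own cluster, while "remote planning" means that a
 * server produces the scan tasks on the table's behalf.
 */
interface SupportsDistributedScanPlanning {

  /**
   * Whether an engine may plan scans of this table in a distributed fashion.
   *
   * @return true if the engine is free to use its distributed planner; false if the
   *     engine must fall back to the table's own scan planning
   */
  default boolean allowDistributedPlanning() {
    return true; // assumed default: tables support distributed planning
  }
}

/** Hypothetical table that keeps the default. */
final class RegularTable implements SupportsDistributedScanPlanning {}

/** Hypothetical table that is planned remotely and therefore opts out. */
final class RemotePlannedTable implements SupportsDistributedScanPlanning {
  @Override
  public boolean allowDistributedPlanning() {
    return false;
  }
}
```

With this shape, implementing the interface alone says nothing; callers have to look at the returned value, which is the point of documenting both outcomes.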

Comment on lines 767 to 768
} else if (table instanceof BaseTable && readConf.distributedPlanningEnabled()) {
return new SparkDistributedDataScan(spark, table, readConf);
Contributor

we would additionally need to check this as well:

((SupportsDistributedScanPlanning) table).allowDistributedPlanning()

maybe we can restructure this whole if/else logic ^^

readConf.distributedPlanningEnabled()

we might need to update the docs for this as well, since it is no longer a sufficient condition to enforce distributed planning? I am mostly thinking from a custom BaseTable POV.
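
One possible way to restructure the check, as a sketch only. `BaseTable` and `OptOutTable` below are empty stand-ins, `confEnabled` stands in for `readConf.distributedPlanningEnabled()`, and `PlanningChoice.useDistributedPlanning` is a hypothetical helper, not code from this PR:

```java
/** Stand-in for the capability interface discussed in this PR. */
interface SupportsDistributedScanPlanning {
  boolean allowDistributedPlanning();
}

/** Empty stand-in for Iceberg's BaseTable. */
class BaseTable {}

/** Hypothetical custom BaseTable that opts out of distributed planning. */
class OptOutTable extends BaseTable implements SupportsDistributedScanPlanning {
  @Override
  public boolean allowDistributedPlanning() {
    return false;
  }
}

final class PlanningChoice {
  private PlanningChoice() {}

  /**
   * Distributed planning is used only if the table is a BaseTable, the read config
   * enables it, and the table does not opt out.
   */
  static boolean useDistributedPlanning(Object table, boolean confEnabled) {
    if (!(table instanceof BaseTable) || !confEnabled) {
      return false;
    }
    if (table instanceof SupportsDistributedScanPlanning) {
      return ((SupportsDistributedScanPlanning) table).allowDistributedPlanning();
    }
    return true; // tables that do not implement the interface keep the old behavior
  }
}
```

Under this sketch the config flag becomes a necessary but not sufficient condition, which is what the docs would need to say.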

Contributor

@steveloughran steveloughran left a comment

minor Java 17 language comments. These would seem a good place to use guarded switch statements.

private BatchScan newBatchScan() {
if (table instanceof RequiresRemoteScanPlanning) {
if (table instanceof SupportsDistributedScanPlanning
&& !((SupportsDistributedScanPlanning) table).allowDistributedPlanning()) {
Contributor

@steveloughran steveloughran Jan 30, 2026

you should be able to do this in one go with Java 14 pattern matching
https://docs.oracle.com/en/java/javase/21/language/pattern-matching-instanceof.html#GUID-E8F57F2F-C14C-4822-9C70-7C76033D4331

if (table instanceof SupportsDistributedScanPlanning distributed
    && !distributed.allowDistributedPlanning()) {
  ...
}

if you are really ambitious you could try guarded switch statements, which would be a lot more elegant
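
For reference, a guarded switch version might look roughly like the sketch below. It needs Java 21, where pattern matching for switch with `when` guards is final. The interface is a stand-in for the one in this PR; `GuardedSwitchSketch`, `planningMode`, the returned strings and the two table classes are hypothetical:

```java
/** Stand-in for the capability interface discussed in this PR. */
interface SupportsDistributedScanPlanning {
  boolean allowDistributedPlanning();
}

/** Hypothetical table that opts out of distributed planning. */
final class OptOutTable implements SupportsDistributedScanPlanning {
  @Override
  public boolean allowDistributedPlanning() {
    return false;
  }
}

/** Hypothetical table that explicitly allows distributed planning. */
final class OptInTable implements SupportsDistributedScanPlanning {
  @Override
  public boolean allowDistributedPlanning() {
    return true;
  }
}

final class GuardedSwitchSketch {
  private GuardedSwitchSketch() {}

  /** Returns a label for the planning path; a real caller would build the scan instead. */
  static String planningMode(Object table) {
    return switch (table) {
      // The guard runs only when the type pattern matches, so no cast is needed.
      case SupportsDistributedScanPlanning d when !d.allowDistributedPlanning() ->
          "table-planned";
      default -> "distributed";
    };
  }
}
```

Note that a pattern switch throws NullPointerException for a null selector unless a `case null` label is added.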
