Core: Move decision about remote scan planning from Spark to Core #15184
Conversation
protected CloseableIterable<ScanTask> doPlanFiles() {
  if (table() instanceof SupportsDistributedScanPlanning
      && !((SupportsDistributedScanPlanning) table()).allowDistributedPlanning()) {
    return table().newBatchScan().planFiles();
Actually, I don't think this is going to work like this, because the scan object here doesn't carry over any filter/projection/asOfTime/ref settings.
I agree, it's better to just have a marker to selectively disable distributed planning.
/** Marker interface to indicate whether a Table requires remote scan planning */
public interface RequiresRemoteScanPlanning {}
/** Marker interface to indicate whether a Table supports distributed scan planning */
minor: logically it's no longer a marker interface, in the sense that a Table implementing it isn't sufficient on its own; we also need to inspect the API below. How about explaining in the doc what distributed planning means (I believe it's easy to confuse distributed planning with remote planning), and then describing in the API below what true and false imply, mostly thinking from the POV of how
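To make the suggestion concrete, here is a hedged sketch of what such documentation might look like. The interface and method names come from the diff; the javadoc wording and the two demo classes are my assumptions, not code from the PR.

```java
/**
 * Sketch: mixin for tables that can opt out of distributed scan planning.
 *
 * <p>Distributed planning means the engine (e.g. Spark) spreads scan planning
 * work across executors; remote planning means planning is delegated to a
 * catalog server (e.g. a REST catalog). The two are easy to confuse.
 */
interface SupportsDistributedScanPlanning {
  /**
   * @return true if engines may plan this table's scans in a distributed
   *     fashion; false if planning must not be distributed (e.g. tables that
   *     are planned remotely by a REST catalog).
   */
  default boolean allowDistributedPlanning() {
    return true; // per the PR description, tables support it by default
  }
}

/** A table that keeps the default: distributed planning allowed. */
class DemoTable implements SupportsDistributedScanPlanning {}

/** A REST-style table that opts out, forcing non-distributed planning. */
class RestDemoTable implements SupportsDistributedScanPlanning {
  @Override
  public boolean allowDistributedPlanning() {
    return false;
  }
}
```

The default method captures the "opt-out" semantics: only tables that must not be distributed-planned need to override anything.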
} else if (table instanceof BaseTable && readConf.distributedPlanningEnabled()) {
  return new SparkDistributedDataScan(spark, table, readConf);
We would need to additionally check this as well:

((SupportsDistributedScanPlanning) table).allowDistributedPlanning()

Maybe we can restructure this whole if/else logic ^^

We might also need to update the docs for readConf.distributedPlanningEnabled(), in the sense that it is no longer a sufficient condition to enforce distributed planning? I'm mostly thinking from a custom BaseTable POV.
steveloughran
left a comment
Minor Java 17 language comments. These would seem a good place to use guarded switch statements.
private BatchScan newBatchScan() {
  if (table instanceof RequiresRemoteScanPlanning) {
    if (table instanceof SupportsDistributedScanPlanning
        && !((SupportsDistributedScanPlanning) table).allowDistributedPlanning()) {
You should be able to do this in one go with Java 14 pattern matching:
https://docs.oracle.com/en/java/javase/21/language/pattern-matching-instanceof.html#GUID-E8F57F2F-C14C-4822-9C70-7C76033D4331

if (table instanceof SupportsDistributedScanPlanning distributed
    && distributed.allowDistributedPlanning()) {
  ...
}

If you are really ambitious you could try guarded switch statements, which would be a lot more elegant.
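For illustration, here are both variants side by side: pattern-matching `instanceof` (standardized in Java 16) and the guarded `switch` the reviewer alludes to (pattern matching for switch, Java 21). The interface is a local stand-in, and `distributedDisallowed`/`planningMode` are hypothetical helper names, not methods from the PR.

```java
interface SupportsDistributedScanPlanning {
  boolean allowDistributedPlanning();
}

/** A table that vetoes distributed planning, for demonstration. */
class RemoteOnlyTable implements SupportsDistributedScanPlanning {
  @Override
  public boolean allowDistributedPlanning() {
    return false;
  }
}

final class PlanningCheck {
  /** Pattern-matching instanceof: test the type and bind the variable in one go. */
  static boolean distributedDisallowed(Object table) {
    return table instanceof SupportsDistributedScanPlanning distributed
        && !distributed.allowDistributedPlanning();
  }

  /** Guarded switch (Java 21): the `when` clause carries the boolean guard. */
  static String planningMode(Object table) {
    return switch (table) {
      case SupportsDistributedScanPlanning d when !d.allowDistributedPlanning() ->
          "remote";
      default -> "distributed";
    };
  }
}
```

The switch form scales better if more table capabilities get added later, since each case stays a single pattern-plus-guard line.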
Previously we introduced the RequiresRemoteScanPlanning marker interface for Spark to properly detect whether a table requires remote planning and thus skip all of the distributed planning in SparkDistributedDataScan. After talking to a few folks, it's probably better to move this decision out of Spark and into Core, hence I've renamed the marker interface to SupportsDistributedScanPlanning. By default, tables support distributed planning, which is then only overridden for RESTTable.