initial implementation of column reference detection in Jinja templates #149
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
@jayckaiser and I were discussing how to prune unused columns from
sourcesearly, to minimize earthmover's memory usage when processing wide data.This draft PR (not ready for merge, just for discussion!)
util.get_jinja_template_params(template_string, macros)which returns a list of parameters referenced by the Jinja templatedestination.templateandadd_columns/modify_columnstransformationoperationswhenconfig.log_level=DEBUGUsing this concept, I think we could (theoretically)
[$sources.nodeA:col1, $transformations.nodeB:col1, $transformations.nodeB:col2]) - this would require adding a method likeget_upstream_column_refs(downsteam_refs)to eachoperation, which would be called to backpropagate column references all the way to thesourceskeep_columns(or even only read columns, for source files types that support that) only the ref'd columnsOne issue to discuss is how to handle "special" references, like
__row_data__(which is used, among other places, by thestudent_idbundle). Options here includestudent_id(and possibly other bundles) to not use__row_data__(I'm not actually sure this is possible...)__row_data__is referenced, just don't do any column pruning (but this means any project that usesstudent_idwon't benefit from pruning)__row_data__are referenced (this sounds significantly more complicated)Another thing to discuss is whether this "early pruning" is even the best approach, or should we leave such optimizations to the user and instead focus on making
drop_columnsandkeep_columnsmore flexible, allowing wildcards or regex, so the user doesn't have to explicitly list all columns to keep/drop. (For example:)or, with regex,
Again, I'm opening this PR to show that parsing the params ref'd by a Jinja template is possible, and to open up discussion of whether we actually want earthmover to try to do that.