initial implementation of column reference detection in Jinja templates #149

tomreitz · 2025-01-15T17:09:55Z

@jayckaiser and I were discussing how to prune unused columns from sources early, to minimize earthmover's memory usage when processing wide data.

This draft PR (not ready for merge, just for discussion!):

- adds a function util.get_jinja_template_params(template_string, macros), which returns a list of the parameters referenced by a Jinja template
- logs the params used by a destination.template and by add_columns / modify_columns transformation operations when config.log_level=DEBUG
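
For discussion, here is a minimal sketch of how such a function could work using Jinja's own jinja2.meta.find_undeclared_variables. It is not necessarily the code in this PR; in particular, treating macros as a string of macro definitions that gets prepended to the template is an assumption.

```python
from jinja2 import Environment, meta


def get_jinja_template_params(template_string, macros=""):
    """Return a sorted list of the variables a template expects from its context.

    Assumption: `macros` is a string of Jinja macro definitions that is
    prepended to the template before it is parsed.
    """
    env = Environment()
    # Parse (but do not render) the macros plus the template into an AST.
    ast = env.parse(macros + template_string)
    # Names that are never assigned inside the template must come from the
    # render context; for earthmover these would be column names.
    return sorted(meta.find_undeclared_variables(ast))


# Illustrative inputs (hypothetical macro and column names).
macros = "{% macro shout(value) %}{{ value | upper }}{% endmacro %}"
template = '{"id": "{{ student_id }}", "name": "{{ shout(last_name) }}"}'
params = get_jinja_template_params(template, macros)
```

The macro name and its argument are declared inside the template, so only student_id and last_name are reported.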

Using this concept, I think we could (theoretically):

- at compile time, walk backwards through the graph (starting at the destinations) and, at each node, build a list of the upstream columns it references (like [$sources.nodeA:col1, $transformations.nodeB:col1, $transformations.nodeB:col2]). This would require adding a method like get_upstream_column_refs(downstream_refs) to each operation, which would be called to backpropagate column references all the way to the sources
- at run time, immediately apply keep_columns to each source so that only the referenced columns are retained (or, for source file types that support it, read only those columns in the first place)
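
To make the backpropagation idea concrete, here is a rough sketch on a toy graph. Everything in it is hypothetical: the Node class, its added / renamed fields, and the backpropagate helper are stand-ins for illustration, not earthmover's actual node or operation classes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Node:
    """Hypothetical stand-in for a graph node that wraps one operation."""
    source: Optional[str] = None  # name of the upstream node, None for a source
    # new column name -> columns referenced by the template that builds it
    added: Dict[str, Set[str]] = field(default_factory=dict)
    # old column name -> new column name
    renamed: Dict[str, str] = field(default_factory=dict)

    def get_upstream_column_refs(self, downstream_refs):
        """Translate columns needed downstream into columns needed upstream."""
        new_to_old = {new: old for old, new in self.renamed.items()}
        refs = set()
        for col in downstream_refs:
            if col in self.added:
                # A computed column needs whatever its template references.
                refs |= self.added[col]
            else:
                # Pass-through column, possibly under its pre-rename name.
                refs.add(new_to_old.get(col, col))
        return refs


def backpropagate(nodes, destination_refs):
    """Walk from the destinations back to the sources, collecting needed columns."""
    needed = {name: set(refs) for name, refs in destination_refs.items()}
    queue = list(destination_refs)
    while queue:
        name = queue.pop()
        node = nodes[name]
        if node.source is None:
            continue
        upstream = node.get_upstream_column_refs(needed[name])
        seen = needed.setdefault(node.source, set())
        if not upstream <= seen:
            # New requirements were found, so revisit the upstream node.
            seen |= upstream
            queue.append(node.source)
    return needed


nodes = {
    "students_source": Node(),
    "students_renamed": Node(source="students_source", renamed={"sid": "student_id"}),
    "students_named": Node(source="students_renamed",
                           added={"full_name": {"first", "last"}}),
    "students_dest": Node(source="students_named"),
}
# Columns the destination template references (e.g. from get_jinja_template_params).
needed = backpropagate(nodes, {"students_dest": {"student_id", "full_name"}})
```

With this result, needed["students_source"] is the set of columns the source would have to keep; every other column could be dropped at read time.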

One issue to discuss is how to handle "special" references like __row_data__ (which is used by, among other places, the student_id bundle). Options here include:

- try to refactor student_id (and possibly other bundles) to not use __row_data__ (I'm not actually sure this is possible...)
- if __row_data__ is referenced, just don't do any column pruning (but this means any project that uses student_id won't benefit from pruning)
- try to descend into the Jinja AST and unpack which elements of __row_data__ are referenced (this sounds significantly more complicated)
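
As a rough feasibility check for the third option, the sketch below walks the Jinja AST and collects the keys of __row_data__ that are accessed with a constant subscript or attribute. The helper name row_data_keys is hypothetical, and the sketch deliberately gives up (returns None, meaning "do not prune") whenever __row_data__ is used in any other way, such as being iterated, passed to a macro, or accessed through a method call or a dynamic key.

```python
from jinja2 import Environment, nodes


def row_data_keys(template_string, name="__row_data__"):
    """Return the sorted keys of `name` that the template reads, or None if
    some use of `name` cannot be resolved statically."""
    ast = Environment().parse(template_string)
    # Every occurrence of the special variable in the template.
    total = sum(1 for n in ast.find_all(nodes.Name) if n.name == name)
    # Attribute lookups that are really method calls, e.g. __row_data__.get(...)
    called = {id(c.node) for c in ast.find_all(nodes.Call)}

    keys = set()
    resolved = 0
    for n in ast.find_all((nodes.Getitem, nodes.Getattr)):
        target = n.node
        if not (isinstance(target, nodes.Name) and target.name == name):
            continue
        if isinstance(n, nodes.Getattr):
            if id(n) in called:
                continue  # method call: the key cannot be read off the attribute
            keys.add(n.attr)
            resolved += 1
        elif isinstance(n.arg, nodes.Const) and isinstance(n.arg.value, str):
            keys.add(n.arg.value)
            resolved += 1
    # Prune only if every occurrence was a statically known key access.
    return sorted(keys) if resolved == total else None


static_keys = row_data_keys("{{ __row_data__['first'] }} {{ __row_data__.last }}")
dynamic_use = row_data_keys("{% for k in __row_data__ %}{{ k }}{% endfor %}")
```

Whether this covers how the student_id bundle actually uses __row_data__ would need to be checked against the bundle itself; if the bundle iterates over the row, this approach would simply fall back to no pruning.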

Another thing to discuss is whether this "early pruning" is even the best approach. Alternatively, we could leave such optimizations to the user and instead focus on making drop_columns and keep_columns more flexible by allowing wildcards or regex, so the user doesn't have to explicitly list every column to keep or drop. For example, with wildcards:

transformations:
  my_wide_data:
    source: $sources.my_wide_source
    operations:
      - operation: keep_columns # or `drop_columns`
        columns:
          # patterns are quoted because an unquoted leading `*` starts a YAML alias
          - "*_values"      # leading wildcard, matches `my_values` and `your_values`
          - "our_*"         # trailing wildcard, matches `our_values` and `our_data`
          - "our_*_values"  # inner wildcard, matches `our_awesome_values` and `our_amazing_values`
          - "*our_*_values" # should we support multiple wildcards? would match `our_awesome_values` and `your_amazing_values`

or, with regex,

transformations:
  my_wide_data:
    source: $sources.my_wide_source
    operations:
      - operation: keep_columns # or `drop_columns`
        regex: True
        columns:
          - "^our.*values$" # regex matches `our_awesome_values` and `our_amazing_values`
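
Either variant would be cheap to implement with the standard library. Below is a minimal sketch, assuming a hypothetical helper select_columns that keep_columns / drop_columns could call to expand the configured patterns against the actual column names; wildcard matching uses fnmatch.fnmatchcase and the regex variant uses re.search.

```python
import re
from fnmatch import fnmatchcase


def select_columns(columns, patterns, regex=False):
    """Return the columns (in their original order) that match any pattern."""
    if regex:
        compiled = [re.compile(p) for p in patterns]
        return [c for c in columns if any(r.search(c) for r in compiled)]
    # fnmatchcase is case-sensitive and handles any number of `*` wildcards.
    return [c for c in columns if any(fnmatchcase(c, p) for p in patterns)]


columns = ["id", "my_values", "your_values", "our_values", "our_data",
           "our_awesome_values"]
inner = select_columns(columns, ["our_*_values"])
multi = select_columns(columns, ["*our_*_values"])
by_regex = select_columns(columns, ["^our.*values$"], regex=True)
```

One thing to note is that fnmatch patterns also treat ? and [seq] as special, so column names containing those characters would need escaping, and multiple wildcards come for free.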

Again, I'm opening this PR to show that parsing the params ref'd by a Jinja template is possible, and to open up discussion of whether we actually want earthmover to try to do that.

initial implementation of column reference detection in Jinja templates

3f889ab

tomreitz self-assigned this Jan 16, 2025
