@tomreitz
Collaborator

@jayckaiser and I were discussing how to prune unused columns from sources early, to minimize earthmover's memory usage when processing wide data.

This draft PR (not ready for merge, just for discussion!)

  • adds a function util.get_jinja_template_params(template_string, macros) which returns a list of parameters referenced by the Jinja template
  • logs the params used by a destination.template and add_columns/modify_columns transformation operations when config.log_level=DEBUG
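
For readers who haven't looked at the diff, here is a minimal sketch of how this kind of parameter extraction can be done with Jinja's own introspection helper, jinja2.meta.find_undeclared_variables. It is not necessarily the code in this PR: it assumes macros is a string of macro definitions that gets prepended to the template, and it returns a sorted list.

```python
from jinja2 import Environment, meta


def get_jinja_template_params(template_string, macros=""):
    """Return the sorted names a template expects from its render context.

    Assumption: `macros` is a string of `{% macro %}` definitions that is
    prepended to the template, so macro names and macro arguments are not
    reported as parameters.
    """
    env = Environment()
    ast = env.parse(macros + template_string)
    # find_undeclared_variables() returns the set of names that are looked up
    # from the context (not assigned in the template and not Jinja globals).
    return sorted(meta.find_undeclared_variables(ast))


macros = "{% macro full(first, last) %}{{ first }} {{ last }}{% endmacro %}"
template = "{{ full(first_name, last_name) }}{% if grade %} ({{ grade | upper }}){% endif %}"
params = get_jinja_template_params(template, macros)
print(params)
```

Note that this only reports top-level names, so a template that touches __row_data__ shows up as referencing just __row_data__.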

Using this concept, I think we could (theoretically)

  • at compile time, walk backwards through the graph (starting at destinations) and, at each node, build a list of the upstream columns it references (like [$sources.nodeA:col1, $transformations.nodeB:col1, $transformations.nodeB:col2]). This would require adding a method like get_upstream_column_refs(downstream_refs) to each operation, which would be called to backpropagate column references all the way to the sources
  • at run time, immediately apply keep_columns to retain only the referenced columns (or, for source file types that support it, read only those columns in the first place)
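
To make the backpropagation idea concrete, here is a rough sketch. Everything in it is hypothetical: the Node class, its derived mapping, and the backpropagate helper are stand-ins for illustration, not earthmover's actual graph or operation classes. It only models operations that add columns from templates and assumes every other column passes through unchanged.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical graph node; not earthmover's real class."""
    upstream: list[str] = field(default_factory=list)
    # Columns this node creates, mapped to the upstream columns each one
    # references (e.g. the params found in an add_columns template).
    derived: dict[str, set[str]] = field(default_factory=dict)

    def get_upstream_column_refs(self, downstream_refs: set[str]) -> set[str]:
        refs: set[str] = set()
        for col in downstream_refs:
            # A derived column is replaced by its inputs; any other column
            # is assumed to pass through unchanged from upstream.
            refs |= self.derived.get(col, {col})
        return refs


def backpropagate(nodes, name, refs, needed):
    """Record that `refs` are needed from node `name`, then recurse upstream."""
    needed[name] |= refs
    upstream_refs = nodes[name].get_upstream_column_refs(refs)
    for parent in nodes[name].upstream:
        # Conservative: every parent is asked for every upstream ref.
        backpropagate(nodes, parent, upstream_refs, needed)


nodes = {
    "$sources.students": Node(),
    "$transformations.named": Node(
        upstream=["$sources.students"],
        derived={"full_name": {"first_name", "last_name"}},
    ),
}
needed = defaultdict(set)
# Columns referenced by a destination template that reads from `named`.
backpropagate(nodes, "$transformations.named", {"student_id", "full_name"}, needed)
print(sorted(needed["$sources.students"]))
```

In this toy graph the source only needs first_name, last_name, and student_id, no matter how wide the file is. Operations like joins, renames, or group-bys would each need their own mapping logic.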

One issue to discuss is how to handle "special" references, like __row_data__ (which is used, among other places, by the student_id bundle). Options here include

  • try to refactor student_id (and possibly other bundles) to not use __row_data__ (I'm not actually sure this is possible...)
  • if __row_data__ is referenced, just don't do any column pruning (but this means any project that uses student_id won't benefit from pruning)
  • try to descend into the Jinja AST and unpack which elements of __row_data__ are referenced (this sounds significantly more complicated)
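
For a sense of what the third option might involve, here is a rough sketch that walks the Jinja AST looking for constant-key accesses on __row_data__. The helper name get_row_data_keys is made up for this example, and it deliberately gives up (returns None) whenever it sees a use it cannot resolve statically, which would fall back to the second option.

```python
from jinja2 import Environment, nodes

ROW_DATA = "__row_data__"


def get_row_data_keys(template_string):
    """Return the keys of __row_data__ a template reads, or None if unknown."""
    ast = Environment().parse(template_string)
    # Getattr nodes that are being called, e.g. `__row_data__.items()`, are
    # method calls rather than column lookups.
    called = {id(call.node) for call in ast.find_all(nodes.Call)}
    keys, resolved = set(), 0
    for access in ast.find_all((nodes.Getitem, nodes.Getattr)):
        target = access.node
        if not (isinstance(target, nodes.Name) and target.name == ROW_DATA):
            continue
        if isinstance(access, nodes.Getattr):
            if id(access) not in called:
                keys.add(access.attr)
                resolved += 1
        elif isinstance(access.arg, nodes.Const) and isinstance(access.arg.value, str):
            keys.add(access.arg.value)
            resolved += 1
    total = sum(1 for name in ast.find_all(nodes.Name) if name.name == ROW_DATA)
    # Any use we could not resolve (bare reference, dynamic key, method call)
    # means the whole row may be needed, so signal "do not prune".
    return keys if resolved == total else None


print(get_row_data_keys("{{ __row_data__['first_name'] }} {{ __row_data__.birth_date }}"))
print(get_row_data_keys("{% for k in cols %}{{ __row_data__[k] }}{% endfor %}"))
```

This does not follow __row_data__ through assignments or macro arguments, so a real version would need more care before it could be trusted for pruning.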

Another thing to discuss is whether this "early pruning" is even the best approach. Alternatively, we could leave such optimizations to the user and instead focus on making drop_columns and keep_columns more flexible by allowing wildcards or regex, so the user doesn't have to explicitly list every column to keep or drop. For example:

transformations:
  my_wide_data:
    source: $sources.my_wide_source
    operations:
      - operation: keep_columns # or `drop_columns`
        columns:
          # patterns are quoted because YAML treats an unquoted leading `*` as an alias
          - "*_values"      # leading wildcard, matches `my_values` and `your_values`
          - "our_*"         # trailing wildcard, matches `our_values` and `our_data`
          - "our_*_values"  # embedded wildcard, matches `our_awesome_values` and `our_amazing_values`
          - "*our_*_values" # should we support multiple wildcards? would match `our_awesome_values` and `your_amazing_values`

or, with regex,

transformations:
  my_wide_data:
    source: $sources.my_wide_source
    operations:
      - operation: keep_columns # or `drop_columns`
        regex: True
        columns:
          - "^our.*values$" # regex matches `our_awesome_values` and `our_amazing_values`
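
Either variant should be cheap to implement with the standard library. A minimal sketch, assuming a hypothetical helper select_columns (not an existing earthmover function) that resolves the configured patterns against the actual column names:

```python
import re
from fnmatch import fnmatchcase


def select_columns(all_columns, patterns, regex=False):
    """Return the columns matching any pattern, in their original order."""
    if regex:
        compiled = [re.compile(p) for p in patterns]
        return [c for c in all_columns if any(rx.search(c) for rx in compiled)]
    # fnmatchcase() is case-sensitive and supports multiple `*` wildcards;
    # it also gives `?` and `[...]` special meaning.
    return [c for c in all_columns if any(fnmatchcase(c, p) for p in patterns)]


cols = ["id", "my_values", "your_values", "our_values", "our_data",
        "our_awesome_values", "your_amazing_values"]
print(select_columns(cols, ["our_*_values"]))
print(select_columns(cols, ["^our.*values$"], regex=True))
```

keep_columns would keep the returned columns and drop_columns would drop them. Note that fnmatch handles multiple wildcards out of the box, so `*our_*_values` would work without extra effort.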

Again, I'm opening this PR to show that parsing the params ref'd by a Jinja template is possible, and to open up discussion of whether we actually want earthmover to try to do that.

@tomreitz tomreitz self-assigned this Jan 16, 2025