-
Notifications
You must be signed in to change notification settings - Fork 5
adding a new show command, debug operation, and NoOpDestination #89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| nargs="?", | ||
| type=str, | ||
| help='the command to run: `run`, `compile`, `visualize`' | ||
| help='the command to run: `deps`, `compile`, `run`, `show`, `visualize`' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we actually have a visualize command, maybe that should be removed.
| parser.add_argument("--transpose", | ||
| action='store_true', | ||
| help='transposes the output of `earthmover show`' | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These three new CLI flags are only used by earthmover show.
| "results_file": "", | ||
| "function": "head", | ||
| "rows": 10, | ||
| "transpose": False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default values if the flags are omitted. I think these defaults make sense, but happy to discuss alternatives.
| active_graph = self.filter_graph_on_selector(self.graph, selector=f"{selector}_destination") | ||
| self.execute(active_graph) | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explaining what happens here; suppose you select the transformation node my_node (with earthmover show -s my_node):
- we create a transformation node
my_node_showwithsource=my_nodewhich contains just one debug operation (per the CLI inputs) - we connect the new transformation node
my_node_showto a new NoOpDestination nodemy_node_destinationso earthmover wouldn't prune off our dangling transformation node - we filter down the graph to just the selected
my_node(and upstream and downstream nodes) - we execute this subgraph
The result is that the debug operation will cause information about my_node to be output to the console.
| elif (type(config)==dict or type(config)==YamlMapping) and 'kind' not in config.keys(): | ||
| # default for backward compatibility | ||
| return object.__new__(FileDestination) | ||
| # else: throw an error? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I'm adding a new Destination class, I need a way to instantiate it... since type is already a property of the Node superclass, and class is a reserved Python keyword, I landed on kind as the property one can use (in a destination's config) to pick a specific destination.
In the future, we can extend this to other destination kinds, such as file.jsonl, file.csv, file.tsv, file.parquet, perhaps even database.snowflake or database.postgres (with additional column typing config).
The default destination kind (when unspecified by the user) is the existing FileDestination, for backward compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've started using extension as this keyword. Maybe we can set extension to "debug" in this new model, instead of adding a new keyword field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment here: I think overloading extension is not a good solution. A file's extension is not one-to-one with it's type.
| self.data = ( | ||
| self.upstream_sources[self.source].data | ||
| .map_partitions(lambda x: x.apply(self.render_row, axis=1), meta=pd.Series('str')) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new destination type does nothing, as the name suggests.
|
|
||
| self.error_handler.ctx.update( | ||
| file=self.config.__file__, line=self.config.__line__, node=self, operation=None | ||
| file=self.config.get("__file__",""), line=self.config.get("__line__",0), node=self, operation=None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since earthmover show injects nodes into the graph which didn't come from a file, I had to modify the context update in a few places like so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To circumvent needing to change these lines, we should just initialize the No-Op config block as a YamlMapping and set default values for the __file__ and __line__ attributes in the class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But what would those default values be? (when nodes are added programmatically, not from a YAML configuration file)
Can __file__ and __line__ be None? That's the only default value I think would make sense, but I don't know how it would come through in the logging messages, or if it would cause errors.
| Sort rows by one or more columns. | ||
| ```yaml | ||
| - operation: debug | ||
| function: head | tail | describe | columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the default function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There isn't currently one. What do you think? Does head make sense?
| self.earthmover.logger.info(f"debug ({self.func}{rows_str}{transpose_str}) for {transformation_name}:") | ||
|
|
||
| # call function and display debug info | ||
| if self.func == 'head': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A warning that .head() will raise a warning if the number of rows in the dataframe are less than those specified. We should emulate the head() behavior in the current debug logic above the conditional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this and it doesn't seem to... not sure why, or maybe that's configurable.
|
I'm converting this PR to a draft pending further discussion. The |
This PR adds two new features:
earthmover showcommand, likewise to be used for debuggingas well as a few under-the-hood changes such as a new NoOpDestination (earthmover prunes graph branches with no destination attached, so I had to create this in order to be able to programmatically add a debug operation for
earthmover showwithout actually materializing any file to disk).I will go through and do a self-review, adding some comments about various parts of the changed code.