SNOW-2367850: task integration example update #250
sfc-gh-ajiang wants to merge 24 commits into main from
Conversation
sfc-gh-dhung left a comment
Remember this is a public-facing sample, so please be sure the code quality is high. It's especially important for the code to be simple and readable, with self-documenting variable/function names and sufficient comments for non-experts to understand.
```python
# NOTE: Remove `target_instances=2` to run training on a single node
# See https://docs.snowflake.com/en/developer-guide/snowflake-ml/ml-jobs/distributed-ml-jobs
@remote(COMPUTE_POOL, stage_name=JOB_STAGE, target_instances=2)
```
One of the main points of this sample is to demonstrate how easy it is to convert a local pipeline to one that pushes certain steps down into ML Jobs. Needing to write a separate script file that we submit_file() just for this conversion severely weakens this story. Why can't we just keep using a @remote() decorated function? @remote(...) should convert the function into an MLJobDefinition which we can use directly in pipeline_dag without needing an explicit MLJobDefinition.register() call.
That is because @remote currently does not create a job definition; it creates a job directly. We have only merged the PR for phase 1 so far, and phase 2 is in review.
Let's hold off on merging this until @remote is ready then
Since the @remote change is now available, can we now call this as an ML Job directly from pipeline_dag?
I am a little confused here. Do you mean we should create a job inside the task directly?
Why do we need this as a separate file? Looks like it's only used in pipeline_dag currently
I was thinking that pipeline_local.py and pipeline_dag.py should focus on orchestration logic, like creating jobs or tasks. Since this class is more about handling task configuration, it might make sense to move it into a separate file for better separation of concerns.
For now, I've reverted the changes.
```diff
 @remote(COMPUTE_POOL, stage_name=JOB_STAGE, target_instances=2)
-def train_model(session: Session, input_data: DataSource) -> XGBClassifier:
+def train_model(input_data: DataSource) -> Optional[str]:
```
Can you explain why we return a string in the README?
Sure, I will update it.
```python
@remote(COMPUTE_POOL, stage_name=JOB_STAGE, target_instances=2)
def train_model(input_data: DataSource) -> Optional[str]:
    ...
```
Oops, forgot to delete it. Will delete it.
```python
mv = register_model(session, model, model_name, version, train_ds, metrics)
# get model version from train model
```
That is because we always register the model, but we do not push it to production.
The reason I do it like this is that I got this error (#250 (comment)) when I saved the model to a file.
The issue is fixed after switching to the DataConnector.
```python
from snowflake.snowpark import Session

import modeling
import pipeline_dag
```
pipeline_local should not have a dependency on pipeline_dag. Ideally the two don't know about each other; if necessary, pipeline_dag can depend on pipeline_local.
Sure — I’ll update it. One thought is to add @remote to modeling.py. However, inside the job payload we rely on the run config, which would cause pipeline_dag.py to import modeling.py and modeling.py to import pipeline_dag.py. That introduces a circular import.
If I add @remote to pipeline_local.py and pipeline_local.py imports pipeline_dag.py to use the run config, then pipeline_dag.py needs to import pipeline_local.py to use the train_model function. That introduces a circular import.
What do you think about moving the run config out of pipeline_dag.py into a separate file?
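A minimal sketch of that refactor, with all module and field names hypothetical (they are not from the sample): pull the run configuration into its own module that imports neither pipeline, so pipeline_dag.py, pipeline_local.py, and modeling.py can all depend on it without a cycle.

```python
from dataclasses import dataclass

# run_config.py -- hypothetical standalone module. It imports neither
# pipeline_dag.py nor modeling.py, so depending on it creates no cycle.
@dataclass
class RunConfig:
    table_name: str
    target_col: str

# pipeline_dag.py would then do `from run_config import RunConfig`,
# and modeling.py can do the same without ever importing pipeline_dag.py.
config = RunConfig(table_name="TRAIN_DATA", target_col="LABEL")
```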
Sounds fine. It's also okay for both pipeline definitions to define their own @remote run_train_model functions, which just handle arguments then pass them to modeling.train_model. In this case, each @remote function should just have a few lines of code (maybe 3-4 at most); e.g. the local pipeline just accepts args and directly passes them to modeling.train_model, while the DAG pipeline reads from RunConfig before calling modeling.train_model
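The split described above can be sketched roughly as follows. The no-op `remote` below is a local stand-in for the real decorator from snowflake.ml.jobs so the example runs anywhere; the pool, stage, table, and column names are illustrative, not from the sample.

```python
# Local stand-in for the real @remote decorator from snowflake.ml.jobs,
# so this sketch runs without a Snowflake connection.
def remote(compute_pool, **_kwargs):
    def wrap(fn):
        return fn
    return wrap

# modeling.py: shared training logic, with no pipeline imports.
def train_model(table_name: str, target_col: str) -> str:
    # Placeholder for the real training; returns a description of the run.
    return f"trained on {table_name} predicting {target_col}"

# pipeline_local.py: a thin wrapper that passes arguments straight through.
@remote("MY_POOL", stage_name="MY_STAGE")
def run_train_model_local(table_name: str, target_col: str) -> str:
    return train_model(table_name, target_col)

# pipeline_dag.py: a thin wrapper that reads arguments from the run config first.
@remote("MY_POOL", stage_name="MY_STAGE")
def run_train_model_dag(config: dict) -> str:
    return train_model(config["table_name"], config["target_col"])
```

Each wrapper stays at a few lines, and all real logic lives in one place.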
For pipeline_local.py, a few lines are enough. For pipeline_dag.py, we also need to run the model evaluation inside the job, because a task's return value can only be a string; the model object itself cannot be returned.
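Since a task can only return a string, the handoff can be sketched like this (all names and metric values are illustrative, not from the sample): serialize the registered model version and metrics to a JSON string, and parse it back in the downstream task.

```python
import json

def evaluate_and_report() -> str:
    # Hypothetical task body: the model stays in the registry; only a
    # JSON string describing it crosses the task boundary.
    model_version = "FRAUD_MODEL/V3"          # stand-in registry reference
    metrics = {"accuracy": 0.91, "f1": 0.88}  # stand-in evaluation results
    return json.dumps({"model_version": model_version, "metrics": metrics})

def downstream_task(predecessor_return_value: str) -> dict:
    # The successor task parses the string back into structured data.
    return json.loads(predecessor_return_value)

result = downstream_task(evaluate_and_report())
```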
```python
session = session_builder.getOrCreate()
modeling.ensure_environment(session)
pipeline_dag._ensure_environment(session)
cp.register_pickle_by_value(pipeline_dag)
```
because it imports pipeline_dag.py to use train_model
why does that mean we need to pickle it?
Currently, the jobs for pipeline_local.py and pipeline_dag.py are in different files. In pipeline_local.py, we only need to additionally pickle the modeling module.
```python
    config = RunConfig.from_task_context(ctx)
    dataset_info_dicts = json.loads(ctx.get_predecessor_return_value("PREPARE_DATA"))
except SnowparkSQLException:
    print("there is no predecessor return value, fallback to local mode")
```
Make sure errors/warnings are meaningful to users who aren't already familiar with tasks and ML Jobs. In this case, "predecessor return value" and "local mode" are meaningless/unknown terms.
will update it
Currently, the jobs for pipeline_local.py and pipeline_dag.py are in different files, so there is no need for warnings like this.
The evaluation is inside the ML Job because we cannot return the model directly from a task.
```python
    XGBEstimator,
    XGBScalingConfig,
)
all_cols = input_data_dc.to_pandas(limit=1).columns.tolist()
```
Can you file a JIRA to add a DataConnector.columns API?
```python
# inside evaluate_model
if isinstance(model, Booster):
    dmatrix = DMatrix(X_test)
    actual = (model.predict(dmatrix) > 0.5).astype(int)
else:
    actual = model.predict(X_test)
```
why do we need this?
The model type returned by the distributed trainer is a raw xgboost Booster, which needs its input wrapped in a DMatrix before predict is called.
```python
        session (Session): Snowflake session object
    """
    modeling.ensure_environment(session)
    cp.register_pickle_by_value(modeling)
```
We still need to pickle this module. I just want to move all the module registrations into one place.