SNOW-2367850: task integration example update#250

Open
sfc-gh-ajiang wants to merge 24 commits into main from ajiang_task_example_update

Conversation

@sfc-gh-ajiang (Collaborator):

No description provided.

@sfc-gh-dhung (Collaborator) left a comment:

Remember this is a public-facing sample, so please be sure the code quality is high. It's especially important for the code to be simple and readable, with self-documenting variable/function names and sufficient comments for non-experts to understand.

Comment on lines -147 to -149
```python
# NOTE: Remove `target_instances=2` to run training on a single node
# See https://docs.snowflake.com/en/developer-guide/snowflake-ml/ml-jobs/distributed-ml-jobs
@remote(COMPUTE_POOL, stage_name=JOB_STAGE, target_instances=2)
```
Collaborator:

One of the main points of this sample is to demonstrate how easy it is to convert a local pipeline to pushing certain steps down into ML Jobs. Needing to write a separate script file which we submit_file() just for this conversion severely weakens this story. Why can't we just keep using a @remote() decorated function? @remote(...) should convert the function into an MLJobDefinition which we can directly use in pipeline_dag without needing an explicit MLJobDefinition.register() call

Collaborator Author:

Currently, @remote does not create a job definition; it creates a job directly. We have only merged the PR for phase one, and phase two is still in review.

Collaborator:

Let's hold off on merging this until @remote is ready then

Collaborator:

Since the @remote change is now available, can we now call this as an ML Job directly from pipeline_dag?

Collaborator Author:

I am a little confused here. Do you mean we should create a job inside the task directly?

Collaborator:

Why do we need this as a separate file? Looks like it's only used in pipeline_dag currently

Collaborator Author:

I was thinking that pipeline_local.py and pipeline_dag.py should focus on orchestration logic, like creating jobs or tasks. Since this class is more about handling task configuration, it might make sense to move it into a separate file for better separation of concerns.

For now, I've reverted the changes.

```python
# before:
def train_model(session: Session, input_data: DataSource) -> XGBClassifier:

# after:
@remote(COMPUTE_POOL, stage_name=JOB_STAGE, target_instances=2)
def train_model(input_data: DataSource) -> Optional[str]:
```
Collaborator:

Can you explain why we return a string in the README?

Collaborator Author:

Sure, I will update it.

Comment on lines 201 to 203
```python
@remote(COMPUTE_POOL, stage_name=JOB_STAGE, target_instances=2)
def train_model(input_data: DataSource) -> Optional[str]:
    ...
```
Collaborator:

why repeat this?

Collaborator Author:

Oops, forgot to delete it. Will delete it.


```python
mv = register_model(session, model, model_name, version, train_ds, metrics)
# get model version from train model
```
Collaborator:

?

Collaborator Author:

That is because we always register the model, but we do not push it to production. The reason I do it like this is that I got this error (#250 (comment)) when saving the model to a file.

Collaborator Author:

The issue is fixed after using the data connector.

```python
from snowflake.snowpark import Session

import modeling
import pipeline_dag
```
Collaborator:

pipeline_local should not have a dependency on pipeline_dag. Ideally the two don't know about each other; if necessary, dag can depend on local.

Collaborator Author:

Sure, I'll update it. One thought is to add @remote to modeling.py. However, inside the job payload we rely on the run config, which would cause pipeline_dag.py to import modeling.py and modeling.py to import pipeline_dag.py. That introduces a circular import.

Collaborator Author:

If I add @remote to pipeline_local.py and pipeline_local.py imports pipeline_dag.py to use the run config, then pipeline_dag.py needs to import pipeline_local.py to use the train_model function. That introduces a circular import.

What do you think about moving the run config out of pipeline_dag.py into a separate file?
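For illustration, a standalone run-config module could look roughly like this (the RunConfig field names and the `from_dict` helper here are hypothetical stand-ins, not the sample's actual implementation; the point is just that a module with no other project imports breaks the circular dependency):

```python
# run_config.py (hypothetical standalone module): both pipeline_local.py
# and pipeline_dag.py could import this without importing each other.
from dataclasses import dataclass


@dataclass
class RunConfig:
    """Illustrative run configuration shared by both pipelines."""
    database: str
    schema: str
    table_name: str

    @classmethod
    def from_dict(cls, raw: dict) -> "RunConfig":
        # In the DAG pipeline this data would come from the task context;
        # here we accept a plain dict purely for illustration.
        return cls(
            database=raw["database"],
            schema=raw["schema"],
            table_name=raw["table_name"],
        )


config = RunConfig.from_dict(
    {"database": "ML_DB", "schema": "PUBLIC", "table_name": "TRAIN_DATA"}
)
print(config.table_name)  # TRAIN_DATA
```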

sfc-gh-dhung (Collaborator), Feb 6, 2026:

Sounds fine. It's also okay for both pipeline definitions to define their own @remote run_train_model functions, which just handle arguments and then pass them to modeling.train_model. In that case, each @remote function should only need a few lines of code (maybe 3-4 at most): the local pipeline just accepts args and passes them directly to modeling.train_model, while the DAG pipeline reads from RunConfig before calling modeling.train_model.
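A rough sketch of the two thin wrappers being suggested. The @remote decorators are elided because they need a live Snowflake connection, and the train_model body is a trivial stand-in; only the delegation shape is the point:

```python
# Shared training logic lives in one place (stand-in for modeling.train_model).
def train_model(table_name: str, target_column: str) -> str:
    return f"trained on {table_name} predicting {target_column}"


# pipeline_local.py: would be decorated with
# @remote(COMPUTE_POOL, stage_name=JOB_STAGE); just forwards its arguments.
def run_train_model_local(table_name: str, target_column: str) -> str:
    return train_model(table_name, target_column)


# pipeline_dag.py: would also be @remote-decorated; reads its arguments
# from the run config first, then delegates to the same shared function.
def run_train_model_dag(config: dict) -> str:
    return train_model(config["table_name"], config["target_column"])
```

Both wrappers stay at a few lines each, and neither pipeline module needs to import the other.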

Collaborator Author:

For pipeline_local.py, a few lines are enough. For pipeline_dag.py, we also need to run model evaluation inside the job, because a task's return value can only be a string; returning the model object itself is not supported.
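Since a task can only return a string, one way to hand the training result to a downstream task is to serialize it to JSON, roughly like this (the helper and field names are illustrative, not the sample's actual code):

```python
import json


def summarize_training(model_name: str, version: str, metrics: dict) -> str:
    """Pack the training result into a JSON string so it can be returned
    from a task and parsed by the next task in the DAG."""
    return json.dumps(
        {"model_name": model_name, "version": version, "metrics": metrics}
    )


payload = summarize_training("CHURN_MODEL", "V1", {"accuracy": 0.91})
result = json.loads(payload)
print(result["model_name"])  # CHURN_MODEL
```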

```python
session = session_builder.getOrCreate()
modeling.ensure_environment(session)
pipeline_dag._ensure_environment(session)
cp.register_pickle_by_value(pipeline_dag)
```
Collaborator:

why?

sfc-gh-ajiang (Collaborator Author), Feb 6, 2026:

Because it imports pipeline_dag.py to use train_model.

Collaborator:

why does that mean we need to pickle it?

sfc-gh-ajiang (Collaborator Author), Feb 10, 2026:

Currently, the jobs for pipeline_local.py and pipeline_dag.py are in different files. In pipeline_local.py, we only need to additionally pickle the modeling module.

```python
config = RunConfig.from_task_context(ctx)
dataset_info_dicts = json.loads(ctx.get_predecessor_return_value("PREPARE_DATA"))
except SnowparkSQLException:
    print("there is no predecessor return value, fallback to local mode")
```
Collaborator:

Make sure errors/warnings are meaningful to users who aren't already familiar with tasks and ML Jobs. In this case, "predecessor return value" and "local mode" are meaningless/unknown terms.
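One way to phrase that fallback so it reads clearly to someone who has never heard of task predecessors. The helper name and the message wording are illustrative, and the real code would catch SnowparkSQLException rather than check for an empty value:

```python
import json
from typing import Optional


def parse_upstream_result(raw: Optional[str]) -> dict:
    """Parse the JSON emitted by the data-preparation step, explaining
    the fallback in plain language when that output is missing."""
    if not raw:
        print(
            "No output found from the data-preparation step; "
            "using locally generated sample data instead."
        )
        return {}
    return json.loads(raw)


info = parse_upstream_result('{"table_name": "TRAIN_DATA"}')
print(info["table_name"])  # TRAIN_DATA
```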

Collaborator Author:

will update it

Collaborator Author:

Currently, the jobs for pipeline_local.py and pipeline_dag.py are in different files, so there is no need for warnings like this.

Collaborator:

why change?

Collaborator Author:

The evaluation is inside the ML Job because we cannot return the model directly from a task.

```python
    XGBEstimator,
    XGBScalingConfig,
)
all_cols = input_data_dc.to_pandas(limit=1).columns.tolist()
```
Collaborator:

Can you file a JIRA to add a DataConnector.columns API?
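Until such an API exists, the sample's workaround pulls a single row into pandas just to read the column names. A minimal illustration of the same idea with a plain DataFrame (the column names here are made up):

```python
import pandas as pd

# Stand-in for input_data_dc.to_pandas(limit=1): fetching one row is
# enough to discover the schema without materializing the whole dataset.
df = pd.DataFrame({"FEATURE_A": [1], "FEATURE_B": [2], "LABEL": [0]})
all_cols = df.head(1).columns.tolist()
print(all_cols)  # ['FEATURE_A', 'FEATURE_B', 'LABEL']
```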


Comment on lines +233 to +238
```python
# inside evaluate_model
if isinstance(model, Booster):
    dmatrix = DMatrix(X_test)
    actual = (model.predict(dmatrix) > 0.5).astype(int)
else:
    actual = model.predict(X_test)
```
Collaborator:

why do we need this?

Collaborator Author:

The model type returned is a Booster.
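For context: a raw xgboost Booster's predict returns probabilities for a binary-logistic objective rather than class labels (unlike the sklearn-style XGBClassifier), so the sample has to threshold them manually. The thresholding step reduces to something like this pure-Python stand-in for the numpy expression:

```python
def probabilities_to_labels(probs, threshold=0.5):
    """Convert Booster-style probability outputs to 0/1 class labels,
    mirroring `(model.predict(dmatrix) > 0.5).astype(int)` in the sample."""
    return [int(p > threshold) for p in probs]


print(probabilities_to_labels([0.1, 0.7, 0.5, 0.93]))  # [0, 1, 0, 1]
```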

```python
        session (Session): Snowflake session object
    """
    modeling.ensure_environment(session)
    cp.register_pickle_by_value(modeling)
```
Collaborator:

nit: why move?

Collaborator Author:

We still need to pickle this module. I just want to move all the module registrations into one place.
