Scheduling Notebooks #46

@amit1rrr

Description
Problem

  • How can I quickly go from experimentation (.ipynb) to production (typically .py)?

Current Solution

The prevailing method of "productionizing" notebooks is:

  1. Convert the notebook to a Python script
  2. Clean up the script, write some tests, get a code review done
  3. Set up a cloud machine and install all the libraries (or dockerize the script)
  4. Run the script manually or set up a crontab
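To make step 1 concrete: converting is usually done with `jupyter nbconvert --to script`, but since a `.ipynb` file is just JSON, the core of that step can be sketched with the standard library alone. This is a minimal illustration, not a replacement for the real tool (which also handles magics, markdown, and output stripping):

```python
import json

def notebook_to_script(nb_json: str) -> str:
    """Extract the code cells of a notebook's JSON into a .py script body.

    A bare-bones stand-in for `jupyter nbconvert --to script`; real
    conversion also deals with cell magics, markdown cells, and metadata.
    """
    nb = json.loads(nb_json)
    chunks = ["".join(cell["source"])
              for cell in nb["cells"] if cell["cell_type"] == "code"]
    return "\n\n".join(chunks)

# A tiny notebook with one markdown cell and one code cell.
nb = json.dumps({"cells": [
    {"cell_type": "markdown", "source": ["# Title"]},
    {"cell_type": "code", "source": ["x = 1\n", "print(x)"]},
]})
print(notebook_to_script(nb))
```

Even this trivial sketch shows the friction: every experiment-to-production round trip repeats the extract/clean/review cycle.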

Is this really efficient?

Challenging the Status Quo

What if we ran notebooks directly in our production workflows? Here are some benefits:

  • Rich output for each execution (the notebook itself!)
  • Quickly go from experimentation to production; no time spent extracting code from .ipynb
  • Failed workflows are easy to debug (thanks to the rich notebook output)

Why do we really need to convert notebooks to Python scripts? Here are a few common objections (I'd love to learn more in the comments):

  • Code review - We can review notebooks directly with ReviewNB & nbdime (.py is not necessary).
  • Testing - We can write tests directly against notebook code with Treon and a few other tools (.py is not necessary here either).
  • Code reuse - This is a legitimate reason. You should definitely convert notebook code into libraries whenever possible; it makes reuse easy and keeps the notebook readable. But we don't need to convert the entire notebook into a script, do we? The final execution can simply be a notebook that imports the libraries we created.
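The code-reuse point above can be sketched as follows. The function name and data are hypothetical; the point is that the reviewable, testable logic lives in an importable module while the notebook shrinks to a thin driver:

```python
# Hypothetical library code extracted from the notebook
# (in practice this would live in its own module, e.g. mylib/metrics.py,
# and the notebook would `from mylib.metrics import weekly_revenue`).
def weekly_revenue(orders):
    """Sum order amounts; the logic worth testing and reviewing lives here."""
    return sum(o["amount"] for o in orders)

# The "production" notebook is then just a thin driver cell:
orders = [{"amount": 120.0}, {"amount": 80.0}]
print(weekly_revenue(orders))  # the executed notebook preserves the rich output
```

The library gets unit tests and code review; the notebook stays as the execution record.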

Proposed Solution

  • You select a notebook from a GitHub repo and set a schedule for it to run (once/daily/weekly etc.).
  • You select the instance type (memory, vCPU) for execution.
  • You can specify different parameters for each run via Papermill.
  • ReviewNB executes the notebook on your specified schedule & preserves the result of each run (as an executed notebook).
  • ReviewNB supports notebook workflows (parallel executions for different parameters, the result of one notebook feeding into the next, etc.).
  • For the environment, we use stable versions of commonly used DS libraries. Users can specify their own environment as well (via a Dockerfile).

FAQ

  • Can we run notebooks on our own hardware?
    Absolutely. You can self-host ReviewNB & hook it up to your own AWS/GCP account to execute notebooks on your own machines.

  • How will I specify sensitive data (e.g. DB credentials) required for execution?
    ReviewNB provides a prompt to set any sensitive data as environment variables that are made available to the notebook at runtime.
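Inside the notebook, reading those injected credentials is plain `os.environ` access. A hedged sketch (the variable names are hypothetical, not ReviewNB's actual convention):

```python
import os

# In practice the scheduler would inject this before the run; we set a
# placeholder here only so the sketch is self-contained.
os.environ.setdefault("DB_PASSWORD", "s3cret")

def db_credentials():
    """Read DB credentials from environment variables, failing loudly if absent."""
    password = os.environ.get("DB_PASSWORD")
    if password is None:
        raise RuntimeError("DB_PASSWORD not set for this run")
    return {"user": os.environ.get("DB_USER", "reporter"), "password": password}

print(sorted(db_credentials()))
```

Failing fast on a missing variable keeps a misconfigured scheduled run from silently producing an empty result notebook.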


Feel free to upvote/downvote the issue to indicate whether you think this is a useful feature. I also welcome additional questions/comments/discussion on the issue.
