---
highlighter: none
layout: guide
title: Machine learning/AI workflows on the HTC system
guide:
category: Special use cases
tag:
- htc
---

## Introduction

This guide provides some of our recommendations for success in running machine learning and AI workflows on the HTC system.

<p style="text-align:center"><img src="/images/researcher-to-ai-proficiency.png" alt="Flowchart illustrating three stages to getting AI workflows running on the HTC system: (1) Develop and test, (2) Distribute work, (3) Improve workflow" width=700px></p>

There are **three stages** to getting your ML/AI workflow running smoothly on the HTC system:

1. **Developing and testing** your code
1. **Distributing your work** on the HTC system
1. **Improving** your workflow

This guide will provide general recommendations for each stage.

{% capture content %}
- [Introduction](#introduction)
- [Develop and test](#develop-and-test)
* [Create your software environment](#create-your-software-environment)
- [Distribute your work](#distribute-your-work)
* [GPU versus CPU availability](#gpu-versus-cpu-availability)
* [Data size](#data-size)
* [Job length/duration](#job-lengthduration)
* [Test your workflow](#test-your-workflow)
* [Get more throughput by submitting multiple jobs](#get-more-throughput-by-submitting-multiple-jobs)
- [Improve your workflow](#improve-your-workflow)
- [Gotchas and tips](#gotchas-and-tips)
* [CUDA capability](#cuda-capability)
- [Related pages](#related-pages)
{% endcapture %}
{% include /components/directory.html title="Table of Contents" %}

## Develop and test

In this stage, you are developing your code and scripts to be used on the HTC system. You should aim to create a **minimally viable workflow** that can run on a local machine while **laying a foundation for distributed work** that will accomplish your science.

Take hyperparameter tuning, for example. Your desired workflow might look something like this:

<p style="text-align:center"><img src="/images/hyperparameter-tuning.png" alt="Flowchart illustrating a hyperparameter tuning workflow." width=500px></p>


Instead of developing your scripts to run the entire workflow as a single job, you should develop scripts for sections of your workflow. In our hyperparameter tuning example, we'll want to focus on developing scripts for the *training* step. Why?

If we develop a general training workflow, we can **reuse the same scripts to submit multiple training tasks**. This is an example of *high throughput computing*!
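
For example, a reusable training script might look like the following minimal sketch (the file name `train.py`, the toy model, and the hyperparameter names are hypothetical stand-ins for your own code, and PyTorch is assumed). Each job runs the same script with different command-line arguments:

```python
# train.py -- minimal sketch of a reusable training script. Each HTCondor
# job passes different hyperparameters on the command line, so one script
# serves every training task.
import argparse
import json

import torch
import torch.nn as nn

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, required=True)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--out", default="result.json")
    args = parser.parse_args()

    # Toy data and model; substitute your real dataset and architecture.
    x = torch.randn(256, 8)
    y = torch.randn(256, 1)
    model = nn.Linear(8, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=args.lr)
    loss_fn = nn.MSELoss()

    for _ in range(args.epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    # Write the result so HTCondor can transfer it back as an output file.
    with open(args.out, "w") as f:
        json.dump({"lr": args.lr, "epochs": args.epochs,
                   "final_loss": loss.item()}, f)

if __name__ == "__main__":
    main()
```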

Other steps, like pre-processing and post-processing, can be submitted as separate jobs or managed with [DAGMan](dagman-workflows).

### Create your software environment

Before we start anything, we need to consider our software environment. We recommend running all ML/AI workflows inside a [container](software-overview-htc) software environment.

Each software stack is different, so we encourage users to manage their own software environments and build their own containers.

Be aware that CHTC does not provide consultation services for code development.

> ### Write down your software environment
{:.tip-header}

> Write down what you need to run your software environment, including but not limited to:
> * Software (e.g., Python, R)
> * Packages (e.g., matplotlib, PyTorch, pandas)
> * Dependencies and libraries (e.g., specific versions of GLIBC, libxc)
> * Environment variables
> * CUDA library version (if applicable)
{:.tip}
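
As a sanity check, you can have each job record the environment it actually ran with. Below is a sketch assuming a PyTorch-based stack (the script name is hypothetical; swap in the packages you actually use):

```python
# record_env.py -- hypothetical helper that prints the environment a job
# actually ran with, which helps when writing down your requirements.
import platform
import sys

import torch  # assumes PyTorch is part of your software stack

print("python:", sys.version.split()[0])
print("platform:", platform.platform())
print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("GPU visible to this job:", torch.cuda.is_available())
```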

Next, you'll need to build your software environment in a new container or use an existing container with your environment.

#### Docker containers

Docker is a commonly used container engine, and many pre-built containers are distributed on Docker Hub.

To see how you can use Docker containers to run jobs in CHTC, see our guides:
* [Docker Jobs in CHTC](docker-jobs)
* [GPU/Machine Learning Job Examples on Github](https://github.com/CHTC/templates-GPUs)

You can also test and examine containers on your own computer:
* [Exploring and Testing Docker Containers](docker-test.html)

Some machine learning frameworks publish ready-to-go Docker images:
* [Tensorflow on Docker Hub](https://hub.docker.com/r/tensorflow/tensorflow) - the "Overview" on that page describes how to choose an image.
* [PyTorch on Docker Hub](https://hub.docker.com/r/pytorch/pytorch/tags) - we recommend choosing an image that ends in `-runtime`.
* [NVIDIA CUDA Docker images](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda/tags)

If you cannot find a Docker container with exactly the tools you need, you can build your
own, starting with one of the containers above. For instructions on how to build and
test your own Docker container, see this guide:

* [Building Docker Containers](docker-build.html)

#### Apptainer containers

CHTC also supports Apptainer, another container engine. We recommend that new users try Apptainer: unlike Docker, it is already installed on CHTC machines, so there is nothing extra to download.

* [Use and build Apptainer containers](apptainer-htc)
* [Additional considerations on building Apptainer containers](apptainer-build)
* [Software recipes](https://github.com/CHTC/recipes/tree/main/software)

#### Conda environments

The Python package manager conda is a popular tool for installing and managing machine learning tools. You can build a conda environment inside a Docker or Apptainer container.

[See our conda recipes here.](https://github.com/CHTC/recipes/tree/main/software/Conda)

## Distribute your work

After getting your software stack set up, the next step is to figure out how to effectively utilize the available resources to run your workflow. This section outlines available resources and considerations for your workflow.

### GPU versus CPU availability

CHTC has ~150 shared-use GPUs that are in high demand. This means that your jobs may take minutes to hours to start, depending on your resource requests. **The more resources (and the more specific the GPUs) you request, the longer your job may take to start.**

One alternative is to use CPUs. CHTC has over 1000 CPUs! When possible, use CPUs. In general, jobs that use only CPUs (fewer than 20 per job) start more quickly, and you can have more of them running simultaneously.

For certain calculations, GPUs provide a significant advantage, since some machine learning algorithms are optimized to run much faster on GPUs. Consider whether you would benefit from running one or two long-running calculations on a GPU or if your work is better suited to running many jobs on CHTC's available CPUs.

You may need to use different versions of your software, depending on whether or not you are using GPUs, as discussed in the [software section of this guide](#create-your-software-environment).

See our [GPU guide](gpu-jobs) for a summary of available GPUs.

### Data size

Most machine learning/AI workflows require large datasets. If your job uses input data greater than 1 GB, you will need to stage the data using our `/staging` file system or through ResearchDrive.

**If you have a large dataset that consists of lots of small files (<1 GB each)**, you should transfer the dataset as one large file (e.g., `.zip` or `.tar.gz`). Do not transfer a directory from `/staging` recursively. This ensures efficient file transfer and decreases the load on the system.
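
For example, Python's standard library can pack a directory of small files into a single archive before you stage it (a sketch; `dataset/` and the archive name are placeholder paths):

```python
# pack_dataset.py -- sketch of packing many small files into one archive.
import tarfile

# Pack the whole directory into a single compressed archive.
with tarfile.open("dataset.tar.gz", "w:gz") as tar:
    tar.add("dataset")

# Inside the job, unpack the archive next to your executable:
# with tarfile.open("dataset.tar.gz", "r:gz") as tar:
#     tar.extractall()
```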

Read our data transfer guides:

* [Use and transfer data in jobs on the HTC system](htc-job-file-transfer)
* [Managing Large Data in HTC Jobs](file-avail-largedata)
* [Directly transfer files between ResearchDrive and your jobs](htc-uwdf-researchdrive)

### Job length/duration

**For CPU-only jobs**, CHTC's default job length is 72 hours.

**For GPU jobs**, the default job length is `medium`, which runs for a maximum duration of 24 hours. Read our [GPU job guide](gpu-jobs) for more information about GPU job length.

{:.gtable}
| Job type | Maximum runtime | Per-user limitation |
| --- | --- | --- |
| Short | 12 hrs | 2/3 of CHTC GPU Lab GPUs |
| Medium | 24 hrs | 1/3 of CHTC GPU Lab GPUs |
| Long | 7 days | up to 4 GPUs in use |

If you need longer runtimes, consider implementing [checkpointing](checkpointing) in your jobs.

### Test your workflow

Before submitting your full workflow, **you should always test with a single job first**. This will allow you to catch bugs and errors in a manageable way, and it reduces the computational resources wasted on failed jobs.

While every workflow looks different, follow our guidelines for development and testing.

1. **Consider runtime, disk space, memory, and GPU needs.** The more resources (or the more specific the resources) you need, the fewer of these jobs you will be able to have running at a time. Consider developing your scripts to use fewer or more generalized resources.
1. **Test with a subset of your data.** Instead of using a large dataset (especially one >100 GB), use a smaller dataset to reduce resource usage and testing time.

### Get more throughput by submitting multiple jobs

Do you have the ability to break your work into many independent pieces (e.g., hyperparameter tuning, inference)? If so, you can take advantage of CHTC's capability to run many independent jobs at once, especially when each job is using a CPU. See [our guide for running multiple jobs](multiple-jobs.html), and the sketch below for one way to generate a parameter list.
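
One way to set this up is to write a file with one line per hyperparameter combination, then point your submit file's `queue` statement at it. The sketch below uses hypothetical file and variable names; a submit file could then use a statement like `queue lr,epochs from params.txt` to launch one job per line:

```python
# make_params.py -- sketch that writes one line per hyperparameter
# combination for HTCondor to iterate over.
from itertools import product

learning_rates = [0.1, 0.01, 0.001]
epoch_counts = [10, 50]

with open("params.txt", "w") as f:
    for lr, epochs in product(learning_rates, epoch_counts):
        f.write(f"{lr}, {epochs}\n")
```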

## Improve your workflow

Once you've gotten a subset of your workflow going, it's time to refine and improve it! This section highlights how to get more out of the available resources ("do more with less"), monitor and organize runs, and/or create ensembles of models.

* Use automated workflows to manage your tasks (e.g., using [DAGMan](dagman-workflows))
* Monitor your training with [Weights and Biases](https://wandb.ai/site/), but be careful not to leak your API key.
* Use [checkpointing](checkpointing) to shorten your jobs and [increase the number of jobs](#job-lengthduration) you can have running concurrently; this also makes your jobs resilient against eviction and machine issues. A minimal sketch follows this list.
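
Here is a minimal checkpointing sketch, assuming PyTorch (the file name `checkpoint.pt` and the per-epoch scheme are placeholders for whatever granularity suits your training loop):

```python
# Minimal checkpoint save/resume sketch, assuming PyTorch.
import os

import torch

CHECKPOINT = "checkpoint.pt"  # placeholder file name

def save_checkpoint(model, optimizer, epoch):
    # Save everything needed to resume training after an eviction.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, CHECKPOINT)

def load_checkpoint(model, optimizer):
    # Return the epoch to resume from (0 if no checkpoint exists yet).
    if not os.path.exists(CHECKPOINT):
        return 0
    state = torch.load(CHECKPOINT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1
```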

## Gotchas and tips

* Check your job logs, especially during testing, to understand your usage (memory, CPU, disk, GPU memory). Use these values to optimize your resource requests.
* If your job uses a conda environment but is not utilizing the GPU, check that you installed the GPU-specific build of `pytorch`/`tensorflow` (in PyTorch, `torch.cuda.is_available()` is a quick check).
* Unsure about something? [Ask for help or give us feedback!](get-help)

### CUDA capability

The CUDA capability is a number that corresponds to the computational capability of NVIDIA devices and loosely correlates with GPU generation. Each CUDA version supports only certain ranges of CUDA capability, so make sure the two are compatible. See [Wikipedia](https://en.wikipedia.org/wiki/CUDA#GPUs_supported) for a compatibility matrix.
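
If you use PyTorch, the following sketch reports the capability of the GPU your job landed on (assumes a CUDA-enabled PyTorch build; `nvidia-smi` or your framework's equivalent works too):

```python
# Sketch: report the CUDA compute capability of the assigned GPU.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print("device:", torch.cuda.get_device_name(0))
    print(f"compute capability: {major}.{minor}")
else:
    print("no CUDA device visible to this job")
```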

## Related pages

* [Use GPUs](gpu-jobs)
* [CHTC Recipes](https://github.com/CHTC/recipes/tree/main/software)
* [Checkpoint jobs](checkpointing)