---
highlighter: none
layout: guide
title: Machine learning/AI workflows on the HTC system
guide:
category: Special use cases
tag:
- htc
---

## Introduction

This guide provides some of our recommendations for success in running machine learning and AI workflows on the HTC system.

<p style="text-align:center"><img src="/images/researcher-to-ai-proficiency.png" alt="Flowchart illustrating three stages to getting AI workflows running on the HTC system: (1) Develop and test, (2) Distribute work, (3) Improve workflow" width=700px></p>

There are **three stages** to getting your ML/AI workflow running smoothly on the HTC system:

1. **Developing and testing** your code
1. **Distributing your work** on the HTC system
1. **Improving** your workflow

This guide will provide general recommendations for each stage.

{% capture content %}
- [Introduction](#introduction)
- [Develop and test](#develop-and-test)
* [Create your software environment](#create-your-software-environment)
- [Distribute your work](#distribute-your-work)
* [GPU versus CPU availability](#gpu-versus-cpu-availability)
* [Data size](#data-size)
* [Job length/duration](#job-lengthduration)
* [Test your workflow](#test-your-workflow)
* [Get more throughput by submitting multiple jobs](#get-more-throughput-by-submitting-multiple-jobs)
- [Improve your workflow](#improve-your-workflow)
- [Gotchas and tips](#gotchas-and-tips)
* [CUDA capability](#cuda-capability)
- [Related pages](#related-pages)
{% endcapture %}
{% include /components/directory.html title="Table of Contents" %}

## Develop and test

In this stage, you are developing your code and scripts to be used on the HTC system. You should aim to create a **minimally viable workflow** that can run on a local machine while **laying a foundation for distributed work** that will accomplish your science.

Take hyperparameter tuning, for example. Your desired workflow might look something like this:

<p style="text-align:center"><img src="/images/hyperparameter-tuning.png" alt="Flowchart illustrating a hyperparameter tuning workflow." width=500px></p>


Instead of developing your scripts to run the entire workflow as a single job, you should develop scripts for sections of your workflow. In our hyperparameter tuning example, we'll want to focus on developing scripts for the *training* step. Why?

If we develop a general training workflow, we can **reuse the same scripts to submit multiple training tasks**. This is an example of *high throughput computing*!
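
For example, a reusable training script might look like the following minimal sketch (the file name `train.py`, the toy model, and the hyperparameter names are hypothetical stand-ins for your own code, and PyTorch is assumed). Each job runs the same script with different command-line arguments:

```python
# train.py -- minimal sketch of a reusable training script. Each HTCondor
# job passes different hyperparameters on the command line, so one script
# serves every training task.
import argparse
import json

import torch
import torch.nn as nn

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, required=True)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--out", default="result.json")
    args = parser.parse_args()

    # Toy data and model; substitute your real dataset and architecture.
    x = torch.randn(256, 8)
    y = torch.randn(256, 1)
    model = nn.Linear(8, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=args.lr)
    loss_fn = nn.MSELoss()

    for _ in range(args.epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    # Write the result so HTCondor can transfer it back as an output file.
    with open(args.out, "w") as f:
        json.dump({"lr": args.lr, "epochs": args.epochs,
                   "final_loss": loss.item()}, f)

if __name__ == "__main__":
    main()
```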

Other steps, like pre-processing and post-processing, can be submitted as separate jobs or managed with [DAGMan](dagman-workflows).

### Create your software environment

Before we start anything, we need to consider our software environment. We recommend running all ML/AI workflows inside a [container](software-overview-htc) software environment.

Each software stack is different, so we encourage users to manage their own software environments and build their own containers.

Be aware that CHTC does not provide consultation services for code development.

> ### Write down your software environment
{:.tip-header}

> Write down what you need to run your software environment, including but not limited to:
> * Software (e.g., Python, R)
> * Packages (e.g., matplotlib, PyTorch, pandas)
> * Dependencies and libraries (e.g., specific versions of GLIBC, libxc)
> * Environment variables
> * CUDA library version (if applicable)
{:.tip}
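
As a sanity check, you can have each job record the environment it actually ran with. Below is a sketch assuming a PyTorch-based stack (the script name is hypothetical; swap in the packages you actually use):

```python
# record_env.py -- hypothetical helper that prints the environment a job
# actually ran with, which helps when writing down your requirements.
import platform
import sys

import torch  # assumes PyTorch is part of your software stack

print("python:", sys.version.split()[0])
print("platform:", platform.platform())
print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("GPU visible to this job:", torch.cuda.is_available())
```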

Next, you'll need to build your software environment in a new container or use an existing container with your environment.

#### Docker containers

Docker is a commonly used container engine, and many pre-built containers are distributed on Docker Hub.

To see how you can use Docker containers to run jobs in CHTC, see our guides:
* [Docker Jobs in CHTC](docker-jobs)
* [GPU/Machine Learning Job Examples on Github](https://github.com/CHTC/templates-GPUs)

You can also test and examine containers on your own computer:
* [Exploring and Testing Docker Containers](docker-test.html)

Some machine learning frameworks publish ready-to-go Docker images:
* [Tensorflow on Docker Hub](https://hub.docker.com/r/tensorflow/tensorflow) - the "Overview" on that page describes how to choose an image.
* [PyTorch on Docker Hub](https://hub.docker.com/r/pytorch/pytorch/tags) - we recommend choosing an image that ends in `-runtime`.
* [NVIDIA CUDA Docker images](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda/tags)

If you cannot find a Docker container with exactly the tools you need, you can build your
own, starting with one of the containers above. For instructions on how to build and
test your own Docker container, see this guide:

* [Building Docker Containers](docker-build.html)

#### Apptainer containers

CHTC also supports Apptainer, another container engine. We recommend that new users try Apptainer: unlike Docker, it is already installed on CHTC machines, so there is nothing extra to download.

* [Use and build Apptainer containers](apptainer-htc)
* [Additional considerations on building Apptainer containers](apptainer-build)
* [Software recipes](https://github.com/CHTC/recipes/tree/main/software)

#### Conda environments

The Python package manager conda is a popular tool for installing and managing machine learning tools. You can build a conda environment inside a Docker or Apptainer container.

[See our conda recipes here.](https://github.com/CHTC/recipes/tree/main/software/Conda)

## Distribute your work

After getting your software stack set up, the next step is to figure out how to effectively utilize the available resources to run your workflow. This section outlines available resources and considerations for your workflow.

### GPU versus CPU availability

CHTC has ~150 shared-use GPUs that are in high demand. This means that your jobs may take minutes to hours to start, depending on your resource requests. **The more resources (and the more specific the GPUs) you request, the longer your job may take to start.**

One alternative is to use CPUs. CHTC has over 1000 CPUs! When possible, use CPUs. In general, jobs that use only CPUs (fewer than 20 per job) start more quickly, and you can have more of them running simultaneously.

For certain calculations, GPUs provide a significant advantage, since some machine learning algorithms are optimized to run much faster on GPUs. Consider whether you would benefit from running one or two long-running calculations on a GPU or if your work is better suited to running many jobs on CHTC's available CPUs.

You may need to use different versions of your software, depending on whether or not you are using GPUs, as discussed in the [software section of this guide](#create-your-software-environment).

See our [GPU guide](gpu-jobs) for a summary of available GPUs.

### Data size

Most machine learning/AI workflows require large datasets. If your job uses input data greater than 1 GB, you will need to stage the data using our `/staging` file system or through ResearchDrive.

**If you have a large dataset that consists of lots of small files (<1 GB each)**, you should transfer the dataset as one large file (e.g., `.zip` or `.tar.gz`). Do not transfer a directory from `/staging` recursively. This ensures efficient file transfer and decreases the load on the system.
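
For example, Python's standard library can pack a directory of small files into a single archive before you stage it (a sketch; `dataset/` and the archive name are placeholder paths):

```python
# pack_dataset.py -- sketch of packing many small files into one archive.
import tarfile

# Pack the whole directory into a single compressed archive.
with tarfile.open("dataset.tar.gz", "w:gz") as tar:
    tar.add("dataset")

# Inside the job, unpack the archive next to your executable:
# with tarfile.open("dataset.tar.gz", "r:gz") as tar:
#     tar.extractall()
```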

Read our data transfer guides:

* [Use and transfer data in jobs on the HTC system](htc-job-file-transfer)
* [Managing Large Data in HTC Jobs](file-avail-largedata)
* [Directly transfer files between ResearchDrive and your jobs](htc-uwdf-researchdrive)

### Job length/duration

**For CPU-only jobs**, CHTC's default job length is 72 hours.

**For GPU jobs**, the default job length is `medium`, which runs for a maximum duration of 24 hours. Read our [GPU job guide](gpu-jobs) for more information about GPU job length.

{:.gtable}
| Job type | Maximum runtime | Per-user limitation |
| --- | --- | --- |
| Short | 12 hrs | 2/3 of CHTC GPU Lab GPUs |
| Medium | 24 hrs | 1/3 of CHTC GPU Lab GPUs |
| Long | 7 days | up to 4 GPUs in use |

If you need longer runtimes, consider implementing [checkpointing](checkpointing) in your jobs.

### Test your workflow

Before submitting your full workflow, **you should always test with a single job first**. This will allow you to catch bugs and errors in a manageable way, and it reduces the computational resources wasted on failed jobs.

While every workflow looks different, follow our guidelines for development and testing.

1. **Consider runtime, disk space, memory, and GPU needs.** The more resources (or the more specific the resources) you need, the fewer of these jobs you will be able to have running at a time. Consider developing your scripts to use fewer or more generalized resources.
1. **Test with a subset of your data.** Instead of using a large dataset (especially one >100 GB), use a smaller dataset to reduce resource usage and testing time.

### Get more throughput by submitting multiple jobs

Do you have the ability to break your work into many independent pieces (e.g., hyperparameter tuning, inference)? If so, you can take advantage of CHTC's capability to run many independent jobs at once, especially when each job is using a CPU. See [our guide for running multiple jobs](multiple-jobs.html), and the sketch below for one way to generate a parameter list.
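
One way to set this up is to write a file with one line per hyperparameter combination, then point your submit file's `queue` statement at it. The sketch below uses hypothetical file and variable names; a submit file could then use a statement like `queue lr,epochs from params.txt` to launch one job per line:

```python
# make_params.py -- sketch that writes one line per hyperparameter
# combination for HTCondor to iterate over.
from itertools import product

learning_rates = [0.1, 0.01, 0.001]
epoch_counts = [10, 50]

with open("params.txt", "w") as f:
    for lr, epochs in product(learning_rates, epoch_counts):
        f.write(f"{lr}, {epochs}\n")
```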

## Improve your workflow

Once you've gotten a subset of your workflow going, it's time to refine and improve it! This section highlights how to get more out of the available resources ("do more with less"), monitor and organize runs, and/or create ensembles of models.

* Use automated workflows to manage your tasks (e.g., using [DAGMan](dagman-workflows))
* Monitor your training with [Weights and Biases](https://wandb.ai/site/), but be careful not to leak your API key.
* Use [checkpointing](checkpointing) to shorten your jobs and [increase the number of jobs](#job-lengthduration) you can have running concurrently; this also makes your jobs resilient against eviction and machine issues. A minimal sketch follows this list.
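
Here is a minimal checkpointing sketch, assuming PyTorch (the file name `checkpoint.pt` and the per-epoch scheme are placeholders for whatever granularity suits your training loop):

```python
# Minimal checkpoint save/resume sketch, assuming PyTorch.
import os

import torch

CHECKPOINT = "checkpoint.pt"  # placeholder file name

def save_checkpoint(model, optimizer, epoch):
    # Save everything needed to resume training after an eviction.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, CHECKPOINT)

def load_checkpoint(model, optimizer):
    # Return the epoch to resume from (0 if no checkpoint exists yet).
    if not os.path.exists(CHECKPOINT):
        return 0
    state = torch.load(CHECKPOINT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1
```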

## Gotchas and tips

* Check your job logs, especially during testing, to understand your usage (memory, CPU, disk, GPU memory). Use these values to optimize your resource requests.
* If your job uses a conda environment but is not utilizing the GPU, check that you installed the GPU-specific build of `pytorch`/`tensorflow` (in PyTorch, `torch.cuda.is_available()` is a quick check).
* Unsure about something? [Ask for help or give us feedback!](get-help)

### CUDA capability

The CUDA capability is a number that corresponds to the computational capability of NVIDIA devices and loosely correlates with GPU generation. Each CUDA version supports only certain ranges of CUDA capability, so make sure the two are compatible. See [Wikipedia](https://en.wikipedia.org/wiki/CUDA#GPUs_supported) for a compatibility matrix.
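
If you use PyTorch, the following sketch reports the capability of the GPU your job landed on (assumes a CUDA-enabled PyTorch build; `nvidia-smi` or your framework's equivalent works too):

```python
# Sketch: report the CUDA compute capability of the assigned GPU.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print("device:", torch.cuda.get_device_name(0))
    print(f"compute capability: {major}.{minor}")
else:
    print("no CUDA device visible to this job")
```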

## Related pages

* [Use GPUs](gpu-jobs)
* [CHTC Recipes](https://github.com/CHTC/recipes/tree/main/software)
* [Checkpoint jobs](checkpointing)