2 changes: 1 addition & 1 deletion README.md
@@ -16,7 +16,7 @@ request.
* Jupyter images for different versions of TensorFlow
* [TFServing](https://github.com/kubeflow/kubeflow/blob/master/user_guide.md#serve-a-model-using-tensorflow-serving) Docker images and K8s templates
- [kubernetes](kubernetes) - Templates for running distributed TensorFlow on
Kubernetes.
Kubernetes. For the most up-to-date examples, please also refer to the [distribution strategy](distribution_strategy) folder.
- [marathon](marathon) - Templates for running distributed TensorFlow using
Marathon, deployed on top of Mesos.
- [hadoop](hadoop) - TFRecord file InputFormat/OutputFormat for Hadoop MapReduce
227 changes: 227 additions & 0 deletions distribution_strategy/multi_worker_mirrored_strategy/README.md
@@ -0,0 +1,227 @@

# MultiWorkerMirroredStrategy training with examples

The steps below train models with [MultiWorkerMirroredStrategy](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy) using the TensorFlow 2.x API on Kubernetes.
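
For orientation, here is a minimal sketch of such a training program; the model and dataset below are illustrative placeholders, not code from the examples directory:

```python
import tensorflow as tf

# The strategy must be created early, before other TensorFlow ops run;
# cluster membership is read from the TF_CONFIG environment variable.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

# Variables created inside the scope are mirrored across all workers.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer="adam",
        metrics=["accuracy"],
    )

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

# model.fit distributes the batches across the workers automatically.
model.fit(x_train, y_train, epochs=3, batch_size=64)
```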

Reference programs [keras_mnist.py](examples/keras_mnist.py),
[custom_training_mnist.py](examples/custom_training_mnist.py), and [keras_resnet_cifar.py](examples/keras_resnet_cifar.py) are available in the examples directory.

The Kubernetes manifest templates and other cluster-specific configuration are available in the [kubernetes](kubernetes) directory.

## Prerequisites

1. (Optional) It is recommended that you have a Google Cloud project. Either create a new project or use an existing one. Install the
[gcloud command-line tools](https://cloud.google.com/functions/docs/quickstart)
on your system, log in, and set your project and zone.

2. [Jinja templates](http://jinja.pocoo.org/) must be installed.

3. A Kubernetes cluster running Kubernetes 1.15 or above must be available. To create a test
cluster on your local machine, [follow the steps here](https://kubernetes.io/docs/tutorials/kubernetes-basics/create-cluster/). Kubernetes clusters can also be created on all major cloud providers; for instance,
here are instructions to [create GKE clusters](https://cloud.google.com/kubernetes-engine/docs/how-to/creating-a-regional-cluster). Make sure that you have at least 12 GB of RAM across all nodes in the cluster. Following these steps should also install the `kubectl` tool on your system.

4. Set context for `kubectl` so that `kubectl` knows which cluster to use:

```bash
kubectl config use-context <cluster_name>
```

5. Install [Docker](https://docs.docker.com/get-docker/) for your system and create an account that you can associate with your container images.

6. For the MNIST examples, a [persistent-volume-claim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) needs to be available for model storage and checkpointing, mounted onto the chief worker pod. The steps below include the YAML to create a persistent-volume-claim for GKE backed by a GCE persistent disk. Only the chief worker writes the final model to this volume; a sketch of that pattern follows.
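
A minimal sketch of the chief-only save pattern; the mount path `/pvcmnt` and helper names are illustrative assumptions, not taken from the examples:

```python
import json
import os

import tensorflow as tf

def is_chief():
    # TF_CONFIG is injected into each pod by the rendered manifest; with
    # no dedicated chief task, worker 0 conventionally acts as the chief.
    config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    task = config.get("task", {})
    return task.get("type") == "worker" and task.get("index", 0) == 0

def save_final_model(model: tf.keras.Model, mount_path="/pvcmnt"):
    # Only the chief writes to the volume: a ReadWriteOnce claim is
    # mounted by a single node, so other workers must not write here.
    if is_chief():
        model.save(os.path.join(mount_path, "trained_model"))
```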

### Additional prerequisites for the resnet56 example

1. Create a
[service account](https://cloud.google.com/compute/docs/access/service-accounts)
and download its key file in JSON format. Assign the Storage Admin role for
[Google Cloud Storage](https://cloud.google.com/storage/) to this service account:

```bash
gcloud iam service-accounts create <service_account_id> --display-name="<display_name>"
```

```bash
gcloud projects add-iam-policy-binding <project_id> \
--member="serviceAccount:<service_account_id>@<project_id>.iam.gserviceaccount.com" \
--role="roles/storage.admin"
```
2. Create a Kubernetes secret from the JSON key file of your service account:

```bash
kubectl create secret generic credential --from-file=key.json=<path_to_json_file>
```

3. For GPU-based training, ensure your Kubernetes cluster has a node pool with GPUs enabled.
The steps to achieve this on GKE are available [here](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus). A quick check to verify GPU visibility from inside a container is sketched below.
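
For example, from a Python shell inside a running container:

```python
import tensorflow as tf

# An empty list means the pod did not land on a GPU node or the GPU
# drivers are not exposed to the container.
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```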

## Steps to train the MNIST examples

1. Follow the instructions for building the Docker image and pushing it to a Docker registry
in the [Docker README](examples/README.md).

2. Copy the template file `MultiWorkerMirroredTemplate.yaml.jinja`:

```sh
cp kubernetes/MultiWorkerMirroredTemplate.yaml.jinja myjob.template.jinja
```

3. Edit `myjob.template.jinja` to set the job parameters (a sketch of how some of them map to each worker's `TF_CONFIG` follows this list).
1. `script` - the training program to run: `keras_mnist.py`, `custom_training_mnist.py`, or your own training script

2. `name` - the prefix attached to all the Kubernetes jobs created

3. `worker_replicas` - number of parallel worker processes that train the example

4. `port` - the port used by the TensorFlow worker processes to communicate with each other

5. `checkpoint_pvc_name` - name of the persistent-volume-claim that will contain the checkpointed model.

6. `model_checkpoint_dir` - mount location for inspecting the trained model in the volume inspector pod. Set this when deploying the volume inspector pod.

7. `image` - name of the Docker image built in step 1 that will be loaded onto the cluster

8. `deploy` - set to `True` when the manifest is actually expected to be deployed

9. `create_pvc_checkpoint` - creates a ReadWriteOnce persistent volume claim for model checkpoints if needed. The name of the claim, `checkpoint_pvc_name`, must also be specified.

10. `create_volume_inspector` - create a pod to inspect the contents of the volume after the training job is complete. If this is `True`, `deploy` cannot be `True`, since the checkpoint volume can only be mounted read-write by a single node; inspection cannot happen while training is running.
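
To make these parameters concrete, here is a sketch of how `name`, `port`, and `worker_replicas` could determine the `TF_CONFIG` each worker pod receives. The `<name>-worker-<index>` service naming scheme is an assumption for illustration; check the template for the actual addresses.

```python
import json

# Hypothetical values mirroring the template parameters above.
name, port, worker_replicas = "myjob", 5000, 3

def tf_config_for(task_index):
    # Assumes each worker Job is reachable through a service named
    # "<name>-worker-<index>"; the real scheme is defined by the template.
    workers = ["%s-worker-%d:%d" % (name, i, port) for i in range(worker_replicas)]
    return json.dumps({
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": task_index},
    })

print(tf_config_for(0))  # TF_CONFIG injected into worker 0 (the chief)
```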

4. Run the job:
1. Create a namespace to run your training jobs

```sh
kubectl create namespace <namespace>
```

2. [Optional: if the persistent volume claim does not already exist on the cluster] First set `deploy` to `False` and `create_pvc_checkpoint` to `True`, and set `checkpoint_pvc_name` appropriately in the .jinja file. Then run

```sh
python ../../render_template.py myjob.template.jinja | kubectl apply -n <namespace> -f -
```

This will create a persistent volume claim where you can checkpoint your model. On GKE, this claim will auto-create a GCE persistent disk resource to back the claim.

3. Set `deploy` to `True` and `create_pvc_checkpoint` to `False`, with all parameters from step 3 specified, and then run

```sh
python ../../render_template.py myjob.template.jinja | kubectl apply -n <namespace> -f -
```

This will create the Kubernetes jobs on the cluster. Each job has a single service endpoint and a single pod that runs the training image. You can track the running jobs in the cluster by running

```sh
kubectl get jobs -n <namespace>
kubectl describe jobs -n <namespace>
```

To inspect the training logs of the running jobs, run

```sh
# Shows all the running pods
kubectl get pods -n <namespace>
kubectl logs -n <namespace> <pod-name>
```

4. Once the jobs are finished (based on the logs/output of `kubectl get jobs`),
the trained model can be inspected from a volume inspector pod. Set `deploy` to `False`
and `create_volume_inspector` to `True`. Also set `model_checkpoint_dir` to the location where the trained model volume will be mounted. Then run

```sh
python ../../render_template.py myjob.template.jinja | kubectl apply -n <namespace> -f -
```

This will create the volume inspector pod. Then open a shell in the pod:

```sh
kubectl get pods -n <namespace>
kubectl -n <namespace> exec --stdin --tty <volume-inspector-pod> -- /bin/sh
```

The contents of the trained model are available for inspection at `model_checkpoint_dir`.

## Steps to train the resnet56 example

1. Follow the instructions for building the Docker image with `Dockerfile.gpu` and pushing it to a Docker registry
in the [Docker README](examples/README.md).

2. Copy the template file `EnhancedMultiWorkerMirroredTemplate.yaml.jinja`:

```sh
cp kubernetes/EnhancedMultiWorkerMirroredTemplate.yaml.jinja myjob.template.jinja
```
3. Create three buckets for model data, checkpoints, and training logs using either the GCP web UI or the gsutil tool (included with the gcloud tool installed above):

```bash
gsutil mb gs://<bucket_name>
```
You will use these bucket names to set `data_dir`, `log_dir` and `model_dir` in step 5.


4. Download the CIFAR-10 data and place it in your `data_dir` bucket. Head to the [ResNet in TensorFlow](https://github.com/tensorflow/models/tree/r1.13.0/official/resnet#cifar-10) directory to obtain the CIFAR-10 data. Alternatively, you can use this [direct link](https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz) to download and extract the data yourself.

```bash
python cifar10_download_and_extract.py
```

Upload the contents of the `cifar-10-batches-bin` directory to your `data_dir` bucket.

```bash
gsutil -m cp cifar-10-batches-bin/* gs://<your_data_dir>/
```

5. Edit `myjob.template.jinja` to set the job parameters (a sketch of how the training script might consume some of these follows this list).
1. `script` - the training program to run: `keras_resnet_cifar.py` or your own training script

2. `name` - the prefix attached to all the Kubernetes jobs created

3. `worker_replicas` - number of parallel worker processes that train the example

4. `port` - the port used by the TensorFlow worker processes to communicate with each other.

5. `model_dir` - the GCP bucket path that stores the model checkpoints, e.g. `gs://model_dir/`

6. `image` - name of the Docker image built in step 1 that will be loaded onto the cluster

7. `log_dir` - the GCP bucket path where the logs are stored, e.g. `gs://log_dir/`

8. `data_dir` - the GCP bucket path for the CIFAR-10 dataset, e.g. `gs://data_dir/`

9. `gcp_credential_secret` - the name of the secret created in the Kubernetes cluster that contains the service account credentials

10. `batch_size` - the global batch size used for training

11. `num_train_epoch` - the number of training epochs
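
As a sketch of how the training script might consume these values, the snippet below uses absl flags named after the template parameters; the actual flag names and defaults in `keras_resnet_cifar.py` may differ:

```python
from absl import app, flags

FLAGS = flags.FLAGS
# Flag names mirror the template parameters; they are assumptions for
# illustration and may not match keras_resnet_cifar.py exactly.
flags.DEFINE_string("data_dir", None, "GCS path holding the CIFAR-10 binaries.")
flags.DEFINE_string("model_dir", None, "GCS path for model checkpoints.")
flags.DEFINE_string("log_dir", None, "GCS path for TensorBoard logs.")
flags.DEFINE_integer("batch_size", 128, "Global batch size across all workers.")
flags.DEFINE_integer("num_train_epoch", 10, "Number of training epochs.")

def main(argv):
    del argv  # unused
    print("Training from %s for %d epochs" % (FLAGS.data_dir, FLAGS.num_train_epoch))

if __name__ == "__main__":
    app.run(main)
```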

6. Run the job:
1. Create a namespace to run your training jobs

```sh
kubectl create namespace <namespace>
```

2. Deploy the training workloads in the cluster

```sh
python ../../render_template.py myjob.template.jinja | kubectl apply -n <namespace> -f -
```

This will create the Kubernetes jobs on the cluster. Each job has a single service endpoint and a single pod that runs the training image. You can track the running jobs in the cluster by running

```sh
kubectl get jobs -n <namespace>
kubectl describe jobs -n <namespace>
```

By default, this also deploys TensorBoard on the cluster. To find its service:

```sh
kubectl get services -n <namespace> | grep tensorboard
```

Note the external IP corresponding to the service and the previously configured `port` in the YAML.
The TensorBoard service should be accessible on the web at `http://<tensorboard-external-ip>:<port>`.

3. The final model should be available in the GCP bucket corresponding to the `model_dir` configured in the YAML.
@@ -0,0 +1,13 @@
FROM tensorflow/tensorflow:nightly

# Keeps Python from generating .pyc files in the container
ENV PYTHONDONTWRITEBYTECODE=1

# Turns off buffering for easier container logging
ENV PYTHONUNBUFFERED=1

WORKDIR /app

COPY . /app/

ENTRYPOINT ["python", "keras_mnist.py"]
@@ -0,0 +1,30 @@
FROM tensorflow/tensorflow:2.3.1-gpu-jupyter

RUN apt-get update && \
    apt-get install -y python3 python3-pip

RUN pip3 install absl-py && \
pip3 install portpicker

# Install git and vim
RUN apt-get update && \
apt-get install -y git && \
apt-get install -y vim

WORKDIR /app

RUN git clone --single-branch --branch benchmark https://github.com/tensorflow/models.git && \
mv models tensorflow_models && \
git clone https://github.com/tensorflow/model-optimization.git && \
mv model-optimization tensorflow_model_optimization

# Keeps Python from generating .pyc files in the container
ENV PYTHONDONTWRITEBYTECODE=1
# Turns off buffering for easier container logging
ENV PYTHONUNBUFFERED=1

COPY . /app/

ENV PYTHONPATH "${PYTHONPATH}:/:/app/tensorflow_models"

CMD ["python", "resnet_cifar_multiworker_strategy_keras.py"]
@@ -0,0 +1,62 @@
# TensorFlow Docker Images

This directory contains examples of MultiWorkerMirrored training along with the Dockerfiles to build them.

- [Dockerfile](Dockerfile) contains all dependencies required to build a container image with the training examples
- [Dockerfile.gpu](Dockerfile.gpu) contains all dependencies required to build a GPU container image with the TensorFlow model garden
- [keras_mnist.py](keras_mnist.py) demonstrates how to train an MNIST classifier using
[tf.distribute.MultiWorkerMirroredStrategy and the Keras TensorFlow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
- [custom_training_mnist.py](custom_training_mnist.py) demonstrates how to train a Fashion-MNIST classifier using
[tf.distribute.MultiWorkerMirroredStrategy and the TensorFlow 2.0 custom training loop APIs](https://www.tensorflow.org/tutorials/distribute/custom_training).
- [keras_resnet_cifar.py](keras_resnet_cifar.py) demonstrates how to train the resnet56 model on the CIFAR-10 dataset using
[tf.distribute.MultiWorkerMirroredStrategy and the Keras TensorFlow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).

## Best Practices

- Always pin the TensorFlow version with the Docker image tag. This ensures that
TensorFlow updates don't adversely impact your training program for future
runs.
- When creating an image, specify version tags (see below). If you make code
changes, increment the version. Cluster managers will not pull an updated
Docker image if they have it cached. Also, versions ensure that you have
a single copy of the code running for each job.

## Building the Docker Files

Ensure that Docker is installed on your system.

First, pick an image name for the job. When running on a cluster manager, you
will want to push your images to a container registry. Note that both the
[Google Container Registry](https://cloud.google.com/container-registry/)
and the [Amazon EC2 Container Registry](https://aws.amazon.com/ecr/) require
special paths. We append `:v1` to version our images. Versioning images is
strongly recommended for reasons described in the best practices section.

```sh
docker build -t <image_name>:v1 -f Dockerfile .
# Use gcloud docker push instead if on Google Container Registry.
docker push <image_name>:v1
```

If you make any updates to the code, increment the version and rerun the above
commands with the new version.

## Running the keras_mnist.py example

The [keras_mnist.py](keras_mnist.py) example demonstrates how to train an MNIST classifier using
[tf.distribute.MultiWorkerMirroredStrategy and Keras Tensorflow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
The final model is saved to disk by the chief worker process; the disk is assumed to be mounted into the running container by the cluster manager.
The script assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster.
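
For reference, `TF_CONFIG` is a JSON document; a generic way to read it (not copied from keras_mnist.py):

```python
import json
import os

# TF_CONFIG is set on each pod by the cluster manager.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
workers = tf_config.get("cluster", {}).get("worker", [])
task = tf_config.get("task", {})
print("This is worker %s of %d" % (task.get("index", 0), len(workers)))
```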

## Running the custom_training_mnist.py example

The [custom_training_mnist.py](custom_training_mnist.py) example demonstrates how to train a Fashion-MNIST classifier using
[tf.distribute.MultiWorkerMirroredStrategy and the TensorFlow 2.0 custom training loop APIs](https://www.tensorflow.org/tutorials/distribute/custom_training).
The final model is saved to disk by the chief worker process; the disk is assumed to be mounted into the running container by the cluster manager.
It assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster.

## Running the keras_resnet_cifar.py example

The [keras_resnet_cifar.py](keras_resnet_cifar.py) example demonstrates how to train a resnet56 model on the CIFAR-10 dataset using
[tf.distribute.MultiWorkerMirroredStrategy and the Keras TensorFlow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
The final model is saved to the configured GCP storage bucket.
It assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster.