2 changes: 1 addition & 1 deletion README.md
@@ -16,7 +16,7 @@ request.
* Jupyter images for different versions of TensorFlow
* [TFServing](https://github.com/kubeflow/kubeflow/blob/master/user_guide.md#serve-a-model-using-tensorflow-serving) Docker images and K8s templates
- [kubernetes](kubernetes) - Templates for running distributed TensorFlow on
Kubernetes.
Kubernetes. For the most up-to-date examples, please also refer to the [distribution strategy](distribution_strategy) folder.
- [marathon](marathon) - Templates for running distributed TensorFlow using
Marathon, deployed on top of Mesos.
- [hadoop](hadoop) - TFRecord file InputFormat/OutputFormat for Hadoop MapReduce
227 changes: 227 additions & 0 deletions distribution_strategy/multi_worker_mirrored_strategy/README.md
@@ -0,0 +1,227 @@

# MultiWorkerMirroredStrategy training with examples

The steps below train models with [MultiWorkerMirroredStrategy](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy) using the TensorFlow 2.x API on Kubernetes.
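
For orientation, here is a minimal sketch of such a training program; the model and dataset below are illustrative placeholders, not code from the examples directory:

```python
import tensorflow as tf

# The strategy must be created early, before other TensorFlow ops run;
# cluster membership is read from the TF_CONFIG environment variable.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

# Variables created inside the scope are mirrored across all workers.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer="adam",
        metrics=["accuracy"],
    )

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

# model.fit distributes the batches across the workers automatically.
model.fit(x_train, y_train, epochs=3, batch_size=64)
```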

Reference programs [keras_mnist.py](examples/keras_mnist.py),
[custom_training_mnist.py](examples/custom_training_mnist.py), and [keras_resnet_cifar.py](examples/keras_resnet_cifar.py) are available in the examples directory.

The Kubernetes manifest templates and other cluster-specific configuration are available in the [kubernetes](kubernetes) directory.

## Prerequisites

1. (Optional) It is recommended that you have a Google Cloud project. Either create a new project or use an existing one. Install the
[gcloud command-line tools](https://cloud.google.com/functions/docs/quickstart)
on your system, log in, and set your project and zone.

2. [Jinja templates](http://jinja.pocoo.org/) must be installed.

3. A Kubernetes cluster running Kubernetes 1.15 or above must be available. To create a test
cluster on your local machine, [follow the steps here](https://kubernetes.io/docs/tutorials/kubernetes-basics/create-cluster/). Kubernetes clusters can also be created on all major cloud providers; for instance,
here are instructions to [create GKE clusters](https://cloud.google.com/kubernetes-engine/docs/how-to/creating-a-regional-cluster). Make sure that you have at least 12 GB of RAM across all nodes in the cluster. Following these steps should also install the `kubectl` tool on your system.

4. Set context for `kubectl` so that `kubectl` knows which cluster to use:

```bash
kubectl config use-context <cluster_name>
```

5. Install [Docker](https://docs.docker.com/get-docker/) for your system and create an account that you can associate with your container images.

6. For the MNIST examples, a [persistent-volume-claim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) needs to be available for model storage and checkpointing, mounted onto the chief worker pod. The steps below include the YAML to create a persistent-volume-claim for GKE backed by a GCE persistent disk. Only the chief worker writes the final model to this volume; a sketch of that pattern follows.
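
A minimal sketch of the chief-only save pattern; the mount path `/pvcmnt` and helper names are illustrative assumptions, not taken from the examples:

```python
import json
import os

import tensorflow as tf

def is_chief():
    # TF_CONFIG is injected into each pod by the rendered manifest; with
    # no dedicated chief task, worker 0 conventionally acts as the chief.
    config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    task = config.get("task", {})
    return task.get("type") == "worker" and task.get("index", 0) == 0

def save_final_model(model: tf.keras.Model, mount_path="/pvcmnt"):
    # Only the chief writes to the volume: a ReadWriteOnce claim is
    # mounted by a single node, so other workers must not write here.
    if is_chief():
        model.save(os.path.join(mount_path, "trained_model"))
```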

### Additional prerequisites for the resnet56 example

1. Create a
[service account](https://cloud.google.com/compute/docs/access/service-accounts)
and download its key file in JSON format. Assign the Storage Admin role for
[Google Cloud Storage](https://cloud.google.com/storage/) to this service account:

```bash
gcloud iam service-accounts create <service_account_id> --display-name="<display_name>"
```

```bash
gcloud projects add-iam-policy-binding <project_id> \
--member="serviceAccount:<service_account_id>@<project_id>.iam.gserviceaccount.com" \
--role="roles/storage.admin"
```
2. Create a Kubernetes secret from the JSON key file of your service account:

```bash
kubectl create secret generic credential --from-file=key.json=<path_to_json_file>
```

3. For GPU-based training, ensure your Kubernetes cluster has a node pool with GPUs enabled.
The steps to achieve this on GKE are available [here](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus). A quick check to verify GPU visibility from inside a container is sketched below.
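
For example, from a Python shell inside a running container:

```python
import tensorflow as tf

# An empty list means the pod did not land on a GPU node or the GPU
# drivers are not exposed to the container.
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```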

## Steps to train the MNIST examples

1. Follow the instructions for building the Docker image and pushing it to a Docker registry
in the [Docker README](examples/README.md).

2. Copy the template file `MultiWorkerMirroredTemplate.yaml.jinja`:

```sh
cp kubernetes/MultiWorkerMirroredTemplate.yaml.jinja myjob.template.jinja
```

3. Edit `myjob.template.jinja` to set the job parameters (a sketch of how some of them map to each worker's `TF_CONFIG` follows this list).
1. `script` - the training program to run: `keras_mnist.py`, `custom_training_mnist.py`, or your own training script

2. `name` - the prefix attached to all the Kubernetes jobs created

3. `worker_replicas` - number of parallel worker processes that train the example

4. `port` - the port used by the TensorFlow worker processes to communicate with each other

5. `checkpoint_pvc_name` - name of the persistent-volume-claim that will contain the checkpointed model.

6. `model_checkpoint_dir` - mount location for inspecting the trained model in the volume inspector pod. Set this when deploying the volume inspector pod.

7. `image` - name of the Docker image built in step 1 that will be loaded onto the cluster

8. `deploy` - set to `True` when the manifest is actually expected to be deployed

9. `create_pvc_checkpoint` - creates a ReadWriteOnce persistent volume claim for model checkpoints if needed. The name of the claim, `checkpoint_pvc_name`, must also be specified.

10. `create_volume_inspector` - create a pod to inspect the contents of the volume after the training job is complete. If this is `True`, `deploy` cannot be `True`, since the checkpoint volume can only be mounted read-write by a single node; inspection cannot happen while training is running.
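
To make these parameters concrete, here is a sketch of how `name`, `port`, and `worker_replicas` could determine the `TF_CONFIG` each worker pod receives. The `<name>-worker-<index>` service naming scheme is an assumption for illustration; check the template for the actual addresses.

```python
import json

# Hypothetical values mirroring the template parameters above.
name, port, worker_replicas = "myjob", 5000, 3

def tf_config_for(task_index):
    # Assumes each worker Job is reachable through a service named
    # "<name>-worker-<index>"; the real scheme is defined by the template.
    workers = ["%s-worker-%d:%d" % (name, i, port) for i in range(worker_replicas)]
    return json.dumps({
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": task_index},
    })

print(tf_config_for(0))  # TF_CONFIG injected into worker 0 (the chief)
```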

4. Run the job:
1. Create a namespace to run your training jobs

```sh
kubectl create namespace <namespace>
```

2. [Optional: if the persistent volume claim does not already exist on the cluster] First set `deploy` to `False` and `create_pvc_checkpoint` to `True`, and set `checkpoint_pvc_name` appropriately in the .jinja file. Then run

```sh
python ../../render_template.py myjob.template.jinja | kubectl apply -n <namespace> -f -
```

This will create a persistent volume claim where you can checkpoint your model. On GKE, this claim will auto-create a GCE persistent disk resource to back the claim.

3. Set `deploy` to `True` and `create_pvc_checkpoint` to `False`, with all parameters from step 3 specified, and then run

```sh
python ../../render_template.py myjob.template.jinja | kubectl apply -n <namespace> -f -
```

This will create the Kubernetes jobs on the cluster. Each job has a single service endpoint and a single pod that runs the training image. You can track the running jobs in the cluster by running

```sh
kubectl get jobs -n <namespace>
kubectl describe jobs -n <namespace>
```

To inspect the training logs of the running jobs, run

```sh
# Shows all the running pods
kubectl get pods -n <namespace>
kubectl logs -n <namespace> <pod-name>
```

4. Once the jobs are finished (based on the logs/output of `kubectl get jobs`),
the trained model can be inspected from a volume inspector pod. Set `deploy` to `False`
and `create_volume_inspector` to `True`. Also set `model_checkpoint_dir` to the location where the trained model volume will be mounted. Then run

```sh
python ../../render_template.py myjob.template.jinja | kubectl apply -n <namespace> -f -
```

This will create the volume inspector pod. Then open a shell in the pod:

```sh
kubectl get pods -n <namespace>
kubectl -n <namespace> exec --stdin --tty <volume-inspector-pod> -- /bin/sh
```

The contents of the trained model are available for inspection at `model_checkpoint_dir`.

## Steps to train the resnet56 example

1. Follow the instructions for building the Docker image with `Dockerfile.gpu` and pushing it to a Docker registry
in the [Docker README](examples/README.md).

2. Copy the template file `EnhancedMultiWorkerMirroredTemplate.yaml.jinja`:

```sh
cp kubernetes/EnhancedMultiWorkerMirroredTemplate.yaml.jinja myjob.template.jinja
```
3. Create three buckets for model data, checkpoints, and training logs using either the GCP web UI or the gsutil tool (included with the gcloud tool installed above):

```bash
gsutil mb gs://<bucket_name>
```
You will use these bucket names to set `data_dir`, `log_dir` and `model_dir` in step 5.


4. Download the CIFAR-10 data and place it in your `data_dir` bucket. Head to the [ResNet in TensorFlow](https://github.com/tensorflow/models/tree/r1.13.0/official/resnet#cifar-10) directory to obtain the CIFAR-10 data. Alternatively, you can use this [direct link](https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz) to download and extract the data yourself.

```bash
python cifar10_download_and_extract.py
```

Upload the contents of the `cifar-10-batches-bin` directory to your `data_dir` bucket.

```bash
gsutil -m cp cifar-10-batches-bin/* gs://<your_data_dir>/
```

5. Edit `myjob.template.jinja` to set the job parameters (a sketch of how the training script might consume some of these follows this list).
1. `script` - the training program to run: `keras_resnet_cifar.py` or your own training script

2. `name` - the prefix attached to all the Kubernetes jobs created

3. `worker_replicas` - number of parallel worker processes that train the example

4. `port` - the port used by the TensorFlow worker processes to communicate with each other.

5. `model_dir` - the GCP bucket path that stores the model checkpoints, e.g. `gs://model_dir/`

6. `image` - name of the Docker image built in step 1 that will be loaded onto the cluster

7. `log_dir` - the GCP bucket path where the logs are stored, e.g. `gs://log_dir/`

8. `data_dir` - the GCP bucket path for the CIFAR-10 dataset, e.g. `gs://data_dir/`

9. `gcp_credential_secret` - the name of the secret created in the Kubernetes cluster that contains the service account credentials

10. `batch_size` - the global batch size used for training

11. `num_train_epoch` - the number of training epochs
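
As a sketch of how the training script might consume these values, the snippet below uses absl flags named after the template parameters; the actual flag names and defaults in `keras_resnet_cifar.py` may differ:

```python
from absl import app, flags

FLAGS = flags.FLAGS
# Flag names mirror the template parameters; they are assumptions for
# illustration and may not match keras_resnet_cifar.py exactly.
flags.DEFINE_string("data_dir", None, "GCS path holding the CIFAR-10 binaries.")
flags.DEFINE_string("model_dir", None, "GCS path for model checkpoints.")
flags.DEFINE_string("log_dir", None, "GCS path for TensorBoard logs.")
flags.DEFINE_integer("batch_size", 128, "Global batch size across all workers.")
flags.DEFINE_integer("num_train_epoch", 10, "Number of training epochs.")

def main(argv):
    del argv  # unused
    print("Training from %s for %d epochs" % (FLAGS.data_dir, FLAGS.num_train_epoch))

if __name__ == "__main__":
    app.run(main)
```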

6. Run the job:
1. Create a namespace to run your training jobs

```sh
kubectl create namespace <namespace>
```

2. Deploy the training workloads in the cluster

```sh
python ../../render_template.py myjob.template.jinja | kubectl apply -n <namespace> -f -
```

This will create the Kubernetes jobs on the cluster. Each job has a single service endpoint and a single pod that runs the training image. You can track the running jobs in the cluster by running

```sh
kubectl get jobs -n <namespace>
kubectl describe jobs -n <namespace>
```

By default, this also deploys TensorBoard on the cluster. To find its service:

```sh
kubectl get services -n <namespace> | grep tensorboard
```

Note the external IP corresponding to the service and the previously configured `port` in the YAML.
The TensorBoard service should be accessible on the web at `http://<tensorboard-external-ip>:<port>`.

3. The final model should be available in the GCP bucket corresponding to the `model_dir` configured in the YAML.
@@ -0,0 +1,13 @@
FROM tensorflow/tensorflow:nightly

# Keeps Python from generating .pyc files in the container
ENV PYTHONDONTWRITEBYTECODE=1

# Turns off buffering for easier container logging
ENV PYTHONUNBUFFERED=1

WORKDIR /app

COPY . /app/

ENTRYPOINT ["python", "keras_mnist.py"]
@@ -0,0 +1,30 @@
FROM tensorflow/tensorflow:2.3.1-gpu-jupyter

RUN apt-get update && \
    apt-get install -y python3 python3-pip

RUN pip3 install absl-py && \
pip3 install portpicker

# Install git and vim
RUN apt-get update && \
apt-get install -y git && \
apt-get install -y vim

WORKDIR /app

RUN git clone --single-branch --branch benchmark https://github.com/tensorflow/models.git && \
mv models tensorflow_models && \
git clone https://github.com/tensorflow/model-optimization.git && \
mv model-optimization tensorflow_model_optimization

# Keeps Python from generating .pyc files in the container
ENV PYTHONDONTWRITEBYTECODE=1
# Turns off buffering for easier container logging
ENV PYTHONUNBUFFERED=1

COPY . /app/

ENV PYTHONPATH "${PYTHONPATH}:/:/app/tensorflow_models"

CMD ["python", "resnet_cifar_multiworker_strategy_keras.py"]
@@ -0,0 +1,62 @@
# TensorFlow Docker Images

This directory contains examples of MultiWorkerMirrored training along with the Dockerfiles to build them.

- [Dockerfile](Dockerfile) contains all dependencies required to build a container image with the training examples
- [Dockerfile.gpu](Dockerfile.gpu) contains all dependencies required to build a GPU container image with the TensorFlow model garden
- [keras_mnist.py](keras_mnist.py) demonstrates how to train an MNIST classifier using
[tf.distribute.MultiWorkerMirroredStrategy and the Keras TensorFlow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
- [custom_training_mnist.py](custom_training_mnist.py) demonstrates how to train a Fashion-MNIST classifier using
[tf.distribute.MultiWorkerMirroredStrategy and the TensorFlow 2.0 custom training loop APIs](https://www.tensorflow.org/tutorials/distribute/custom_training).
- [keras_resnet_cifar.py](keras_resnet_cifar.py) demonstrates how to train the resnet56 model on the CIFAR-10 dataset using
[tf.distribute.MultiWorkerMirroredStrategy and the Keras TensorFlow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).

## Best Practices

- Always pin the TensorFlow version with the Docker image tag. This ensures that
TensorFlow updates don't adversely impact your training program for future
runs.
- When creating an image, specify version tags (see below). If you make code
changes, increment the version. Cluster managers will not pull an updated
Docker image if they have it cached. Also, versions ensure that you have
a single copy of the code running for each job.

## Building the Docker Files

Ensure that Docker is installed on your system.

First, pick an image name for the job. When running on a cluster manager, you
will want to push your images to a container registry. Note that both the
[Google Container Registry](https://cloud.google.com/container-registry/)
and the [Amazon EC2 Container Registry](https://aws.amazon.com/ecr/) require
special paths. We append `:v1` to version our images. Versioning images is
strongly recommended for reasons described in the best practices section.

```sh
docker build -t <image_name>:v1 -f Dockerfile .
# Use gcloud docker push instead if on Google Container Registry.
docker push <image_name>:v1
```

If you make any updates to the code, increment the version and rerun the above
commands with the new version.

## Running the keras_mnist.py example

The [keras_mnist.py](keras_mnist.py) example demonstrates how to train an MNIST classifier using
[tf.distribute.MultiWorkerMirroredStrategy and Keras Tensorflow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
The final model is saved to disk by the chief worker process; the disk is assumed to be mounted into the running container by the cluster manager.
The script assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster.
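
For reference, `TF_CONFIG` is a JSON document; a generic way to read it (not copied from keras_mnist.py):

```python
import json
import os

# TF_CONFIG is set on each pod by the cluster manager.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
workers = tf_config.get("cluster", {}).get("worker", [])
task = tf_config.get("task", {})
print("This is worker %s of %d" % (task.get("index", 0), len(workers)))
```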

## Running the custom_training_mnist.py example

The [custom_training_mnist.py](custom_training_mnist.py) example demonstrates how to train a Fashion-MNIST classifier using
[tf.distribute.MultiWorkerMirroredStrategy and the TensorFlow 2.0 custom training loop APIs](https://www.tensorflow.org/tutorials/distribute/custom_training).
The final model is saved to disk by the chief worker process; the disk is assumed to be mounted into the running container by the cluster manager.
It assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster.

## Running the keras_resnet_cifar.py example

The [keras_resnet_cifar.py](keras_resnet_cifar.py) example demonstrates how to train a resnet56 model on the CIFAR-10 dataset using
[tf.distribute.MultiWorkerMirroredStrategy and the Keras TensorFlow 2.0 API](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
The final model is saved to the configured GCP storage bucket.
It assumes that the cluster configuration is passed in through the `TF_CONFIG` environment variable when deployed in the cluster.