
Conversation

@RexBearIU (Collaborator)

Description

This update reorganizes the multi-host TPU reinforcement learning tutorial for MaxText, Tunix, and vLLM, adding a table of contents and revising the sections for environment setup, checkpoint conversion, and Docker image creation. It separates the steps for stable versus local builds, updates the workload submission commands for GRPO and GSPO, and adds a section for troubleshooting.

Tests

Verified the updated documentation by walking through the entire workflow, including environment setup, Docker image builds, and workload submission. The commands executed successfully as described. Attached are two test logs confirming the results.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.


codecov bot commented Dec 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.


```
xpk workload create-pathways --workload $WORKLOAD \
--docker-image <path/to/gcr.io> --cluster $TPU_CLUSTER \
--docker-image $CLOUD_IMAGE_NAME --cluster $TPU_CLUSTER \
```

Collaborator:

This should actually be `gcr.io/$PROJECT_ID/$CLOUD_IMAGE_NAME`.

Collaborator Author:

You're absolutely right! Nice catch.
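
For illustration only, the corrected invocation might read roughly as follows (variable names are taken from the snippet above; the tutorial's full flag set may differ):

```bash
# Sketch of the corrected command: reference the image by its full
# registry path rather than by the bare image name.
xpk workload create-pathways --workload $WORKLOAD \
  --docker-image gcr.io/$PROJECT_ID/$CLOUD_IMAGE_NAME \
  --cluster $TPU_CLUSTER
```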

Alternatively, clone the repositories locally and build from local sources:

```bash
# Clone repositories (if not already done)
```

Collaborator:

This needs to be done outside of maxtext.
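
As a hedged illustration of that point (the repository URLs are assumptions, not taken from the tutorial), the clone step could be run from a working directory outside the maxtext checkout:

```bash
# Run from a directory outside the maxtext checkout, per the comment above.
cd ~/workspace

# Clone repositories (if not already done)
git clone https://github.com/AI-Hypercomputer/maxtext.git
git clone https://github.com/google/tunix.git
git clone https://github.com/vllm-project/vllm.git
```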

Collaborator:

Please add a comment.

```
run_name=${RUN_NAME} \
base_output_directory=${BASE_OUTPUT_DIRECTORY} \
hf_access_token=$HF_TOKEN"
hf_access_token=${HF_TOKEN}"
```

@SurbhiJainUSC (Collaborator), Dec 26, 2025:

Do we need to set hf_access_token?

Collaborator Author:

Technically no, as the code block currently ignores that flag. However, I suggest keeping it since it's consistent with our docs and other examples. It causes no issues, and the implementation might be updated to use it later anyway.
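
Purely as a sketch of the convention being kept (the TRAIN_ARGS variable is hypothetical; the tutorial's actual command may differ), the token is exported once and forwarded through the hf_access_token flag even though it is currently ignored:

```bash
# Placeholder token; the flag is kept for consistency with other examples
# even though the current implementation ignores it.
export HF_TOKEN="<your-hugging-face-token>"

# Hypothetical argument string mirroring the quoted diff above.
TRAIN_ARGS="run_name=${RUN_NAME} \
  base_output_directory=${BASE_OUTPUT_DIRECTORY} \
  hf_access_token=${HF_TOKEN}"
```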

- A Pathways-ready GKE cluster (see [create GKE cluster](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster)).

Set up the following environment variables:
## Create Virtual Environment and Install MaxText Dependencies

@SurbhiJainUSC (Collaborator), Dec 29, 2025:

We don't need to install MaxText dependencies for multi-host. This section can be removed. Please verify.

Collaborator Author:

I’ve verified this and you're right. We don't need to install MaxText dependencies for multi-host, so I have removed that section.
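
For orientation only, the environment variables referenced elsewhere in this thread could be set up roughly as follows (names come from the quoted snippets; all values are placeholders):

```bash
# -- Project and cluster (placeholders) --
export PROJECT_ID=<your-gcp-project>
export TPU_CLUSTER=<your-pathways-gke-cluster>
export CLOUD_IMAGE_NAME=<your-image-name>

# -- Training outputs and checkpoint (placeholders) --
export BASE_OUTPUT_DIRECTORY=gs://<your-bucket>/rl-tutorial
export RUN_NAME=<your-run-name>
export MAXTEXT_CKPT_PATH=${BASE_OUTPUT_DIRECTORY}/${RUN_NAME}/0/items

# -- Workload and Hugging Face access (placeholders) --
export WORKLOAD=${RUN_NAME}
export HF_TOKEN=<your-hugging-face-token>
```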

```bash
export MAXTEXT_CKPT_PATH=${BASE_OUTPUT_DIRECTORY}/${RUN_NAME}/0/items

# -- Workload configuration --
export WORKLOAD=${RUN_NAME}
```

Collaborator:

$RUN_NAME and $WORKLOAD are duplicate variables. Maybe we can just keep $WORKLOAD for simplicity.

Collaborator:

Can you also add a note to address b/470463466?

@RexBearIU (Collaborator Author), Dec 30, 2025:

Sure. I've added a note pointing to the Troubleshooting section to clarify the process for handling failed workloads.
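
As a sketch of the simplification suggested above (assuming only $WORKLOAD is kept; the troubleshooting note is paraphrased from this reply):

```bash
# Hypothetical simplification: drop RUN_NAME and key everything off WORKLOAD.
export WORKLOAD=<your-workload-name>
export MAXTEXT_CKPT_PATH=${BASE_OUTPUT_DIRECTORY}/${WORKLOAD}/0/items

# Note: if a workload fails, consult the Troubleshooting section before
# resubmitting (see b/470463466).
```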

@RexBearIU force-pushed the jackyf/docs/rl_multi branch from e06d035 to a6e8759 on December 30, 2025, 04:24.