-
Notifications
You must be signed in to change notification settings - Fork 443
Feat: Update RL on Multi-Host TPUs tutorial for clarity and structure #2890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
| ``` | ||
| xpk workload create-pathways --workload $WORKLOAD \ | ||
| --docker-image <path/to/gcr.io> --cluster $TPU_CLUSTER \ | ||
| --docker-image $CLOUD_IMAGE_NAME --cluster $TPU_CLUSTER \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be actually gcr.io/$PROJECT_ID/$CLOUD_IMAGE_NAME
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely right! Nice catch.
| Alternatively, locally clone the repositories and build with local sources: | ||
|
|
||
| ```bash | ||
| # Clone repositories (if not already done) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be done outside of maxtext.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment.
| run_name=${RUN_NAME} \ | ||
| base_output_directory=${BASE_OUTPUT_DIRECTORY} \ | ||
| hf_access_token=$HF_TOKEN" | ||
| hf_access_token=${HF_TOKEN}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to set hf_access_token?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically no, as the code block currently ignores that flag. However, I suggest keeping it since it's consistent with our docs and other examples. It causes no issues, and the implementation might be updated to use it later anyway.
…RL on Multi-Host TPUs tutorial
| - A Pathways-ready GKE cluster (see [create GKE cluster](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster)). | ||
|
|
||
| Setup following environment variables: | ||
| ## Create Virtual Environment and Install MaxText Dependencies |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to install MaxText dependencies for multi-host. This section can be removed. Please verify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ve verified this and you're right. We don't need to install MaxText dependencies for multi-host, so I have removed that section.
| export MAXTEXT_CKPT_PATH=${BASE_OUTPUT_DIRECTORY}/${RUN_NAME}/0/items | ||
|
|
||
| # -- Workload configuration -- | ||
| export WORKLOAD=${RUN_NAME} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$RUN_NAME and $WORKLOAD are two duplicate variables. Maybe we can just have $WORKLOAD for simplicity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add a note to address b/470463466?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I've added a note pointing to the Troubleshooting section to clarify the process for handling failed workloads.
e06d035 to
a6e8759
Compare
Description
This update reorganizes the multi-host TPU reinforcement learning tutorial for MaxText, Tunix, and vLLM, adding a table of contents and revising the sections for environment setup, checkpoint conversion, and Docker image creation. It separates the steps for stable versus local builds, updates the workload submission commands for GRPO and GSPO, and adds a section for troubleshooting.
Tests
Verified the updated documentation by walking through the entire workflow, including environment setup, Docker image builds, and workload submission. The commands executed successfully as described. Attached are two test logs confirming the results.
GRPO log
GSPO log
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-reviewlabel.