29 changes: 16 additions & 13 deletions docs/guides.md
@@ -1,23 +1,26 @@
-<!--
-Copyright 2024 Google LLC

-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at

-https://www.apache.org/licenses/LICENSE-2.0

-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
--->

-# How-to guides
+# How-to Guides

+Practical step-by-step guides for common tasks, optimizations, and workflows in MaxText.

+## Performance & Optimization
+* [**Optimization Factors**](guides/optimization.md)
+  * Running custom models, configuring sharding strategies, and writing high-performance Pallas kernels.
+* [**Monitoring & Debugging**](guides/monitoring_and_debugging.md)
+  * Tools for diagnosing performance issues, including Goodput monitoring, Cloud Logging, and XProf profiling.

+## Data & Storage
+* [**Data Input Pipelines**](guides/data_input_pipeline.md)
+  * Configuring data loaders for high performance. Includes Grain (ArrayRecord), Hugging Face, and TFDS pipelines.
+* [**Checkpointing**](guides/checkpointing_solutions.md)
+  * Strategies for saving and restoring model state, including GCS checkpointing, emergency recovery, and multi-tier solutions.

+## Development Workflows
+* [**Python Notebooks**](guides/run_python_notebook.md)
+  * Interactive development using Jupyter/Colab on TPUs. Covers local port-forwarding and Colab setups.

```{toctree}
:maxdepth: 1
:hidden:

guides/optimization.md
guides/data_input_pipeline.md
27 changes: 13 additions & 14 deletions docs/reference.md
@@ -1,23 +1,22 @@
-<!--
-Copyright 2024 Google LLC

-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at

-https://www.apache.org/licenses/LICENSE-2.0

-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
--->

-# Reference documentation
+# Reference Documentation

+Technical reference material for MaxText architecture, metrics, and configurations.

+## Core Concepts
+* [**Architecture Overview**](reference/architecture.md)
+  * Deep dive into the design of MaxText and the JAX AI stack choices.
+* [**Core Concepts**](reference/core_concepts.md)
+  * Explanations of fundamental topics like quantization, tiling, MoE configuration, and JAX/XLA/Pallas interactions.

+## Benchmarks & Models
+* [**Performance Metrics**](reference/performance_metrics.md)
+  * Understanding key metrics like Model FLOPs Utilization (MFU), step time, and tokens/second.
+* [**Supported Models**](reference/models.md)
+  * List of supported architectures, model tiering levels, and configuration details.

```{toctree}
:maxdepth: 1
:hidden:

reference/performance_metrics.md
reference/models.md
28 changes: 27 additions & 1 deletion docs/run_maxtext.md
@@ -1,12 +1,38 @@
# Run MaxText

+MaxText provides flexible execution options ranging from local development and single-host experimentation to massively scalable training on thousands of chips. Choose the runbook that matches your infrastructure and goals.

+## Local & Single Host
+Ideal for development, debugging, and small-scale experimentation.

+* [**Localhost / Single VM**](run_maxtext/run_maxtext_localhost.md)
+  * The best starting point. Run directly on a single TPU VM or GPU machine (e.g., A3/H100).
+  * Great for learning the basics, testing configurations, and running small models.

+* [**Single Host GPU Guide**](run_maxtext/run_maxtext_single_host_gpu.md)
+  * Specific instructions for setting up and running on NVIDIA GPUs (A3/H100), including CUDA and Docker setup.

+* [**Decoupled Mode (No Cloud Dependencies)**](run_maxtext/decoupled_mode.md)
+  * Run tests and development loops completely offline without Google Cloud dependencies (GCS, JetStream, etc.).

+## Multi-Host & Cluster (At Scale)
+For large-scale training jobs running on GKE clusters.

+* [**Running with XPK (Recommended)**](run_maxtext/run_maxtext_via_xpk.md)
+  * The standard way to run production workloads on GKE.
+  * Uses the Accelerated Processing Kit (XPK) to orchestrate Docker containers across TPU/GPU clusters.

+* [**Running with Pathways**](run_maxtext/run_maxtext_via_pathways.md)
+  * Advanced orchestration using the Pathways backend on GKE.
+  * Supports both batch jobs and interactive "headless" workloads for development.

```{toctree}
:maxdepth: 1
:hidden:

run_maxtext/run_maxtext_localhost.md
run_maxtext/run_maxtext_single_host_gpu.md
run_maxtext/run_maxtext_via_xpk.md
run_maxtext/run_maxtext_via_pathways.md
run_maxtext/decoupled_mode.md

```
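
The reference index in this change lists Model FLOPs Utilization (MFU), step time, and tokens/second as key metrics. As a rough illustration of how the three relate, here is a minimal sketch. It assumes the common rule of thumb of about 6 FLOPs per parameter per token for dense transformer training, which may differ from how MaxText itself counts FLOPs. The function names and all numbers below are hypothetical and are not taken from the MaxText codebase.

```python
def tokens_per_second(global_batch_size: int, seq_len: int, step_time_s: float) -> float:
    """Throughput: tokens processed in one step divided by the step time."""
    return global_batch_size * seq_len / step_time_s


def model_flops_utilization(
    num_params: float,
    tokens_per_s: float,
    num_chips: int,
    peak_flops_per_chip: float,
) -> float:
    """Achieved model FLOP/s as a fraction of the hardware's peak FLOP/s.

    Assumption: ~6 FLOPs per parameter per token (forward + backward pass),
    ignoring attention FLOPs. This is an approximation, not MaxText's formula.
    """
    achieved_flops_per_s = 6 * num_params * tokens_per_s
    peak_flops_per_s = num_chips * peak_flops_per_chip
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical run: 1B-parameter model, 8 chips at 1e14 peak FLOP/s each.
tps = tokens_per_second(global_batch_size=64, seq_len=2048, step_time_s=2.0)
mfu = model_flops_utilization(
    num_params=1e9, tokens_per_s=tps, num_chips=8, peak_flops_per_chip=1e14
)
print(f"tokens/s: {tps:.0f}, MFU: {mfu:.2%}")
```

With the batch shape and hardware fixed, halving the step time doubles both tokens/second and MFU, which is why the three metrics are usually reported together.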