diff --git a/docs/guides.md b/docs/guides.md
index ade48b4a8..5fd6d644d 100644
--- a/docs/guides.md
+++ b/docs/guides.md
@@ -1,23 +1,26 @@
-
+## Data & Storage
+* [**Data Input Pipelines**](guides/data_input_pipeline.md)
+  * Configuring data loaders for high performance. Includes Grain (ArrayRecord), Hugging Face, and TFDS pipelines.
+* [**Checkpointing**](guides/checkpointing_solutions.md)
+  * Strategies for saving and restoring model state, including GCS checkpointing, emergency recovery, and multi-tier solutions.
-# How-to guides
+## Development Workflows
+* [**Python Notebooks**](guides/run_python_notebook.md)
+  * Interactive development using Jupyter/Colab on TPUs. Covers local port-forwarding and Colab setups.
 ```{toctree}
 :maxdepth: 1
+:hidden:
 guides/optimization.md
 guides/data_input_pipeline.md
diff --git a/docs/reference.md b/docs/reference.md
index 904b95496..f826809e0 100644
--- a/docs/reference.md
+++ b/docs/reference.md
@@ -1,23 +1,22 @@
-
-
-# Reference documentation
+## Benchmarks & Models
+* [**Performance Metrics**](reference/performance_metrics.md)
+  * Understanding key metrics like Model FLOPs Utilization (MFU), step time, and tokens/second.
+* [**Supported Models**](reference/models.md)
+  * List of supported architectures, model tiering levels, and configuration details.
 ```{toctree}
 :maxdepth: 1
+:hidden:
 reference/performance_metrics.md
 reference/models.md
diff --git a/docs/run_maxtext.md b/docs/run_maxtext.md
index 7000face2..c3ad73018 100644
--- a/docs/run_maxtext.md
+++ b/docs/run_maxtext.md
@@ -1,12 +1,38 @@
 # Run MaxText
+MaxText provides flexible execution options ranging from local development and single-host experimentation to massively scalable training on thousands of chips. Choose the runbook that matches your infrastructure and goals.
+
+## Local & Single Host
+Ideal for development, debugging, and small-scale experimentation.
+
+* [**Localhost / Single VM**](run_maxtext/run_maxtext_localhost.md)
+  * The best starting point. Run directly on a single TPU VM or GPU machine (e.g., A3/H100).
+  * Great for learning the basics, testing configurations, and running small models.
+
+* [**Single Host GPU Guide**](run_maxtext/run_maxtext_single_host_gpu.md)
+  * Specific instructions for setting up and running on NVIDIA GPUs (A3/H100), including CUDA and Docker setup.
+
+* [**Decoupled Mode (No Cloud Dependencies)**](run_maxtext/decoupled_mode.md)
+  * Run tests and development loops completely offline, without Google Cloud dependencies (GCS, JetStream, etc.).
+
+## Multi-Host & Cluster (At Scale)
+For large-scale training jobs running on GKE clusters.
+
+* [**Running with XPK (Recommended)**](run_maxtext/run_maxtext_via_xpk.md)
+  * The standard way to run production workloads on GKE.
+  * Uses the Accelerated Processing Kit (XPK) to orchestrate Docker containers across TPU/GPU clusters.
+
+* [**Running with Pathways**](run_maxtext/run_maxtext_via_pathways.md)
+  * Advanced orchestration using the Pathways backend on GKE.
+  * Supports both batch jobs and interactive "headless" workloads for development.
+
 ```{toctree}
 :maxdepth: 1
+:hidden:
 run_maxtext/run_maxtext_localhost.md
 run_maxtext/run_maxtext_single_host_gpu.md
 run_maxtext/run_maxtext_via_xpk.md
 run_maxtext/run_maxtext_via_pathways.md
 run_maxtext/decoupled_mode.md
-
 ```