21 changes: 17 additions & 4 deletions docs/tutorials/posttraining/multimodal.md
@@ -25,7 +25,14 @@ Multimodal Large Language Models (LLMs) extend traditional text-only models by i

## Checkpoint Conversion

Recently we have onboarded a new centralized tool for bidirectional checkpoint conversion between MaxText and HuggingFace ([README](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/utils/ckpt_conversion/README.md)). This tool is used for the Gemma3 model family. Use this command to convert an unscanned checkpoint from HuggingFace to MaxText, and save it to `MAXTEXT_CKPT_GCS_PATH`:
Recently we have onboarded a new centralized tool for bidirectional checkpoint conversion between MaxText and HuggingFace ([README](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/utils/ckpt_conversion/README.md)).

Install PyTorch (CPU-only build):
```shell
python3 -m pip install torch --index-url https://download.pytorch.org/whl/cpu
```

Then use this command to convert an unscanned checkpoint from HuggingFace to MaxText, and save it to `MAXTEXT_CKPT_GCS_PATH`:

```shell
export HF_ACCESS_TOKEN=hf_...
@@ -66,7 +73,7 @@ python -m MaxText.decode \
MaxText/configs/base.yml \
model_name=gemma3-4b \
hf_access_token=$HF_ACCESS_TOKEN \
tokenizer_path=assets/tokenizer.gemma3 \
tokenizer_path=src/MaxText/assets/tokenizer.gemma3 \
load_parameters_path=$MAXTEXT_CKPT_GCS_PATH/0/items \
per_device_batch_size=1 \
run_name=ht_test \
@@ -77,7 +84,7 @@ python -m MaxText.decode \
scan_layers=false \
use_multimodal=true \
prompt='Describe image <start_of_image>' \
image_path='MaxText/test_assets/test_image.jpg' \
image_path='src/MaxText/test_assets/test_image.jpg' \
attention='dot_product'
```

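When a prompt carries images, `max_prefill_predict_length` has to cover the image tokens as well as the text, and `max_target_length` has to leave room for the answer on top of that. The sketch below shows one way to derive `PREDICT_LENGTH` and `TARGET_LENGTH`. It is a rough sizing aid, not part of MaxText: the 256 tokens per image is an assumption based on how Gemma3 encodes images, the text and generation budgets are hypothetical placeholders, and it assumes `max_target_length` counts prompt plus generated tokens.

```shell
# Hypothetical sizing sketch; none of these values are MaxText constants.
NUM_IMAGES=2
TOKENS_PER_IMAGE=256   # assumption: Gemma3 encodes each image into 256 soft tokens
TEXT_TOKENS=64         # assumed budget for the text prompt and image delimiter tokens
GENERATED_TOKENS=128   # assumed budget for the generated answer

# The prefill must fit every image plus the text prompt.
export PREDICT_LENGTH=$(( NUM_IMAGES * TOKENS_PER_IMAGE + TEXT_TOKENS ))
# The target length is assumed to cover the prefill plus the generated tokens.
export TARGET_LENGTH=$(( PREDICT_LENGTH + GENERATED_TOKENS ))

echo "PREDICT_LENGTH=$PREDICT_LENGTH TARGET_LENGTH=$TARGET_LENGTH"
```

With these placeholder budgets the script prints `PREDICT_LENGTH=576 TARGET_LENGTH=704`. Adjust the budgets to your own prompt and expected output, and check the per-image token count against the model you are running.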
@@ -94,10 +101,15 @@ Describe image <start_of_image><end_of_turn>
To decode with multiple images at once, you can provide multiple image paths like this:

```
export TARGET_LENGTH=... # Adjust to fit expected output length
export PREDICT_LENGTH=... # Adjust to fit image tokens + text prompt

python -m MaxText.decode \
MaxText/configs/base.yml \
model_name=gemma3-4b \
... \
max_prefill_predict_length=$PREDICT_LENGTH \
max_target_length=$TARGET_LENGTH \
image_path=/path/to/image1.jpg,/path/to/image2.jpg \
prompt="Describe each image in a short sentence." # <start_of_image> will be added to prompt if not provided
# or prompt="Describe each image in a short sentence: <start_of_image> and <start_of_image>"
@@ -113,8 +125,9 @@ Here, we use [ChartQA](https://huggingface.co/datasets/HuggingFaceM4/ChartQA) as


```shell
export UNSCANNED_CKPT_PATH=... # path to an existing MaxText checkpoint, or to the one converted in the previous step
python -m MaxText.sft_trainer \
$MAXTEXT_REPO_ROOT/configs/sft-vision-chartqa.yml \
src/MaxText/configs/sft-vision-chartqa.yml \
run_name="chartqa-sft" \
model_name=gemma3-4b \
tokenizer_path="google/gemma-3-4b-it" \