
Conversation

@guxm2021 (Contributor):

This PR extends the Penzai codebase to support Gemma 3 models. The key changes are as follows:

  • Add the parameters use_qk_norm, local_scale_factor, global_scale_factor, local_rope_wavelength, and global_rope_wavelength to llamalike_common.py.
  • Add the functions _query_norm and _key_norm to llamalike_common.py.
  • Add an extra scale_factor argument to pz.nn.ApplyRoPE in nn/embeddings.py.
  • Add parameters for the Gemma 3 models to gemma.py.
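For background, query/key ("QK") normalization applies an RMS norm to the per-head query and key projections before rotary embeddings and attention. A minimal plain-JAX sketch of the idea (the function names below are illustrative, not Penzai's actual API):

```python
import jax
import jax.numpy as jnp


def rms_norm(x: jax.Array, scale: jax.Array, eps: float = 1e-6) -> jax.Array:
  # Normalize over the last (projection) axis, then apply a learned scale
  # using the Gemma-style (1 + scale) parameterization.
  var = jnp.mean(jnp.square(x), axis=-1, keepdims=True)
  return x * jax.lax.rsqrt(var + eps) * (1.0 + scale)


def qk_norm(queries, keys, q_scale, k_scale):
  # Gemma-3-style QK norm: RMS-normalize queries and keys independently
  # before rotary position embeddings and attention.
  return rms_norm(queries, q_scale), rms_norm(keys, k_scale)
```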

@danieldjohnson (Collaborator) left a comment:

This looks great, thanks for the PR!

Left a few fairly minor comments about the conversion below.

Also, I'm curious how you tested this. Have you confirmed that the Flax and Penzai implementations produce the same output given the same input? (There are some similar tests in https://github.com/google-deepmind/penzai/blob/main/tests/models/transformer_consistency_test.py for HuggingFace models, ideally there would be similar tests backed by the official gemma PyPI package. These aren't there right now because at the time this was originally written that package didn't exist. I don't think this is required right now if you don't feel like adding it, but it would be good to at least check manually that they give the same numbers in a notebook or something, if you haven't already.)
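For the manual check described above, the comparison can be as simple as the following sketch, assuming you already have logits from running both models on the same token IDs (the helper is illustrative, not part of the PR or its tests):

```python
import numpy as np


def assert_logits_match(penzai_logits, flax_logits, atol=1e-4, rtol=1e-4):
  """Checks that two forward passes on the same tokens give the same logits."""
  # Loose tolerances absorb bfloat16 / float32 precision differences.
  np.testing.assert_allclose(
      np.asarray(penzai_logits), np.asarray(flax_logits), atol=atol, rtol=rtol
  )
```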

On the subject of testing, it would also be great if you could add some tests to https://github.com/google-deepmind/penzai/blob/main/tests/models/transformer_llamalike_test.py to make sure the new configurations execute correctly. You can use a smaller model here since it's mostly a test that the components don't raise errors.

It would also be worthwhile to edit the documentation to document how to load gemma 3, specifically here: https://github.com/google-deepmind/penzai/blob/main/docs/guides/howto_reference.md#loading-pretrained-models

Comment on lines 221 to 228
preset_name: Literal[
"gemma_2b", "gemma_7b", "gemma2_2b", "gemma2_9b", "gemma2_27b",
"gemma3_1b", "gemma3_4b", "gemma3_12b", "gemma3_27b",
],
upcast_activations_to_float32: bool = False,
use_layer_stack: bool = False,
preset_name: Literal[
"gemma_2b", "gemma_7b", "gemma2_2b", "gemma2_9b", "gemma2_27b", "auto"
] = "auto",
) -> model_parts.TransformerLM:
"""Builds a Gemma model from a pretrained checkpoint.
@danieldjohnson (Collaborator):

It is too bad that this is a breaking change in the function signature, since this means existing code will no longer work. Is there some way to do this in a backwards compatible way?

I think it's OK if "auto" does not allow loading gemma 3 models, but it would be nice if it was still possible for us to load gemma 1 and gemma 2 in "auto" mode. Maybe there are differences in the parameter names that we can use, like _query_norm?

Ideal solution would be something like:

  • keep preset name where it is with "auto" as the default argument
  • check if this is gemma 3 by looking at something about the params
  • if it is gemma 3, raise a ValueError and say that you need to specify preset_name
  • if it is gemma 1 or 2, emit a warning saying you should specify preset name, but then infer it like it is being inferred now

(Probably long term it makes sense to just require the preset to be specified directly, but I'd prefer not to make breaking changes too often if possible.)

@guxm2021 (Contributor Author):

Thank you for your suggestion. I have now written code so that "auto" mode can load Gemma 3 models by checking whether the checkpoint has QK norm parameters.
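A rough sketch of what such a check might look like (the helper name and the traversal are illustrative, not the PR's actual code; it simply looks for the query-norm parameters that only Gemma 3 checkpoints carry):

```python
def _looks_like_gemma3(params: dict) -> bool:
  """Heuristic: only Gemma 3 checkpoints carry per-layer query/key norm scales."""
  for key, value in params.items():
    if "_query_norm" in str(key):
      return True
    if isinstance(value, dict) and _looks_like_gemma3(value):
      return True
  return False
```

The "auto" path can then raise a ValueError asking for an explicit preset_name when this returns True, and fall back to the existing inference (with a warning) for Gemma 1 and 2, as suggested above.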

global_scale_factor: Scale factor for the global RoPE layers.
local_rope_wavelength: Wavelength for the local RoPE layers.
global_rope_wavelength: Wavelength for the global RoPE layers.
"""
@danieldjohnson (Collaborator), Jun 8, 2025:

Minor, but can we make it so that rope_wavelength can be None, and build_llamalike_attention checks to make sure either rope_wavelength is set OR both local_rope_wavelength and global_rope_wavelength are set, but not both?

@guxm2021 (Contributor Author):

Because LlamalikeTransformerConfig is used to pass the parameters to build_llamalike_attention, we first need to construct a config object from the Gemma 3 parameter dictionary, and at that point the config may already need both local_rope_wavelength and global_rope_wavelength set. I really appreciate the idea of making it simpler.

@danieldjohnson (Collaborator):

Sorry, I don't think I understand what you mean. Are you saying there's some constraint on what works here?

Actually, though, I think the simplest thing to do would be to say that rope_wavelength always means the global RoPE wavelength, and just add local_rope_wavelength: float | None = None. Then, for local RoPE, if config.local_rope_wavelength is not None we use config.local_rope_wavelength and otherwise we use config.rope_wavelength. For global RoPE, we always use config.rope_wavelength.

We could annotate it as

rope_wavelength: Wavelength for global RoPE layers (and for local RoPE layers if local_rope_wavelength is not set).
...
local_rope_wavelength: Wavelength for the local RoPE layers. If None, local RoPE layers will use the same wavelength as global RoPE layers (config.rope_wavelength)
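Concretely, the fallback being described amounts to the following (a sketch; the field names follow the comment above, not necessarily the final code):

```python
def resolve_rope_wavelengths(config) -> tuple[float, float]:
  # Global RoPE layers always use the base wavelength; local layers fall back
  # to it when no local override is configured.
  global_wavelength = config.rope_wavelength
  local_wavelength = (
      config.local_rope_wavelength
      if config.local_rope_wavelength is not None
      else config.rope_wavelength
  )
  return local_wavelength, global_wavelength
```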

if config.use_qk_norm:
input_to_query_sublayers.append(
pz.nn.RMSLayerNorm.from_config(
name=f"{name}/_query_norm",
@danieldjohnson (Collaborator):

Let's remove the leading underscore? I'm not sure why the original parameters have an underscore here, but it seems nicer if the Penzai version doesn't have one. The parameter names are already not exactly the same as the Flax version. (Same comment for _key_norm)

@guxm2021 (Contributor Author):

I have fixed it.

Comment on lines 549 to 554
if config.num_decoder_blocks % len(config.attention_type) != 0:
raise ValueError(
"Per-layer attention types must have a length that divides the"
" number of blocks."
logging.warning(
"Please ensure that you are using Gemma3 models."
"For other models, per-layer attention types must have a length "
"that divides the number of blocks."
)
@danieldjohnson (Collaborator):

Hm, this seems less safe and also pretty confusing for users. I don't think we should bypass this check.

Instead, can you do the adjustment in the _GEMMA_PRESETS constant? So, e.g., for "gemma3_1b", the "attention_type" field should be a tuple of length 26. You can do something like ((...,) * 5 + (...,)) to avoid typing it all out.

(Motivation here is that we don't want someone to accidentally mess up their config and end up with a different pattern of attention layers than they expected. It's pretty obvious what should happen when attention types divides number of blocks, but allowing e.g. off-by-one errors seems like it could be a footgun.)

@guxm2021 (Contributor Author):

Thank you for your suggestions. I have kept the original check. Instead, following the gemma package, I added a make_attention_layers_types function in gemma.py and simplified the attention_type argument.
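A plausible shape for such a helper, following the tiling-and-truncation behaviour of the gemma package (a sketch only; the actual implementations in gemma and in this PR may differ in detail):

```python
from typing import Sequence, Tuple, TypeVar

_T = TypeVar("_T")


def _make_attention_layers_types(
    pattern: Sequence[_T], num_layers: int
) -> Tuple[_T, ...]:
  # Tile the per-layer pattern (e.g. five sliding-window layers followed by
  # one global layer) and truncate it to exactly num_layers entries.
  repeats, remainder = divmod(num_layers, len(pattern))
  return tuple(pattern) * repeats + tuple(pattern)[:remainder]
```

With this, a preset like "gemma3_1b" can expand its pattern into an explicit 26-entry attention_type tuple, so the divisibility check in llamalike_common.py stays intact.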

each token in the sequence. This side input should be provided as an
integer array that is broadcastable with the input, and which does NOT
include the embedding axis.
# NOTE: add extra arguments to support Gemma3 models.
@danieldjohnson (Collaborator):

nit: I think it's better for comments to describe the current state of the code rather than the process of when arguments were added. Could you instead make this say something like

scale_factor: The scale factor to use for the positional embeddings (used by Gemma 3 models)

Also please remove the "# NOTE: add extra arguments to support Gemma3 models." comments here and below.

@guxm2021 (Contributor Author):

Thank you for your suggestions. I have fixed it.

sinusoid_inp = position / timescale
# NOTE: add extra arguments to support Gemma3 models.
if self.scale_factor < 1.0:
raise ValueError("scale_factor must be >= 1.0, got {scale_factor")
@danieldjohnson (Collaborator):

Looks like a typo in format string syntax here?

@guxm2021 (Contributor Author):

I have fixed it.
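For reference, the corrected validation presumably ends up as an f-string along these lines (a sketch based on the quoted diff; exactly where the scaling is applied is an assumption):

```python
if self.scale_factor < 1.0:
  raise ValueError(f"scale_factor must be >= 1.0, got {self.scale_factor}")
# Assumed placement: stretch the rotary phase by the scale factor so that the
# longer Gemma 3 context reuses the original wavelength range.
sinusoid_inp = sinusoid_inp / self.scale_factor
```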

@danieldjohnson (Collaborator):

Also, mind using pyink to format your code so that our CI doesn't complain?

guxm2021 added 4 commits June 12, 2025 10:30
resolve the comments from Daniel by removing "# NOTE: add extra arguments to support Gemma3 models." and fixing a typo in format string syntax
resolve the comments from Daniel by enabling "auto" loading gemma 3 models, deleting the leading underscore in qk norm
resolve the comments from Daniel by deleting the leading underscore for qk norm, retaining the check that attention types divide the number of blocks
add instructions to load gemma3 models
@guxm2021 (Contributor Author) commented Jun 12, 2025:

Thank you, Daniel, for your detailed comments and suggestions. This week has been quite busy for me, so I was unable to respond to your comments earlier. I have revised my code following your comments; please let me know if you have further feedback. Regarding the tests, I ran some experiments on Colab to compare the forward pass of the pre-trained Gemma 3 models between Penzai and the gemma package. The results match, which suggests my implementation is correct. It is currently not very convenient to upload test Python files, since all my experiments run internally on Colab, but I will share some documents/tutorials on using Penzai for interpretability research in the future, and I will include some basic tests.

Currently, I have not enabled Penzai to load the vision module of Gemma 3 models, but I will do so in the near future.

@danieldjohnson (Collaborator) left a comment:

Thanks for the changes! Left a few more small comments.

Also, looks like `uv run pyink penzai tests --check` is still failing. Can you make sure all of the checks in our CI script pass?

## Loading Pretrained Models

### Loading Gemma or Gemma 2
### Loading Gemma or Gemma 2 or Gemma 3
@danieldjohnson (Collaborator):

nit: can you make this "Loading Gemma (1, 2, or 3)"

ckpt_path = os.path.join(weights_dir, 'gemma2_9b_pt')
```

For instance, to load the Gemma 3 4B model, you can use:
@danieldjohnson (Collaborator):

nit: can you make this just

To load the Gemma 3 4B model, you can use:

@guxm2021 (Contributor Author):

I have fixed it.
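For readers following along, the resulting docs example presumably mirrors the Gemma 2 snippet above with the Gemma 3 checkpoint and preset swapped in. A sketch, in which the Kaggle handle, the checkpoint directory name, and the gemma_from_pretrained_checkpoint call are assumptions; only preset_name="gemma3_4b" is taken from this PR:

```python
import os

import kagglehub
import orbax.checkpoint
from penzai.models.transformer.variants import gemma

# Download the Gemma 3 4B Flax checkpoint (handle and folder name assumed).
weights_dir = kagglehub.model_download('google/gemma-3/flax/gemma3-4b')
ckpt_path = os.path.join(weights_dir, 'gemma3_4b')

# Restore the raw parameter tree and build the Penzai model from it.
checkpointer = orbax.checkpoint.PyTreeCheckpointer()
flax_params_dict = checkpointer.restore(ckpt_path)
model = gemma.gemma_from_pretrained_checkpoint(
    flax_params_dict,
    preset_name="gemma3_4b",
)
```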

from penzai.models.transformer.variants import llamalike_common


def make_attention_layers_types(
@danieldjohnson (Collaborator):

nit: can you add an underscore at the beginning to make this private (_make_attention_layers_types )

@guxm2021 (Contributor Author):

I have fixed it.

@guxm2021 (Contributor Author):

Thank you for your comments. I have addressed them according to your suggestions. Regarding the CI script check, please allow me some more time to fix it, as I am not familiar with it, although my working environment already seems to use pyink by default.

@guxm2021 (Contributor Author):

@danieldjohnson Hi Daniel, I have now addressed all of your comments. Sorry for making this PR a little messy. I ran the unit tests on my own fork, and my most recent commit passes all the checks.

@danieldjohnson (Collaborator) left a comment:

Looks great! Thanks for doing this.

@danieldjohnson merged commit 8aa4aa6 into google-deepmind:main on Jun 20, 2025
2 checks passed