Conversation
@NeoLegends commented Dec 22, 2025

Mohammad did not use the scale in a pure LM setup.

@NeoLegends self-assigned this Dec 22, 2025
@albertz commented Dec 22, 2025

> I think in a pure LM setup you don't use the scale.

I don't think this is true in general. I have seen both variants. Also for LMs. For example, Gemma3:

```python
self.embed_tokens = Gemma3TextScaledWordEmbedding(
    config.vocab_size, config.hidden_size, self.padding_idx, embed_scale=self.config.hidden_size**0.5
)
```
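
For reference, a minimal sketch of what such a forward-time scale amounts to (the class and parameter names here are illustrative, not taken from Gemma3 or RETURNN):

```python
import torch
import torch.nn as nn


class ScaledEmbedding(nn.Module):
    """Token embedding that applies the sqrt(d_model) scale in forward (sketch)."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.scale = d_model**0.5  # the embed_scale discussed above

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # The scale is applied at every forward pass, not baked into the weights.
        return self.embed(token_ids) * self.scale
```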

Note, there are some other things to consider:

If you don't apply the scale in the forward pass, what people do instead is to apply the scale during init, or to make the random init very large. E.g. nanochat:

```python
elif isinstance(module, nn.Embedding):
    torch.nn.init.normal_(module.weight, mean=0.0, std=1.0)
```
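
To make the connection explicit, here is a sketch (my own, not nanochat code) of why a large init std can stand in for the forward-time scale, at least at initialization:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 32000
tokens = torch.tensor([42, 7, 123])

# Variant A: small init std, sqrt(d_model) scale applied in forward.
emb_a = nn.Embedding(vocab_size, d_model)
nn.init.normal_(emb_a.weight, mean=0.0, std=d_model**-0.5)
out_a = emb_a(tokens) * d_model**0.5

# Variant B: no forward scale, but std=1.0 init (as in the nanochat snippet).
emb_b = nn.Embedding(vocab_size, d_model)
nn.init.normal_(emb_b.weight, mean=0.0, std=1.0)
out_b = emb_b(tokens)

# At init, both outputs have per-dimension std ~= 1. During training the two
# variants can still diverge: weight decay, Adam's per-parameter scaling, and
# the gradient magnitudes all see differently scaled weights.
print(out_a.std().item(), out_b.std().item())
```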

I also saw that some people use custom (much larger) LRs for embeddings, which again might compensate for not using a scale. E.g. see nanochat.
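
That LR variant would look roughly like this (a sketch of optimizer param groups; `model.embed`, the multiplier, and the use of AdamW are my assumptions here, not nanochat's exact setup):

```python
import torch


def make_optimizer(model: torch.nn.Module, base_lr: float = 1e-3, embed_lr_mult: float = 10.0):
    # Assumes the model exposes its input embedding as `model.embed` (hypothetical).
    embed_params = list(model.embed.parameters())
    embed_ids = {id(p) for p in embed_params}
    other_params = [p for p in model.parameters() if id(p) not in embed_ids]
    return torch.optim.AdamW(
        [
            {"params": other_params, "lr": base_lr},
            {"params": embed_params, "lr": base_lr * embed_lr_mult},  # larger LR for embeddings
        ]
    )
```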

If you share the embedding weights with the LM head, this might affect whether you want such a scale or not (I'm not sure in what way, though...). Most LMs do this.
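
For context, weight tying in its usual form (again just a sketch, not RETURNN code); note that with an input-side scale, the one shared matrix effectively acts at two different scales:

```python
import torch
import torch.nn as nn


class TiedLM(nn.Module):
    """Sketch of embedding / LM-head weight tying with an input-side scale."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # tie: a single shared matrix
        self.input_scale = d_model**0.5

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids) * self.input_scale  # scaled on the input side
        # ... transformer layers would go here ...
        return self.lm_head(x)  # same matrix, used unscaled as the output projection
```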

On the docstring for `input_embedding_scale`:

```diff
 input_dropout: Dropout applied to the input embedding.
 input_embedding_scale: Scale applied to the input embedding.
-    Set to `None` to apply a (tuned) default.
+    Set to `None` to apply a default that is suitable for ASR AED decoder models.
```

I would not mention any specific model at all here. I think this is just confusing. I would instead just say what default you use.
