Conversation
@NeoLegends commented Dec 22, 2025

Mohammad did not use the scale in a pure LM setup.

@NeoLegends self-assigned this Dec 22, 2025
@albertz commented Dec 22, 2025

> I think in a pure LM setup you don't use the scale.

I don't think this is true in general. I have seen both variants. Also for LMs. For example, Gemma3:

```python
self.embed_tokens = Gemma3TextScaledWordEmbedding(
    config.vocab_size, config.hidden_size, self.padding_idx, embed_scale=self.config.hidden_size**0.5
)
```
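
For reference, a minimal sketch of what such a forward-time scale amounts to (the class and parameter names here are illustrative, not taken from Gemma3 or RETURNN):

```python
import torch
import torch.nn as nn


class ScaledEmbedding(nn.Module):
    """Token embedding that applies the sqrt(d_model) scale in forward (sketch)."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.scale = d_model**0.5  # the embed_scale discussed above

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # The scale is applied at every forward pass, not baked into the weights.
        return self.embed(token_ids) * self.scale
```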

Note, there are some other things to consider:

If you don't apply the scale in the forward pass, what people do instead is to apply the scale during init, or to make the random init very large. E.g. nanochat:

```python
elif isinstance(module, nn.Embedding):
    torch.nn.init.normal_(module.weight, mean=0.0, std=1.0)
```
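
To make the connection explicit, here is a sketch (my own, not nanochat code) of why a large init std can stand in for the forward-time scale, at least at initialization:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 32000
tokens = torch.tensor([42, 7, 123])

# Variant A: small init std, sqrt(d_model) scale applied in forward.
emb_a = nn.Embedding(vocab_size, d_model)
nn.init.normal_(emb_a.weight, mean=0.0, std=d_model**-0.5)
out_a = emb_a(tokens) * d_model**0.5

# Variant B: no forward scale, but std=1.0 init (as in the nanochat snippet).
emb_b = nn.Embedding(vocab_size, d_model)
nn.init.normal_(emb_b.weight, mean=0.0, std=1.0)
out_b = emb_b(tokens)

# At init, both outputs have per-dimension std ~= 1. During training the two
# variants can still diverge: weight decay, Adam's per-parameter scaling, and
# the gradient magnitudes all see differently scaled weights.
print(out_a.std().item(), out_b.std().item())
```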

I also saw that some people use custom (much larger) LRs for embeddings, which again might compensate for not using a scale. E.g. see nanochat.
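
That LR variant would look roughly like this (a sketch of optimizer param groups; `model.embed`, the multiplier, and the use of AdamW are my assumptions here, not nanochat's exact setup):

```python
import torch


def make_optimizer(model: torch.nn.Module, base_lr: float = 1e-3, embed_lr_mult: float = 10.0):
    # Assumes the model exposes its input embedding as `model.embed` (hypothetical).
    embed_params = list(model.embed.parameters())
    embed_ids = {id(p) for p in embed_params}
    other_params = [p for p in model.parameters() if id(p) not in embed_ids]
    return torch.optim.AdamW(
        [
            {"params": other_params, "lr": base_lr},
            {"params": embed_params, "lr": base_lr * embed_lr_mult},  # larger LR for embeddings
        ]
    )
```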

If you share the embedding weights with the LM head, this might affect whether you want such a scale or not (I'm not sure in what way, though...). Most LMs do this.
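
For context, weight tying in its usual form (again just a sketch, not RETURNN code); note that with an input-side scale, the one shared matrix effectively acts at two different scales:

```python
import torch
import torch.nn as nn


class TiedLM(nn.Module):
    """Sketch of embedding / LM-head weight tying with an input-side scale."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # tie: a single shared matrix
        self.input_scale = d_model**0.5

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids) * self.input_scale  # scaled on the input side
        # ... transformer layers would go here ...
        return self.lm_head(x)  # same matrix, used unscaled as the output projection
```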

On the docstring for `input_embedding_scale`:

```diff
 input_dropout: Dropout applied to the input embedding.
 input_embedding_scale: Scale applied to the input embedding.
-    Set to `None` to apply a (tuned) default.
+    Set to `None` to apply a default that is suitable for ASR AED decoder models.
```

I would not mention any specific model at all here. I think this is just confusing. I would instead just say what default you use.
