Conversation

@ngc92 ngc92 commented Jan 19, 2026

Due to the boundedness of sin/cos, we don't need the BF16 range there and can instead invest in better precision. By keeping it 16-bit, no changes to the kernels are needed.
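
To illustrate the precision argument (an editorial sketch, not part of the PR): sin/cos values never leave [-1, 1], so FP16's 10 explicit mantissa bits give a smaller rounding error than BF16's 7, and FP16's narrower exponent range costs nothing here.

    // Standalone check (illustrative only): rounding error of a bounded rotary
    // value when stored as FP16 vs BF16. Compile with nvcc.
    #include <cuda_fp16.h>
    #include <cuda_bf16.h>
    #include <cmath>
    #include <cstdio>

    int main() {
        float x = sinf(0.12345f);  // representative sin/cos value, always in [-1, 1]
        float as_fp16 = __half2float(__float2half(x));           // 10 explicit mantissa bits
        float as_bf16 = __bfloat162float(__float2bfloat16(x));   // 7 explicit mantissa bits
        printf("fp32=%.8f  fp16 err=%.2e  bf16 err=%.2e\n",
               x, fabsf(x - as_fp16), fabsf(x - as_bf16));
        return 0;
    }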

Copilot AI review requested due to automatic review settings January 19, 2026 10:59

Copilot AI left a comment

Pull request overview

This PR changes the data type used for ROPE (Rotary Position Embedding) frequency storage from BF16 to FP16 in BF16 models. The motivation is that sine/cosine values are bounded in [-1, 1], so the extended exponent range of BF16 is unnecessary, and FP16's better mantissa precision reduces rounding errors.

Changes:

  • Modified frequency tensor allocation to use FP16 instead of BF16 when the model uses BF16
  • Updated ROPE kernel templates to support mixed precision (BF16 I/O with FP16 frequencies); see the sketch after this list
  • Updated function signatures across the codebase to reflect the new type pairing
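
To make the mixed-precision point concrete, here is a minimal sketch of what such a templated kernel can look like; the names and signature are illustrative and not the repository's actual rope.cu code:

    // Illustrative sketch: separate template parameters for the I/O type and the
    // frequency type, so BF16 activations can use an FP16 cos/sin table.
    #include <cuda_fp16.h>
    #include <cuda_bf16.h>

    template <typename TIO, typename TFreq>
    __global__ void rope_apply_sketch(TIO* q, const TFreq* freqs, int pairs) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= pairs) return;
        // Arithmetic happens in fp32; only the storage types differ.
        float c  = static_cast<float>(freqs[2 * i]);      // cos component
        float s  = static_cast<float>(freqs[2 * i + 1]);  // sin component
        float x0 = static_cast<float>(q[2 * i]);
        float x1 = static_cast<float>(q[2 * i + 1]);
        q[2 * i]     = TIO(x0 * c - x1 * s);
        q[2 * i + 1] = TIO(x0 * s + x1 * c);
    }

    // For BF16 models this would be instantiated as
    // rope_apply_sketch<__nv_bfloat16, __half>.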

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Changed files:

  • src/models/llama_run_state.cpp: Allocates frequency tensors as FP16 for BF16 models; precomputes frequencies using FP16 (see the dtype-selection sketch after this list)
  • src/kernels/rope.cu: Adds FP16 precompute function; templates the kernel to support different I/O and frequency types; updates function signatures
  • src/kernels/kernels.h: Updates function declarations to use half* for frequencies with BF16 I/O
  • src/kernels/kernels.cpp: Updates tensor dispatchers to extract frequencies as half when the model is BF16
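
The dtype choice described for src/models/llama_run_state.cpp above can be summarized as follows; this is a hypothetical helper (not the actual allocation code), assuming the repository's ETensorDType enum also has a BF16 member:

    // Hypothetical helper: pick the storage dtype for the ROPE frequency tensor.
    ETensorDType rope_freq_dtype(ETensorDType model_dtype) {
        // sin/cos are bounded in [-1, 1], so FP16's smaller exponent range is enough
        // and its extra mantissa bits reduce rounding error compared to BF16.
        if (model_dtype == ETensorDType::BF16) return ETensorDType::FP16;
        return model_dtype;  // FP16 and FP32 models keep their own dtype
    }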

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

src/models/llama_run_state.cpp:84

  • The function should handle unsupported dtypes. If Config.DType is set to an FP8 type or another unsupported dtype, the function will silently return an uninitialized tensor. Consider adding an else clause that throws an error for unsupported dtypes, similar to the pattern used in rope_forward and rope_backward functions.
    if(dtype == ETensorDType::FP16) {
        std::vector<half> freq_cpu(Config.MaxPositionEmbeddings * 2 * Config.head_size());
        precompute_freqs_cis(freq_cpu.data(), Config.head_size(), Config.MaxPositionEmbeddings, Config.RopeTheta);
        CUDA_CHECK(cudaMemcpy(freq.Data, freq_cpu.data(), freq_cpu.size() * sizeof(half), cudaMemcpyHostToDevice));
    } else if (dtype == ETensorDType::FP32) {
        std::vector<float> freq_cpu(Config.MaxPositionEmbeddings * 2 * Config.head_size());
        precompute_freqs_cis(freq_cpu.data(), Config.head_size(), Config.MaxPositionEmbeddings, Config.RopeTheta);
        CUDA_CHECK(cudaMemcpy(freq.Data, freq_cpu.data(), freq_cpu.size() * sizeof(float), cudaMemcpyHostToDevice));
    }
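
A minimal sketch of the else branch the comment asks for (the exact error type and message are assumptions; the repository's rope_forward/rope_backward pattern may differ):

    } else {
        // Fail loudly instead of returning an uninitialized tensor; requires <stdexcept>.
        throw std::runtime_error("ROPE frequency precompute: unsupported dtype");
    }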

@ngc92 ngc92 merged commit ee606ff into dev Jan 20, 2026
35 checks passed