[large tensor] fix CUDA extensions int64 overflow for large tensor dimensions #561
zrr1999 wants to merge 1 commit into PaddlePaddle:develop from
Conversation
Pull request overview
This PR hardens the CUDA custom operators for the large-tensor / large-dimension case: selected indices and counts are upgraded from int to int64_t, and INT_MAX boundary guards plus 0-size fast returns are added in front of several kernel launchers, to avoid integer overflow and invalid kernel launches.
Changes:
- Switch element counts and loop indices in several kernels and helper functions to int64_t, reducing the risk of overflow at large sizes
- Add INT_MAX upper-bound checks and 0-size early returns for several operators, avoiding illegal configurations and invalid launches
- Slightly adjust some branches to skip unnecessary kernel launches

A minimal sketch of the combined pattern is shown after this list.
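For illustration only, here is a rough sketch of that pattern (0-size fast return, INT_MAX guard before launch, int64_t indices inside the kernel). The kernel and launcher names are invented for this example and are not code from the PR:

```cpp
// Illustrative sketch, not the PR's actual code.
#include <cuda_runtime.h>
#include <cstdint>
#include <limits>
#include <stdexcept>

template <typename T>
__global__ void scale_kernel(const T* in, T* out, int64_t n, T alpha) {
  // int64_t grid-stride loop: safe even when n does not fit in 32 bits.
  const int64_t stride = static_cast<int64_t>(blockDim.x) * gridDim.x;
  for (int64_t i = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       i < n; i += stride) {
    out[i] = in[i] * alpha;
  }
}

template <typename T>
void LaunchScale(const T* in, T* out, int64_t rows, int64_t cols, T alpha,
                 cudaStream_t stream) {
  const int64_t n = rows * cols;
  if (n == 0) return;  // 0-size fast path: skip the launch entirely.

  const int64_t blocks = (n + 255) / 256;
  // Guard anything that must still be a 32-bit quantity (grid.x here).
  if (blocks > std::numeric_limits<int>::max()) {
    throw std::invalid_argument("grid.x exceeds INT_MAX");
  }
  scale_kernel<T><<<static_cast<unsigned int>(blocks), 256, 0, stream>>>(
      in, out, n, alpha);
}
```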
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/paddlefleet/_extensions/utils.h | memcpy helper parameters/indices changed to int64_t; function declarations reformatted |
| src/paddlefleet/_extensions/tokens_zip_unique_add.cu | In-kernel loop index changed to int64_t to avoid overflow when hidden_size is large |
| src/paddlefleet/_extensions/tokens_zip_prob.cu | Added INT_MAX checks for num_expert/topk and switched to int64_t intermediates for grid computation |
| src/paddlefleet/_extensions/tokens_unzip_slice.cu | Loop stride computed in int64_t; early return when there are 0 rows |
| src/paddlefleet/_extensions/tokens_unzip_gather.cu | Cleaned up the scale-shape reading logic, filled in the quanted_hidden_size computation for the no-scale case, and skipped kernel launches when there are no tokens |
| src/paddlefleet/_extensions/swiglu_kernel.cu | Added 0-size early return and a rows * input_dim <= INT_MAX check |
| src/paddlefleet/_extensions/router_metadata.cu | Added a num_tokens * K <= INT_MAX check, with explicit casts in some places |
| src/paddlefleet/_extensions/fuse_swiglu_scale.cu | forward/backward gain 0-size early returns and rows * hidden2 <= INT_MAX checks |
| src/paddlefleet/_extensions/fuse_stack_transpose_fp8_quant.cu | grid.x bounded by INT_MAX via an int64_t intermediate |
| src/paddlefleet/_extensions/filter_scores.cu | Added INT_MAX checks for total_elements/total_valid; grid_size computed in int64 and then safely cast |
| src/paddlefleet/_extensions/count_cumsum.cu | 128-bit load/store support larger index types, local loop indices changed to int64_t, and a new N == 0 early return |
```cpp
int64_t grid_x = (M / 128) * (K / 128);
PADDLE_ENFORCE_LE(
    grid_x,
    static_cast<int64_t>(std::numeric_limits<int>::max()),
    common::errors::InvalidArgument(
        "grid.x exceeds INT_MAX in fuse_stack_transpose_fp8_quant."));
```
The computation grid_x = (M/128)*(K/128) here uses int64_t multiplication; if M/K are extremely large, signed overflow (UB) can occur before the comparison, defeating the subsequent range check. Consider a safer multiplication check (for example, compute and compare using __int128, or first validate upper bounds for M/128 and K/128 separately before multiplying).
Suggested change:
```diff
-int64_t grid_x = (M / 128) * (K / 128);
-PADDLE_ENFORCE_LE(
-    grid_x,
-    static_cast<int64_t>(std::numeric_limits<int>::max()),
-    common::errors::InvalidArgument(
-        "grid.x exceeds INT_MAX in fuse_stack_transpose_fp8_quant."));
+int64_t m_blocks = M / 128;
+int64_t k_blocks = K / 128;
+__int128 grid_x_wide =
+    static_cast<__int128>(m_blocks) * static_cast<__int128>(k_blocks);
+PADDLE_ENFORCE_LE(
+    grid_x_wide,
+    static_cast<__int128>(std::numeric_limits<int>::max()),
+    common::errors::InvalidArgument(
+        "grid.x exceeds INT_MAX in fuse_stack_transpose_fp8_quant."));
+int64_t grid_x = static_cast<int64_t>(grid_x_wide);
```
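For reference, the comment's other alternative (validating each factor before multiplying) could look roughly like the sketch below. It assumes the surrounding fuse_stack_transpose_fp8_quant.cu context (M, K, and Paddle's enforce macros); kFactorLimit is an illustrative constant, not from the PR:

```cpp
// Sketch only: bound each factor so the int64_t product cannot overflow,
// then compare the product against INT_MAX as before.
int64_t m_blocks = M / 128;
int64_t k_blocks = K / 128;
// If each factor is <= 2^31, the product is <= 2^62 and fits in int64_t.
constexpr int64_t kFactorLimit = int64_t{1} << 31;
PADDLE_ENFORCE_LE(m_blocks, kFactorLimit,
                  common::errors::InvalidArgument("M / 128 is too large."));
PADDLE_ENFORCE_LE(k_blocks, kFactorLimit,
                  common::errors::InvalidArgument("K / 128 is too large."));
int64_t grid_x = m_blocks * k_blocks;
PADDLE_ENFORCE_LE(
    grid_x,
    static_cast<int64_t>(std::numeric_limits<int>::max()),
    common::errors::InvalidArgument(
        "grid.x exceeds INT_MAX in fuse_stack_transpose_fp8_quant."));
```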
```diff
 template <typename T>
 __device__ __forceinline__ void unrolled_memcpy(const T* src,
                                                 T* dst,
-                                                const int num_elements) {
+                                                const int64_t num_elements) {
 #pragma unroll
-  for (int idx = threadIdx.x; idx < num_elements; idx += blockDim.x) {
+  for (int64_t idx = threadIdx.x; idx < num_elements; idx += blockDim.x) {
```
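For context, a hedged sketch of how a helper like this is typically invoked, with one block copying one row; the caller name and layout are illustrative, not from utils.h:

```cpp
// Illustrative caller (not from the PR): num_elements is the per-row width,
// which can now safely be a large int64_t value.
template <typename T>
__global__ void copy_rows_kernel(const T* src, T* dst, int64_t row_width) {
  const int64_t row = blockIdx.x;
  unrolled_memcpy(src + row * row_width, dst + row * row_width, row_width);
}
```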
The PR title is currently "update", which does not follow the repository's required [CLASS]Title format (e.g. [BugFix] ...). The PR also does not provide a clear description (it should at least state why the change is made and what problem it solves). Please complete the PR metadata before merging, to ease later traceability and release notes.
```cpp
  cumsum_output = paddle::empty({0}, paddle::DataType::INT32, place);
}

if (N == 0) {
```
count_cumsum returns directly when N == 0, but at that point count_output / cumsum_output were created with paddle::empty and their contents are uninitialized; semantically the function should return all zeros (the kernel previously zeroed the outputs). Consider creating zeros/full(0) tensors in the N == 0 branch instead, or explicitly zeroing the outputs before returning.
Suggested change:
```diff
 if (N == 0) {
+  // For N == 0, semantics require all-zero outputs instead of uninitialized memory.
+  count_output.zero_();
+  if (do_cumsum) {
+    cumsum_output.zero_();
+  }
```
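The comment's first option (constructing zero-filled tensors in the N == 0 branch) could instead look roughly like this, assuming paddle::full is available alongside paddle::empty in the extension API; the output shape (num_bins) and the return statement are placeholders for whatever count_cumsum.cu actually uses:

```cpp
// Sketch only: build all-zero outputs instead of leaving paddle::empty
// tensors uninitialized. num_bins and the return value are placeholders.
if (N == 0) {
  count_output = paddle::full({num_bins}, 0, paddle::DataType::INT32, place);
  if (do_cumsum) {
    cumsum_output = paddle::full({num_bins}, 0, paddle::DataType::INT32, place);
  }
  return {count_output, cumsum_output};
}
```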
```diff
 int64_t zipped_rows = zipped_expertwise_rowmap_shape[0];
-int num_expert = zipped_expertwise_rowmap_shape[1];
-int topk = dispatched_indices_shape[1];
-PD_CHECK(unzipped_probs.size() == num_expert);
+int64_t num_expert = zipped_expertwise_rowmap_shape[1];
+int64_t topk = dispatched_indices_shape[1];
+PD_CHECK(num_expert <= static_cast<int64_t>(std::numeric_limits<int>::max()),
+         "num_expert must be <= INT_MAX for tokens_zip_prob.");
+PD_CHECK(topk <= static_cast<int64_t>(std::numeric_limits<int>::max()),
+         "topk must be <= INT_MAX for tokens_zip_prob.");
+PD_CHECK(unzipped_probs.size() == static_cast<size_t>(num_expert),
+         "unzipped_probs.size() must equal num_expert.");
+int num_expert_int = static_cast<int>(num_expert);
+int topk_int = static_cast<int>(topk);

 auto zipped_probs =
     paddle::empty({zipped_rows, topk}, dtype, unzipped_probs[0].place());
```
After checking unzipped_probs.size() == num_expert, tokens_zip_prob_impl still accesses unzipped_probs[0] directly to obtain the place; when num_expert == 0 (and unzipped_probs is empty) this is out of bounds, and the subsequent rowmap indexing does not hold either. Consider explicitly requiring num_expert > 0 (and topk > 0, if applicable), or returning a zipped_probs with an empty shape early in the 0-expert / 0-topk case.
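A minimal sketch of the first suggestion (rejecting the degenerate sizes before the place lookup); the error messages are illustrative:

```cpp
// Sketch only: guard the degenerate cases before dereferencing unzipped_probs[0].
PD_CHECK(num_expert > 0, "tokens_zip_prob requires num_expert > 0.");
PD_CHECK(topk > 0, "tokens_zip_prob requires topk > 0.");
auto zipped_probs =
    paddle::empty({zipped_rows, topk}, dtype, unzipped_probs[0].place());
```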
```diff
                                                  const int64_t num_elements) {
   constexpr int vector_size_in_bytes = 16;
-  const int elements_per_vector = vector_size_in_bytes / sizeof(T);
+  const int64_t elements_per_vector = vector_size_in_bytes / sizeof(T);

-  int num_vectors = num_elements / elements_per_vector;
-  int remaining_elements = num_elements % elements_per_vector;
+  int64_t num_vectors = num_elements / elements_per_vector;
+  int64_t remaining_elements = num_elements % elements_per_vector;

   using VecType = VectorType<T, elements_per_vector>;
   const VecType* src_vec = reinterpret_cast<const VecType*>(src);
   VecType* dst_vec = reinterpret_cast<VecType*>(dst);
```
In vectorized_memcpy, elements_per_vector is used as a non-type template argument (VectorType<T, elements_per_vector>), but it is currently a local const int64_t variable rather than constexpr, and the template parameter type is int. To avoid compilation failures or deduction issues across compilers/standards, consider changing it to constexpr int elements_per_vector = vector_size_in_bytes / sizeof(T); while keeping the subsequent num_vectors/remaining_elements/offset computations in int64_t.
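Concretely, the shape the comment suggests would be roughly the following (same logic as the diff above, with the template argument kept as a compile-time int; a sketch against the surrounding vectorized_memcpy context, not verified against the file):

```cpp
// Sketch only: keep the template argument a compile-time int,
// keep the runtime element math in int64_t.
constexpr int vector_size_in_bytes = 16;
constexpr int elements_per_vector = vector_size_in_bytes / sizeof(T);

int64_t num_vectors = num_elements / elements_per_vector;
int64_t remaining_elements = num_elements % elements_per_vector;

using VecType = VectorType<T, elements_per_vector>;
const VecType* src_vec = reinterpret_cast<const VecType*>(src);
VecType* dst_vec = reinterpret_cast<VecType*>(dst);
```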
wanghuancoder
left a comment
LGTM. The principle for this PR's changes: 1) where an ENFORCE/CHECK can intercept the case, don't switch to int64; 2) where it can't be intercepted, switch to int64; by inspection, the impact on kernel performance looks limited.
Got it.
This PR mainly fixes int32 overflow issues in the CUDA extensions for large-tensor support. The changes span several .cu files and utils.h:

count_cumsum.cu
- load_128_bits / store_128_bits gain an IdxT template parameter to support int64_t indexing (see the sketch after this list)
- N_vec and i changed to int64_t to avoid overflow for large N
- Early return when N == 0

filter_scores.cu
- gridDim.x * blockDim.x uses static_cast<int64_t> to avoid overflow
- grid_size is computed in int64_t first, then converted to int
- Added PD_CHECK that total_elements / total_valid do not exceed INT_MAX

fuse_stack_transpose_fp8_quant.cu
- grid_x computed with int64_t
- Added PADDLE_ENFORCE_LE to guarantee grid.x <= INT_MAX

fuse_swiglu_scale.cu / swiglu_kernel.cu
- Early return when rows == 0 or hidden_size == 0
- Added checks that rows * hidden2 / rows * input_dim <= INT_MAX

router_metadata.cu
- PADDLE_ENFORCE_LE replaces the TODO, checking num_tokens * K <= INT_MAX

tokens_unzip_gather.cu
- Cleaned up the quanted_hidden_size logic

tokens_unzip_slice.cu
- Stride computed as static_cast<int64_t>(blockDim.x) * gridDim.x
- Early return when total_zipped_rows == 0

tokens_zip_prob.cu
- num_expert and topk changed to int64_t, with INT_MAX checks added
- total_items and the grid computation now use int64_t before converting to int

tokens_zip_unique_add.cu
- In-kernel loop index changed to int64_t to avoid hidden_size overflow

utils.h
- num_elements of unrolled_memcpy, vectorized_memcpy, and try_vectorized_memcpy changed to int64_t
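As an illustration of the count_cumsum.cu item above, an IdxT-parameterized 128-bit load/store could look roughly like the sketch below; the real helpers in the file may have different signatures and alignment handling:

```cpp
// Sketch only (not the PR's actual code): the index type is a template
// parameter so callers can pass int64_t offsets without truncation.
template <typename T, typename IdxT>
__device__ __forceinline__ int4 load_128_bits(const T* ptr, IdxT idx) {
  return *reinterpret_cast<const int4*>(ptr + idx);
}

template <typename T, typename IdxT>
__device__ __forceinline__ void store_128_bits(T* ptr, IdxT idx,
                                               const int4& value) {
  *reinterpret_cast<int4*>(ptr + idx) = value;
}
```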