
Conversation

@jiashuy (Collaborator) commented Jan 26, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@jiashuy (Collaborator, Author) commented Jan 26, 2026

CI

@jiashuy (Collaborator, Author) commented Jan 26, 2026

import torch

mask = torch.zeros(16, dtype=torch.bool, device="cuda:0")
mask[3] = True
x = torch.zeros(16, dtype=torch.int64, device="cuda:0")
y = x[mask]  # boolean-mask indexing: the output size is data-dependent
y.sum()
torch.cuda.synchronize()

y = x[mask] will bring a d2h, and we need a customized kernel to eliminate it.
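A minimal sketch that makes this implicit sync visible, assuming a PyTorch build with CUDA sync debug mode (torch.cuda.set_sync_debug_mode):

import torch

# Warn whenever an op forces the host to synchronize with the device.
torch.cuda.set_sync_debug_mode("warn")

mask = torch.zeros(16, dtype=torch.bool, device="cuda:0")
mask[3] = True
x = torch.zeros(16, dtype=torch.int64, device="cuda:0")
y = x[mask]  # expected to warn here: the output size is data-dependent
torch.cuda.set_sync_debug_mode("default")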

Comment on lines 1811 to 1816
if insert_busy_mask.sum().item() != 0:
out_indices = indices[insert_busy_mask]
evicted_values[out_indices, :] = values.to(self.value_type())[
insert_busy_mask
]
indices[insert_busy_mask] = -1
@JacoCheung (Collaborator) commented Jan 26, 2026

We can remove the if statement (as well as the h2d), can't we?
If there are no busy indices, indices[insert_busy_mask] will return an empty tensor, and the following ops should be no-ops?
@jiashuy (Collaborator, Author):

We can remove the if statement, but indices[insert_busy_mask] will still bring a d2h, because torch's C++ backend uses CUB to do the masked select and synchronizes to get the size of out_indices.

So to remove the d2h thoroughly, we need a customized CUDA kernel.

@JacoCheung (Collaborator):

Could you please try indices.masked_fill_(insert_busy_mask, -1)?

@jiashuy (Collaborator, Author):

I don't know masked_fill_; can evicted_values also be filled using it?
If so, this will be simple.

@jiashuy (Collaborator, Author):

If it's possible not to build out_indices, it would be easier.

@JacoCheung (Collaborator) commented Jan 26, 2026

Oh sorry, I meant line 1816, not 1812.
I believe the indices[insert_busy_mask] load implies an inevitable d2h, so I now agree with you. But I'm not sure it's really necessary to remove the d2h. (Is the perf loss significant?)
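For reference, a minimal sketch of the line-1816 replacement being discussed (tensor shapes are illustrative, not the PR's code):

import torch

indices = torch.arange(8, device="cuda:0")
insert_busy_mask = torch.zeros(8, dtype=torch.bool, device="cuda:0")
insert_busy_mask[2] = True

# In-place equivalent of `indices[insert_busy_mask] = -1`, but it never
# builds out_indices: the scalar is written wherever the mask is True.
indices.masked_fill_(insert_busy_mask, -1)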

@jiashuy (Collaborator, Author):

@shijieliu and I think that the more d2h there is, the harder it is to pipeline the embedding's forward.
You can see there is still some d2h in the forward, but we don't want to add more.

@jiashuy (Collaborator, Author):

As for performance, I haven't tested it. But if we don't use the pipeline, I think it makes little difference here?
Insertion failures hardly ever happen.

@JacoCheung (Collaborator):

OK, I found a useful op:
src.masked_scatter_(mask, source)
https://docs.pytorch.org/docs/stable/generated/torch.Tensor.masked_scatter.html
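A small sketch of its semantics (values are illustrative); note that source elements are consumed in mask order, so it matches dst[mask] = src only when src already holds just the selected elements:

import torch

dst = torch.zeros(2, 3)
mask = torch.tensor([[True, False, True],
                     [False, True, False]])
src = torch.tensor([10.0, 20.0, 30.0])  # one element per True in the mask

dst.masked_scatter_(mask, src)
# dst is now [[10., 0., 20.],
#             [ 0., 30., 0.]]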

@jiashuy (Collaborator, Author) commented Jan 26, 2026

CI

)
evicted_scores = evicted_scores[0]

select_insert_failed_values(
@JacoCheung (Collaborator) commented Jan 26, 2026

Have you tried the masked_scatter_() operation? If it meets our requirements, I think we should adopt it for the sake of maintenance and robustness (unless the perf is really unsatisfactory).

@jiashuy (Collaborator, Author):

No, I haven't tried it.
I'm not sure whether it supports CUDA devices, the bfloat16 dtype, and multi-dimensional tensors.
To achieve the goal quickly, I implemented this fused kernel yesterday.
I will try masked_scatter_ and masked_fill_ in the future, maybe in another PR. What do you think? @JacoCheung
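A quick check sketch for exactly those three concerns (CUDA device, bfloat16, 2-D values); this is an illustrative toy, not the PR's code:

import torch

dst = torch.zeros(4, 8, dtype=torch.bfloat16, device="cuda:0")
mask = torch.tensor([True, False, True, False], device="cuda:0")
src = torch.randn(2, 8, dtype=torch.bfloat16, device="cuda:0")

# Broadcast the row mask across the embedding dim; src rows are consumed in order.
dst.masked_scatter_(mask.unsqueeze(-1), src)

ref = torch.zeros_like(dst)
ref[mask] = src  # reference via boolean indexing (this variant does sync)
torch.testing.assert_close(dst, ref)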

@JacoCheung (Collaborator):

(torch) It should support all of those cases. My intent is to shift the responsibility to PyTorch, so we have less to maintain (including compiling the unit).
I think it would be better to verify it now. 🚀

@jiashuy (Collaborator, Author) commented Jan 27, 2026

/review

@greptile-apps (bot) commented Jan 27, 2026

Greptile Summary

This PR fixes a critical bug where evicted values were incorrectly populated when insertion operations failed or were busy in insert_and_evict. The issue occurred because when an insertion failed with InsertResult::Busy, the index wasn't properly set to reference the evicted buffer position (out_id), making it impossible to copy the input values to the correct location in the evicted values buffer.

The fix consists of two parts:

  • Kernel change (kernels.cuh:424-425): When insertion fails with Busy status, the index is now set to out_id (the position in the evicted buffer) instead of remaining uninitialized
  • New CUDA kernel (dynamic_emb_op.cu): A new optimized select_insert_failed_values kernel replaces the previous Python-based approach, efficiently copying input values to the evicted buffer for failed insertions and setting indices to -1

The implementation includes both vectorized (Vec4) and non-vectorized variants for optimal performance across different embedding dimensions. Tests have been updated to properly handle the Busy state by setting indices to -1.
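A rough Python-level rendering of what the new kernel computes, per the summary above (busy_code stands in for the InsertResult::Busy enum value; names are assumptions, and the real kernel is a fused, vectorized CUDA implementation):

import torch

def select_insert_failed_values_ref(insert_results, indices, values,
                                    evicted_values, busy_code):
    # For each batch item whose insert result is Busy, indices[i] holds
    # out_id (a row in the evicted buffer): copy the input row there,
    # then mark the index as invalid.
    for i in range(insert_results.numel()):
        if insert_results[i].item() == busy_code:
            out_idx = indices[i]
            evicted_values[out_idx] = values[i]
            indices[i] = -1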

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The fix correctly addresses a critical bug with a clean implementation. The logic is sound: setting index = out_id when result == InsertResult::Busy ensures proper mapping between failed insertions and the evicted buffer. The new CUDA kernel is well-structured with proper bounds checking and follows existing patterns in the codebase. Tests have been updated to validate the fix.
  • No files require special attention

Important Files Changed

Filename — Overview
corelib/dynamicemb/src/table_operation/kernels.cuh — Added logic to store out_id in the index when the insert result is Busy, enabling proper evicted-value handling in the Python layer
corelib/dynamicemb/dynamicemb/key_value_table.py — Replaced the commented-out Python-based insertion-failure handling with the new CUDA kernel select_insert_failed_values for better performance
corelib/dynamicemb/src/dynamic_emb_op.cu — Implemented the new select_insert_failed_values CUDA kernel with vectorized and non-vectorized variants to copy input values to the evicted buffer for failed insertions
corelib/dynamicemb/test/unit_tests/table_operation/test_table_operation.py — Updated tests to set indices[insert_busy_mask] = -1 after insert_and_evict operations to handle insertion failures

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant KVT as DynamicEmbeddingTable
    participant KIM as KeyIndexMap
    participant Kernel as table_insert_and_evict_kernel
    participant Select as select_insert_failed_values
    participant Load as load_from_combined_table
    
    User->>KVT: insert_and_evict(keys, values)
    KVT->>KVT: Allocate insert_results tensor
    KVT->>KIM: insert_and_evict(keys, indices, insert_results)
    KIM->>Kernel: Launch CUDA kernel
    
    Note over Kernel: For each key insertion attempt
    alt Insert succeeds
        Kernel->>Kernel: Set index = bucket_id * capacity + iter
        Kernel->>Kernel: Set insert_results[i] = Insert/Reclaim/Assign/Evict
    else Insert fails (Busy)
        Kernel->>Kernel: Compute out_id from evicted_counter
        Kernel->>Kernel: Set index = out_id (NEW FIX)
        Kernel->>Kernel: Set insert_results[i] = Busy
        Kernel->>Kernel: Store key in evicted_keys[out_id]
    end
    
    Kernel-->>KIM: Return evicted data
    KIM-->>KVT: Return num_evicted, evicted_keys, evicted_indices
    
    KVT->>Select: select_insert_failed_values(insert_results, indices, values, evicted_values)
    
    Note over Select: CUDA kernel processes failed insertions
    loop For each batch item with Busy status
        Select->>Select: Read out_idx = indices[emb_id]
        Select->>Select: Copy values[emb_id] to evicted_values[out_idx]
        Select->>Select: Set indices[emb_id] = -1
    end
    
    Select-->>KVT: Return (indices updated)
    
    KVT->>Load: load_from_combined_table(evicted_indices, evicted_values)
    Load-->>KVT: Load complete
    KVT-->>User: Return success
