
Conversation

@jiashuy (Collaborator) commented Jan 26, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@jiashuy (Collaborator, Author) commented Jan 26, 2026

CI

@jiashuy (Collaborator, Author) commented Jan 26, 2026

import torch

mask = torch.zeros(16, dtype=torch.bool, device="cuda:0")
mask[3] = True
x = torch.zeros(16, dtype=torch.int64, device="cuda:0")
y = x[mask]  # boolean-mask indexing: the output size is data-dependent
y.sum()
torch.cuda.synchronize()

y = x[mask] will bring a d2h, and we need a customized kernel to eliminate it.
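A minimal sketch that makes this implicit sync visible, assuming a PyTorch build with CUDA sync debug mode (torch.cuda.set_sync_debug_mode):

import torch

# Warn whenever an op forces the host to synchronize with the device.
torch.cuda.set_sync_debug_mode("warn")

mask = torch.zeros(16, dtype=torch.bool, device="cuda:0")
mask[3] = True
x = torch.zeros(16, dtype=torch.int64, device="cuda:0")
y = x[mask]  # expected to warn here: the output size is data-dependent
torch.cuda.set_sync_debug_mode("default")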

Comment on lines 1811 to 1816
if insert_busy_mask.sum().item() != 0:
out_indices = indices[insert_busy_mask]
evicted_values[out_indices, :] = values.to(self.value_type())[
insert_busy_mask
]
indices[insert_busy_mask] = -1
@JacoCheung (Collaborator) commented Jan 26, 2026

We can remove the if statement (as well as the h2d), can't we?
If there are no busy indices, indices[insert_busy_mask] will return an empty tensor, and the following ops should be no-ops?
@jiashuy (Collaborator, Author):

We can remove the if statement, but indices[insert_busy_mask] will still bring a d2h, because torch's C++ backend uses CUB to do the masked select and synchronizes to get the size of out_indices.

So to remove the d2h thoroughly, we need a customized CUDA kernel.

@JacoCheung (Collaborator):

Could you please try indices.masked_fill_(insert_busy_mask, -1)?

@jiashuy (Collaborator, Author):

I don't know masked_fill_; can evicted_values also be filled using it?
If so, this will be simple.

@jiashuy (Collaborator, Author):

If it's possible not to build out_indices, it would be easier.

@JacoCheung (Collaborator) commented Jan 26, 2026

Oh sorry, I meant line 1816, not 1812.
I believe the indices[insert_busy_mask] load implies an inevitable d2h, so I now agree with you. But I'm not sure it's really necessary to remove the d2h. (Is the perf loss significant?)
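For reference, a minimal sketch of the line-1816 replacement being discussed (tensor shapes are illustrative, not the PR's code):

import torch

indices = torch.arange(8, device="cuda:0")
insert_busy_mask = torch.zeros(8, dtype=torch.bool, device="cuda:0")
insert_busy_mask[2] = True

# In-place equivalent of `indices[insert_busy_mask] = -1`, but it never
# builds out_indices: the scalar is written wherever the mask is True.
indices.masked_fill_(insert_busy_mask, -1)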

@jiashuy (Collaborator, Author):

@shijieliu and I think that the more d2h there is, the harder it is to pipeline the embedding's forward.
You can see there is still some d2h in the forward, but we don't want to add more.

@jiashuy (Collaborator, Author):

As for performance, I haven't tested it. But if we don't use the pipeline, I think it makes little difference here?
Insertion failures hardly ever happen.

@JacoCheung (Collaborator):

OK, I found a useful op:
src.masked_scatter_(mask, source)
https://docs.pytorch.org/docs/stable/generated/torch.Tensor.masked_scatter.html
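A small sketch of its semantics (values are illustrative); note that source elements are consumed in mask order, so it matches dst[mask] = src only when src already holds just the selected elements:

import torch

dst = torch.zeros(2, 3)
mask = torch.tensor([[True, False, True],
                     [False, True, False]])
src = torch.tensor([10.0, 20.0, 30.0])  # one element per True in the mask

dst.masked_scatter_(mask, src)
# dst is now [[10., 0., 20.],
#             [ 0., 30., 0.]]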

@jiashuy (Collaborator, Author) commented Jan 26, 2026

CI

)
evicted_scores = evicted_scores[0]

select_insert_failed_values(
@JacoCheung (Collaborator) commented Jan 26, 2026

Have you tried the masked_scatter_() operation? If it meets our requirements, I think we should adopt it for the sake of maintenance and robustness (unless the perf is really unsatisfactory).

@jiashuy (Collaborator, Author):

No, I haven't tried it.
I'm not sure whether it supports CUDA devices, the bfloat16 dtype, and multi-dimensional tensors.
To achieve the goal quickly, I implemented this fused kernel yesterday.
I will try masked_scatter_ and masked_fill_ in the future, maybe in another PR. What do you think? @JacoCheung
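A quick check sketch for exactly those three concerns (CUDA device, bfloat16, 2-D values); this is an illustrative toy, not the PR's code:

import torch

dst = torch.zeros(4, 8, dtype=torch.bfloat16, device="cuda:0")
mask = torch.tensor([True, False, True, False], device="cuda:0")
src = torch.randn(2, 8, dtype=torch.bfloat16, device="cuda:0")

# Broadcast the row mask across the embedding dim; src rows are consumed in order.
dst.masked_scatter_(mask.unsqueeze(-1), src)

ref = torch.zeros_like(dst)
ref[mask] = src  # reference via boolean indexing (this variant does sync)
torch.testing.assert_close(dst, ref)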

@JacoCheung (Collaborator):

(torch) It should support all of those cases. My intent is to shift the responsibility to PyTorch, so we have less to maintain (including compiling the unit).
I think it would be better to verify it now. 🚀

@jiashuy (Collaborator, Author) commented Jan 27, 2026

/review

@greptile-apps (bot) commented Jan 27, 2026

Greptile Summary

This PR fixes a critical bug where evicted values were incorrectly populated when insertion operations failed or were busy in insert_and_evict. The issue occurred because when an insertion failed with InsertResult::Busy, the index wasn't properly set to reference the evicted buffer position (out_id), making it impossible to copy the input values to the correct location in the evicted values buffer.

The fix consists of two parts:

  • Kernel change (kernels.cuh:424-425): When insertion fails with Busy status, the index is now set to out_id (the position in the evicted buffer) instead of remaining uninitialized
  • New CUDA kernel (dynamic_emb_op.cu): A new optimized select_insert_failed_values kernel replaces the previous Python-based approach, efficiently copying input values to the evicted buffer for failed insertions and setting indices to -1

The implementation includes both vectorized (Vec4) and non-vectorized variants for optimal performance across different embedding dimensions. Tests have been updated to properly handle the Busy state by setting indices to -1.
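A rough Python-level rendering of what the new kernel computes, per the summary above (busy_code stands in for the InsertResult::Busy enum value; names are assumptions, and the real kernel is a fused, vectorized CUDA implementation):

import torch

def select_insert_failed_values_ref(insert_results, indices, values,
                                    evicted_values, busy_code):
    # For each batch item whose insert result is Busy, indices[i] holds
    # out_id (a row in the evicted buffer): copy the input row there,
    # then mark the index as invalid.
    for i in range(insert_results.numel()):
        if insert_results[i].item() == busy_code:
            out_idx = indices[i]
            evicted_values[out_idx] = values[i]
            indices[i] = -1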

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The fix correctly addresses a critical bug with a clean implementation. The logic is sound: setting index = out_id when result == InsertResult::Busy ensures proper mapping between failed insertions and the evicted buffer. The new CUDA kernel is well-structured with proper bounds checking and follows existing patterns in the codebase. Tests have been updated to validate the fix.
  • No files require special attention

Important Files Changed

Filename — Overview
corelib/dynamicemb/src/table_operation/kernels.cuh — Added logic to store out_id in the index when the insert result is Busy, enabling proper evicted-value handling in the Python layer
corelib/dynamicemb/dynamicemb/key_value_table.py — Replaced the commented-out Python-based insertion-failure handling with the new CUDA kernel select_insert_failed_values for better performance
corelib/dynamicemb/src/dynamic_emb_op.cu — Implemented the new select_insert_failed_values CUDA kernel with vectorized and non-vectorized variants to copy input values to the evicted buffer for failed insertions
corelib/dynamicemb/test/unit_tests/table_operation/test_table_operation.py — Updated tests to set indices[insert_busy_mask] = -1 after insert_and_evict operations to handle insertion failures

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant KVT as DynamicEmbeddingTable
    participant KIM as KeyIndexMap
    participant Kernel as table_insert_and_evict_kernel
    participant Select as select_insert_failed_values
    participant Load as load_from_combined_table
    
    User->>KVT: insert_and_evict(keys, values)
    KVT->>KVT: Allocate insert_results tensor
    KVT->>KIM: insert_and_evict(keys, indices, insert_results)
    KIM->>Kernel: Launch CUDA kernel
    
    Note over Kernel: For each key insertion attempt
    alt Insert succeeds
        Kernel->>Kernel: Set index = bucket_id * capacity + iter
        Kernel->>Kernel: Set insert_results[i] = Insert/Reclaim/Assign/Evict
    else Insert fails (Busy)
        Kernel->>Kernel: Compute out_id from evicted_counter
        Kernel->>Kernel: Set index = out_id (NEW FIX)
        Kernel->>Kernel: Set insert_results[i] = Busy
        Kernel->>Kernel: Store key in evicted_keys[out_id]
    end
    
    Kernel-->>KIM: Return evicted data
    KIM-->>KVT: Return num_evicted, evicted_keys, evicted_indices
    
    KVT->>Select: select_insert_failed_values(insert_results, indices, values, evicted_values)
    
    Note over Select: CUDA kernel processes failed insertions
    loop For each batch item with Busy status
        Select->>Select: Read out_idx = indices[emb_id]
        Select->>Select: Copy values[emb_id] to evicted_values[out_idx]
        Select->>Select: Set indices[emb_id] = -1
    end
    
    Select-->>KVT: Return (indices updated)
    
    KVT->>Load: load_from_combined_table(evicted_indices, evicted_values)
    Load-->>KVT: Load complete
    KVT-->>User: Return success
