Skip to content

Conversation

@jiashuy
Copy link
Collaborator

@jiashuy jiashuy commented Jan 27, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@jiashuy jiashuy added the dynamicemb Related with dynamicemb label Jan 27, 2026
@greptile-apps
Copy link

greptile-apps bot commented Jan 27, 2026

Greptile Summary

This PR implements dynamic resizing for embedding tables with significant architectural changes:

Key Changes:

  • Refactored from dual-table architecture (dev_table/uvm_table split) to unified single-table design (self._table)
  • Implemented VMM (Virtual Memory Management) based extendable buffers for both device and host memory
  • Added NO_EVICTION mode that triggers automatic rehashing when capacity is exceeded
  • Simplified CUDA kernels by removing split_index logic throughout optimizer and table operations
  • All optimizer functions renamed from *_for_combined_table to *_for_table

Critical Issue Found:

  • Line 1317 in key_value_table.py contains a typo: torch.arrange should be torch.arange. This will cause immediate runtime failure when NO_EVICTION mode generates scores.

Architecture Improvements:

  • VMM implementation allows dynamic memory extension while preserving base addresses
  • Cleaner abstraction with ExtendableBuffer interface
  • Reduced code complexity by unifying memory handling paths

Concerns:

  • PR is marked as WIP with incomplete checklist (no tests confirmation, no documentation updates)
  • The rehashing logic in NO_EVICTION mode could lead to unbounded recursion if memory allocation fails repeatedly
  • Multiple print statements for debugging suggest production-readiness concerns

Confidence Score: 1/5

  • This PR contains a critical syntax error that will cause runtime failures
  • Score of 1 reflects the presence of a critical bug (torch.arrange typo) that will immediately break NO_EVICTION mode functionality. While the architectural refactoring appears sound, the syntax error must be fixed before merge.
  • Pay close attention to corelib/dynamicemb/dynamicemb/key_value_table.py which contains the critical typo on line 1317, and verify the NO_EVICTION mode logic thoroughly once fixed.

Important Files Changed

Filename Overview
corelib/dynamicemb/dynamicemb/key_value_table.py Major refactoring from dual-table (dev/uvm) to single table architecture. Contains critical typo: torch.arrange instead of torch.arange on line 1317 that will cause runtime errors.
corelib/dynamicemb/src/vmm_tensor.cu New file implementing VMM-based memory management for device and host memory. Includes proper error handling and memory cleanup in destructors.
corelib/dynamicemb/src/dynamic_emb_op.cu Simplified from combined table (dev/uvm split) to single table operations. Removed split_index logic and consolidated load/store operations.
corelib/dynamicemb/src/optimizer.cu Refactored optimizer functions from *_for_combined_table to *_for_table, removing dev/uvm split logic. Cleaner single-table interface.
corelib/dynamicemb/dynamicemb/batched_dynamicemb_tables.py Added NO_EVICTION score strategy support and improved cache decision logic with better HBM budget handling.

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant Table as DynamicEmbeddingTable
    participant KVMap as KeyIndexMap (Hash Table)
    participant VMM as VMMTensor/HostVMMTensor
    participant Buffer as ExtendableBuffer
    participant CUDA as CUDA/CU Driver API

    Note over App,CUDA: Initialization Flow
    App->>Table: Create table with capacity
    Table->>Buffer: Create VmmDeviceBuffer/RegisteredHostBuffer
    Buffer->>VMM: Initialize VMMTensor(capacity, dtype, device)
    VMM->>CUDA: cuMemAddressReserve (reserve VA space)
    VMM->>CUDA: cuMemCreate + cuMemMap (allocate initial pages)
    VMM->>CUDA: cuMemSetAccess (set permissions)
    VMM-->>Buffer: Return tensor wrapper
    Table->>KVMap: Create hash table with capacity
    
    Note over App,CUDA: Insert Flow (NO_EVICTION Mode)
    App->>Table: insert(keys, values)
    Table->>Table: Generate scores = arange(valid_rows, valid_rows+n)
    Table->>KVMap: insert_and_evict(keys, scores)
    alt Hash table full (eviction occurs)
        KVMap-->>Table: Return evicted_keys, evicted_scores
        Table->>Table: load_from_table(evicted_scores, evicted_values)
        Table->>Table: rehash(capacity * 2)
        Table->>KVMap: Create new hash table (2x capacity)
        Table->>Table: Re-insert all entries to new hash table
        Table->>Buffer: extend(capacity * 2)
        Buffer->>VMM: extend(new_capacity)
        VMM->>CUDA: cuMemCreate + cuMemMap (map new pages)
        VMM->>CUDA: cuMemSetAccess (set new page permissions)
        Table->>Table: store_to_table(scores, values)
        Table->>Table: Recursive insert(evicted_keys, evicted_values)
    else No eviction
        Table->>Table: store_to_table(scores, values)
    end
    
    Note over App,CUDA: Lookup and Update Flow
    App->>Table: update(keys, grads)
    Table->>KVMap: lookup(keys) → indices
    Table->>CUDA: optimizer_update_kernel(grads, table, indices)
    CUDA->>VMM: Read/write embedding table via mapped memory
    
    Note over App,CUDA: Memory Extension
    Note right of VMM: All extensions preserve base address
    VMM->>CUDA: Map additional pages at offset
    CUDA-->>VMM: Extended virtual address range
Loading

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

19 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

policy = ScorePolicy.ASSIGN

if self._no_eviction and scores is None:
scores = torch.arrange(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syntax: torch.arrange is not a valid PyTorch function, should be torch.arange

Suggested change
scores = torch.arrange(
scores = torch.arange(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dynamicemb Related with dynamicemb

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant