Optimized vector generation for VDB Benchmark by idevasena · Pull Request #227 · mlcommons/storage

idevasena · 2026-01-21T15:12:09Z

VDB Benchmark - Enhanced Vector Loader

Overview

The load_vdb.py script loads synthetic vectors into a Milvus vector database for benchmarking purposes. This enhanced version introduces CPU and memory optimizations while preserving backward compatibility with the original implementation.

Architecture Changes

High-Level Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                           load_vdb.py (Enhanced)                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌──────────────────────────────────────────────────┐   │
│  │   CLI Args  │───▶│              Mode Selection                      │   │
│  │  + Config   │    │  ┌──────────┬──────────────┬─────────────────┐   │   │
│  └─────────────┘    │  │ Standard │   Adaptive   │   Disk-Backed   │   │   │
│                     │  │  (default)│  (--adaptive)│  (--disk-backed)│   │   │
│                     │  └────┬─────┴──────┬───────┴────────┬────────┘   │   │
│                     └───────┼────────────┼────────────────┼────────────┘   │
│                             ▼            ▼                ▼                │
│                     ┌───────────────────────────────────────────────┐      │
│                     │           Vector Generation Engine            │      │
│                     │  • Seeded RNG (reproducibility)               │      │
│                     │  • NumPy float32 arrays                       │      │
│                     │  • L2 normalization                           │      │
│                     └───────────────────────────────────────────────┘      │
│                                          │                                 │
│                             ┌────────────┼────────────┐                    │
│                             ▼            ▼            ▼                    │
│                     ┌─────────────┐ ┌─────────┐ ┌───────────┐              │
│                     │   Chunked   │ │ Adaptive│ │  Mmap'd   │              │
│                     │   In-Memory │ │ Batching│ │   Disk    │              │
│                     │   Buffer    │ │Controller│ │  Buffer   │              │
│                     └──────┬──────┘ └────┬────┘ └─────┬─────┘              │
│                            └─────────────┼───────────┘                     │
│                                          ▼                                 │
│                     ┌───────────────────────────────────────────────┐      │
│                     │              Milvus Insertion                 │      │
│                     │  • Batch insert with progress tracking        │      │
│                     │  • Memory monitoring (psutil)                 │      │
│                     │  • Periodic garbage collection                │      │
│                     └───────────────────────────────────────────────┘      │
│                                          │                                 │
│                                          ▼                                 │
│                     ┌───────────────────────────────────────────────┐      │
│                     │         Post-Load Operations                  │      │
│                     │  • Flush → Index Build Monitor → Compact      │      │
│                     └───────────────────────────────────────────────┘      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Component Details

1. Memory Management Utilities (New)

AdaptiveBatchController

Dynamically adjusts batch sizes based on real-time memory pressure.

┌─────────────────────────────────────────────────────────────┐
│                 AdaptiveBatchController                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Memory Threshold: 80%                                      │
│                                                             │
│  ┌─────────────────┬────────────────────────────────────┐   │
│  │  Memory Usage   │  Action                            │   │
│  ├─────────────────┼────────────────────────────────────┤   │
│  │  > 80%          │  Scale down by 50%                 │   │
│  │  < 55%          │  Scale up by 25% (after 10 batches)│   │
│  │  55% - 80%      │  Maintain current size             │   │
│  └─────────────────┴────────────────────────────────────┘   │
│                                                             │
│  Bounds: [min_batch_size, max_batch_size]                   │
│  Default: [batch_size/20, batch_size*5]                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Key Methods:

get_batch_size(): Returns the current batch size, scaling it down when memory usage exceeds the threshold and scaling it up when usage stays below 55%
force_scale_down(): Emergency reduction after an insertion error
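
The scaling rules above can be sketched as follows. This is a minimal illustration, not the PR's actual class: the constructor signature, the injected `read_mem_percent` callable, and the `adjustments` counter are assumptions made so the logic can be exercised without psutil.

```python
import gc
from typing import Callable


class AdaptiveBatchController:
    """Hypothetical sketch of the controller described above."""

    def __init__(self, batch_size: int, read_mem_percent: Callable[[], float],
                 threshold: float = 80.0, low_water: float = 55.0,
                 cooldown: int = 10):
        self.current = batch_size
        self.min_size = max(1, batch_size // 20)  # default lower bound
        self.max_size = batch_size * 5            # default upper bound
        self.read_mem_percent = read_mem_percent  # assumed: returns 0-100
        self.threshold = threshold
        self.low_water = low_water
        self.cooldown = cooldown
        self.batches_since_change = 0
        self.adjustments = 0

    def _set(self, new_size: int) -> None:
        # Clamp to [min_size, max_size] and restart the cooldown window.
        new_size = min(self.max_size, max(self.min_size, new_size))
        if new_size != self.current:
            self.current = new_size
            self.adjustments += 1
        self.batches_since_change = 0

    def get_batch_size(self) -> int:
        mem = self.read_mem_percent()
        self.batches_since_change += 1
        if mem > self.threshold:
            gc.collect()                          # free what we can first
            self._set(self.current // 2)          # scale down by 50%
        elif mem < self.low_water and self.batches_since_change >= self.cooldown:
            self._set(int(self.current * 1.25))   # scale up by 25%
        return self.current

    def force_scale_down(self) -> int:
        # Emergency halving after an insertion error.
        self._set(self.current // 2)
        return self.current
```

In a real run the callable would typically wrap `psutil.virtual_memory().percent`; passing it in keeps the controller testable with scripted readings.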

DiskBackedBuffer

Memory-mapped file buffer for datasets exceeding available RAM.

┌─────────────────────────────────────────────────────────────┐
│                    DiskBackedBuffer                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Phase 1: Generate → Disk                                   │
│  ┌──────────┐    ┌──────────────────────────────────┐      │
│  │ generate │───▶│  Memory-Mapped File (.mmap)      │      │
│  │ vectors  │    │  Size: num_vectors × dim × 4 bytes│      │
│  └──────────┘    └──────────────────────────────────┘      │
│                              │                              │
│  Phase 2: Disk → Database    │                              │
│                              ▼                              │
│  ┌──────────────────────────────────────────────┐          │
│  │  read_batch() ──▶ Milvus insert()            │          │
│  │  (streaming, no full dataset in memory)      │          │
│  └──────────────────────────────────────────────┘          │
│                                                             │
│  Cleanup: Auto-delete temp file on exit                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘

File Layout:

Offset 0                                      Offset N×D×4
┌─────────┬─────────┬─────────┬─────────┬─────────┐
│Vector 0 │Vector 1 │Vector 2 │   ...   │Vec N-1  │
│ D×4 B   │ D×4 B   │ D×4 B   │         │ D×4 B   │
└─────────┴─────────┴─────────┴─────────┴─────────┘
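
A buffer with this layout can be sketched with `numpy.memmap`, which maps a flat file of `num_vectors × dim` float32 values as a 2-D array. The class below is an assumption-laden illustration: the method names follow the diagrams, but the constructor arguments and cleanup details are hypothetical and may differ from the PR.

```python
import os
import tempfile

import numpy as np


class DiskBackedBuffer:
    """Hypothetical sketch: float32 vectors stored row-major in a temp file."""

    def __init__(self, dim, num_vectors, temp_dir=None):
        self.dim = dim
        self.num_vectors = num_vectors
        fd, self.path = tempfile.mkstemp(suffix=".mmap", dir=temp_dir)
        os.close(fd)
        # File size becomes num_vectors * dim * 4 bytes.
        self._mmap = np.memmap(self.path, dtype=np.float32, mode="w+",
                               shape=(num_vectors, dim))

    def write_batch(self, vectors, offset):
        # Vector i lives at byte offset i * dim * 4.
        self._mmap[offset:offset + len(vectors)] = vectors

    def read_batch(self, start, count):
        # Copy out of the mapping so only this batch is held in RAM.
        return np.array(self._mmap[start:start + count])

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self._mmap.flush()
        del self._mmap  # drop the mapping before deleting the file
        try:
            os.remove(self.path)
        except OSError as err:
            print(f"warning: could not remove {self.path}: {err}")
        return False
```

Usage follows the two phases: write batches at increasing offsets inside a `with` block, then call `read_batch` in a second loop and hand each batch to the insert call.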

2. Vector Generation Engine (Enhanced)

Original vs Enhanced Comparison

Aspect	Original	Enhanced
RNG	`np.random.random()`	`np.random.default_rng()`
Reproducibility	None	Seed + batch_index
Output Type	`list` (via `.tolist()`)	`np.ndarray` (float32)
Intermediate Type	`float16`	`float32`
Normalization	Yes	Yes (with zero-division protection)

Seeded Generation Flow

┌─────────────────────────────────────────────────────────────┐
│                  Reproducible Generation                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  seed = 42, batch_index = 0                                │
│       │                                                     │
│       ▼                                                     │
│  ┌─────────────────────────────────────┐                   │
│  │ rng = default_rng(seed + batch_index)│                   │
│  │     = default_rng(42 + 0)           │                   │
│  │     = default_rng(42)               │                   │
│  └─────────────────────────────────────┘                   │
│       │                                                     │
│       ▼                                                     │
│  Batch 0: rng(42)  → deterministic vectors                 │
│  Batch 1: rng(43)  → different but reproducible            │
│  Batch 2: rng(44)  → different but reproducible            │
│       ...                                                   │
│                                                             │
│  Re-run with same seed → identical dataset                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘
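
The seeding scheme can be sketched as below. The function name matches the diagrams, but its signature is an assumption; the body only illustrates `default_rng(seed + batch_index)`, float32 output, and L2 normalization with a guard against zero norms.

```python
import numpy as np


def generate_vectors(num_vectors, dim, seed=None, batch_index=0):
    """Hypothetical sketch: reproducible, L2-normalized float32 vectors."""
    # With no seed, fall back to OS entropy (not reproducible).
    rng = np.random.default_rng(None if seed is None else seed + batch_index)
    vectors = rng.random((num_vectors, dim), dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # zero-division protection
    return vectors / norms
```

Two consequences of this scheme are worth keeping in mind: a re-run reproduces the dataset only if the batch size is also unchanged, and batch 1 of seed 42 draws from the same stream as batch 0 of seed 43.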

3. Execution Modes

Mode Selection Logic

if args.disk_backed:
    mode = "disk-backed"      # Lowest memory, two-phase
elif args.adaptive:
    mode = "adaptive"         # Dynamic batch sizing
else:
    mode = "standard"         # Original behavior (chunked)

Standard Mode (Default)

Preserves the original chunked approach and adds memory optimizations.

┌─────────────────────────────────────────────────────────────┐
│                     Standard Mode                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  if num_vectors > chunk_size:                              │
│      ┌─────────────────────────────────────────┐           │
│      │  for each chunk (default 1M vectors):   │           │
│      │    1. generate_vectors(chunk_size)      │           │
│      │    2. insert_data(chunk_vectors)        │           │
│      │    3. del chunk_vectors                 │           │
│      │    4. gc.collect()                      │           │
│      └─────────────────────────────────────────┘           │
│  else:                                                      │
│      ┌─────────────────────────────────────────┐           │
│      │  insert_data_standard():                │           │
│      │    • Generate + insert per batch        │           │
│      │    • gc.collect() every 50 batches      │           │
│      └─────────────────────────────────────────┘           │
│                                                             │
└─────────────────────────────────────────────────────────────┘
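
The chunked branch can be sketched as a loop that generates one chunk, inserts it in batches, and frees it before moving on. Everything here is illustrative: `load_in_chunks` and the `insert_batch` callback are hypothetical names standing in for the script's own functions and the Milvus insert call.

```python
import gc

import numpy as np


def load_in_chunks(num_vectors, dim, chunk_size, batch_size, insert_batch, seed=0):
    """Hypothetical sketch of the chunked generate-then-insert loop."""
    loaded = 0
    for chunk_idx, chunk_start in enumerate(range(0, num_vectors, chunk_size)):
        n = min(chunk_size, num_vectors - chunk_start)
        rng = np.random.default_rng(seed + chunk_idx)
        chunk = rng.random((n, dim), dtype=np.float32)  # 1. generate chunk
        for b in range(0, n, batch_size):               # 2. insert in batches
            batch = chunk[b:b + batch_size]
            first_id = chunk_start + b
            insert_batch(list(range(first_id, first_id + len(batch))), batch)
            loaded += len(batch)
        del batch, chunk                                # 3. drop references
        gc.collect()                                    # 4. reclaim memory now
    return loaded
```

Only one chunk is resident at a time, so peak memory is bounded by the chunk size rather than the dataset size.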

Adaptive Mode

Memory-pressure-aware execution with automatic batch scaling.

┌─────────────────────────────────────────────────────────────┐
│                     Adaptive Mode                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  while vectors_loaded < num_vectors:                │   │
│  │                                                     │   │
│  │    batch_size = controller.get_batch_size()        │   │
│  │         │                                          │   │
│  │         ├── Check psutil.virtual_memory()          │   │
│  │         ├── Scale down if > 80% used               │   │
│  │         └── Scale up if < 55% (after cooldown)     │   │
│  │                                                     │   │
│  │    try:                                            │   │
│  │        generate → insert → update count            │   │
│  │    except Error:                                   │   │
│  │        controller.force_scale_down()               │   │
│  │        continue  # retry with smaller batch        │   │
│  │                                                     │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  Output includes: batch_adjustments, errors count          │
│                                                             │
└─────────────────────────────────────────────────────────────┘
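
The retry path in the loop above can be sketched as follows. The names `adaptive_load` and `insert_batch` are hypothetical, and the bounded `max_retries` is an added assumption: the diagram shows an unconditional `continue`, which would spin forever if inserts keep failing.

```python
import numpy as np


def adaptive_load(num_vectors, dim, batch_size, insert_batch,
                  min_batch=1, max_retries=5):
    """Hypothetical sketch: halve the batch on insert errors, then retry."""
    loaded = errors = retries = 0
    while loaded < num_vectors:
        count = min(batch_size, num_vectors - loaded)
        rng = np.random.default_rng(loaded)  # illustrative seeding only
        vectors = rng.random((count, dim), dtype=np.float32)
        try:
            insert_batch(list(range(loaded, loaded + count)), vectors)
        except Exception:
            errors += 1
            retries += 1
            if retries > max_retries:
                raise                        # give up instead of looping forever
            batch_size = max(min_batch, batch_size // 2)
            continue                         # retry with a smaller batch
        loaded += count
        retries = 0
    return {"loaded": loaded, "errors": errors, "final_batch_size": batch_size}
```

The count is advanced only after a successful insert, so a failed batch is regenerated at the smaller size rather than skipped.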

Disk-Backed Mode

Two-phase approach for billion-scale datasets on memory-constrained systems.

┌─────────────────────────────────────────────────────────────┐
│                   Disk-Backed Mode                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Phase 1: Generate to Disk                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  with DiskBackedBuffer(dim, num_vectors) as buf:    │   │
│  │                                                     │   │
│  │    for batch_idx in range(num_batches):            │   │
│  │        vectors = generate_vectors(batch_size)       │   │
│  │        buf.write_batch(vectors, offset)            │   │
│  │        del vectors  # immediate cleanup             │   │
│  │                                                     │   │
│  └─────────────────────────────────────────────────────┘   │
│                         │                                   │
│                         ▼                                   │
│  Phase 2: Stream to Database                                │
│  ┌─────────────────────────────────────────────────────┐   │
│  │    for start_id in range(0, num_vectors, batch):   │   │
│  │        vectors = buf.read_batch(start_id, count)    │   │
│  │        collection.insert([ids, vectors.tolist()])   │   │
│  │                                                     │   │
│  └─────────────────────────────────────────────────────┘   │
│                         │                                   │
│                         ▼                                   │
│  Cleanup: buf.__exit__() deletes temp file                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

4. Memory Optimization Strategies

Garbage Collection Points

# After each batch (all modes)
del vectors, data

# Periodic deep collection (every 50 batches)
if batch_idx % 50 == 0:
    gc.collect()

# After chunk completion (standard mode)
del chunk_vectors
gc.collect()

# After memory pressure detection (adaptive mode)
if mem_percent > threshold:
    gc.collect()

Memory Monitoring Integration

┌─────────────────────────────────────────────────────────────┐
│                  psutil Integration                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Optional dependency (graceful degradation)                │
│                                                             │
│  if PSUTIL_AVAILABLE:                                      │
│      • Log memory % in progress reports                    │
│      • Enable adaptive batch scaling                       │
│      • Report available RAM at startup                     │
│  else:                                                      │
│      • Use default values (50% usage, 8GB available)       │
│      • Adaptive mode works but without dynamic scaling     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
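
The graceful-degradation pattern can be sketched with a guarded import. `PSUTIL_AVAILABLE` and the fallback values come from the diagram above; the helper function names are hypothetical.

```python
try:
    import psutil
    PSUTIL_AVAILABLE = True
except ImportError:
    psutil = None
    PSUTIL_AVAILABLE = False

DEFAULT_MEM_PERCENT = 50.0               # assumed usage when psutil is missing
DEFAULT_AVAILABLE_BYTES = 8 * 1024 ** 3  # assumed 8 GB available


def memory_percent():
    """Return system memory usage in percent, or the default without psutil."""
    if PSUTIL_AVAILABLE:
        return float(psutil.virtual_memory().percent)
    return DEFAULT_MEM_PERCENT


def available_bytes():
    """Return available RAM in bytes, or the default without psutil."""
    if PSUTIL_AVAILABLE:
        return int(psutil.virtual_memory().available)
    return DEFAULT_AVAILABLE_BYTES
```

With the 50% default, usage never crosses the 80% threshold or drops below 55%, which is consistent with adaptive mode running at a fixed batch size when psutil is absent.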

5. Data Flow Comparison

Original Flow

Config → Parse Args → Connect → Create Collection → Create Index
    → Generate ALL vectors (chunk) → Insert (batch loop) → Flush → Monitor

Enhanced Flow

Config → Parse Args → Connect → Create Collection → Create Index
    → Select Mode:
        ├─ Standard:  Chunked generate+insert with gc.collect()
        ├─ Adaptive:  Memory-monitored generate+insert with auto-scaling
        └─ Disk:      Phase1(generate→disk) → Phase2(disk→insert)
    → Flush → Monitor → Summary Report

New CLI Arguments

Argument	Type	Default	Description
`--seed`	int	None	Random seed for reproducible generation
`--adaptive`	flag	False	Enable memory-aware batch sizing
`--memory-budget`	str	"0"	Memory limit (e.g., `4G`, `512M`)
`--disk-backed`	flag	False	Use memory-mapped temp file
`--temp-dir`	str	system temp	Directory for disk-backed mode
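
A minimal argparse sketch of these flags is shown below. The flag names, types, and defaults follow the table; `build_parser` and `parse_memory_budget` are hypothetical helpers, and treating `G` and `M` as binary multiples (1024-based) is an assumption, since the table does not say which convention the script uses.

```python
import argparse
import re

_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_memory_budget(text):
    """Convert strings such as '4G' or '512M' to bytes (binary units assumed)."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?)B?\s*", text.upper())
    if not match:
        raise argparse.ArgumentTypeError(f"invalid memory budget: {text!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of the new loader flags")
    parser.add_argument("--seed", type=int, default=None,
                        help="random seed for reproducible generation")
    parser.add_argument("--adaptive", action="store_true",
                        help="enable memory-aware batch sizing")
    parser.add_argument("--memory-budget", type=str, default="0",
                        help="memory limit, e.g. 4G or 512M")
    parser.add_argument("--disk-backed", action="store_true",
                        help="use a memory-mapped temp file")
    parser.add_argument("--temp-dir", type=str, default=None,
                        help="directory for disk-backed mode")
    return parser
```

For example, `build_parser().parse_args(["--disk-backed", "--memory-budget", "512M"])` yields `disk_backed=True`, and `parse_memory_budget("512M")` returns 536870912.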

Performance Characteristics

Mode	Memory Usage	Throughput	Best For
Standard	High (chunk_size × dim × 4B)	Highest	<100M vectors, adequate RAM
Adaptive	Variable (auto-regulated)	Medium-High	Variable memory, shared systems
Disk-Backed	Low (batch_size × dim × 4B)	Medium	>100M vectors, limited RAM

Memory Footprint Estimates

Standard Mode (1M chunk, 1536 dim):
    Chunk: 1,000,000 × 1,536 × 4 bytes ≈ 6.1 GB (5.7 GiB)

Adaptive Mode (10K batch, 1536 dim):
    Batch: 10,000 × 1,536 × 4 bytes ≈ 61.4 MB (58.6 MiB)
    Peak: ~2-3× batch size during insert

Disk-Backed Mode (10K batch, 1536 dim):
    Memory: ~61.4 MB (58.6 MiB) per batch (streaming)
    Disk: num_vectors × 1,536 × 4 bytes (temp file)
    Example: 1B vectors ≈ 6.1 TB (5.6 TiB) temp file
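
These estimates all follow from `num_vectors × dim × 4 bytes`. The short sketch below (the helper name is hypothetical) does the arithmetic and prints both decimal and binary units, since the two differ by about 7% at the GB scale and about 10% at the TB scale.

```python
def footprint_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw size of float32 vectors, excluding Python and client overhead."""
    return num_vectors * dim * bytes_per_value


chunk = footprint_bytes(1_000_000, 1536)       # standard-mode chunk
batch = footprint_bytes(10_000, 1536)          # one 10K batch
full = footprint_bytes(1_000_000_000, 1536)    # 1B-vector temp file

print(f"chunk: {chunk / 10**9:.2f} GB  = {chunk / 2**30:.2f} GiB")
print(f"batch: {batch / 10**6:.2f} MB = {batch / 2**20:.2f} MiB")
print(f"full:  {full / 10**12:.2f} TB  = {full / 2**40:.2f} TiB")
```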

Backward Compatibility

All original arguments and behaviors are preserved:

# Original command still works identically
python load_vdb.py --config vdbbench/configs/10m_diskann.yaml \
    --collection-name test \
    --num-vectors 1000000 \
    --batch-size 10000 \
    --force \
    --compact

The script automatically uses standard mode when no optimization flags are specified.

Dependencies

Required

numpy
pymilvus
pyyaml (via config_loader)

Optional

psutil - Enables memory monitoring and adaptive scaling

# Install optional dependency
pip install psutil

Error Handling

Scenario	Behavior
psutil not installed	Graceful degradation, uses defaults
Memory pressure (adaptive)	Auto-scale down batch size
Insert error (adaptive)	Force scale down, retry
Disk buffer cleanup failure	Warning logged, continues
Missing required params	Parser error with guidance

Summary Report

Enhanced summary output includes:

============================================================
Loading Summary
============================================================
Vectors loaded:    10,000,000
Total time:        245.3s
Throughput:        40,766 vectors/sec
Generation time:   45.2s
Insertion time:    198.1s
Batches:           1,000
Batch adjustments: 3          # (adaptive mode only)
Errors:            0          # (adaptive mode only)
============================================================
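
The throughput line is the loaded vector count divided by total wall-clock time. A minimal formatting sketch, with a hypothetical `format_summary` helper and only the non-adaptive fields, could look like this:

```python
def format_summary(vectors_loaded, total_s, generation_s, insertion_s, batches):
    """Hypothetical sketch of the summary block for non-adaptive modes."""
    throughput = vectors_loaded / total_s if total_s > 0 else 0.0
    rule = "=" * 60
    lines = [
        rule,
        "Loading Summary",
        rule,
        f"Vectors loaded:    {vectors_loaded:,}",
        f"Total time:        {total_s:.1f}s",
        f"Throughput:        {throughput:,.0f} vectors/sec",
        f"Generation time:   {generation_s:.1f}s",
        f"Insertion time:    {insertion_s:.1f}s",
        f"Batches:           {batches:,}",
        rule,
    ]
    return "\n".join(lines)


print(format_summary(10_000_000, 245.3, 45.2, 198.1, 1000))
```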

github-actions · 2026-01-21T15:12:20Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

russfellows · 2026-01-23T22:05:19Z

I'm wondering if it makes sense to use dgen-py instead of np.random.default_rng()?

Since generation time is a major consideration, being able to run 10x faster could be a big improvement. Instead of days, it would be hours. The change is minimal. Here is the PyPI page: https://pypi.org/project/dgen-py/

Below is the exact Python program I ran on 6 different cloud instance sizes; the results are listed on the dgen-py page linked above. It compares four data generation methods: os.urandom (/dev/urandom), np.random, Numba with a custom Xoshiro256 algorithm, and dgen-py.

bench_dgen-vs-numba-numpy.py

Optimized vector generation for VDB Benchmark

70678dd

idevasena requested a review from a team January 21, 2026 15:12

idevasena requested a review from a team as a code owner January 21, 2026 15:12

FileSystemGuy approved these changes Jan 21, 2026

added recall metric implementation to vdb benchmark script

f9ab288

idevasena wants to merge 2 commits into mlcommons:TF_VDBBench from idevasena:TF_VDBBench
