Optimized vector generation for VDB Benchmark #227
Open
idevasena wants to merge 2 commits into mlcommons:TF_VDBBench from
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
FileSystemGuy
approved these changes
Jan 21, 2026
I'm wondering whether it makes sense to use dgen-py instead of np.random.default_rng(). Since generation time is a major consideration, being able to run 10x faster could be a big improvement: instead of days, it would be hours. The change is minimal. Here is the PyPI site: https://pypi.org/project/dgen-py/ Below is the EXACT Python program I ran on 6 different cloud instance sizes, with the results listed above on dgen-py. It shows the performance of 4 data generation methods: os.urandom (/dev/urandom), np.random(), Numba with a custom Xoshiro256 algorithm, and dgen-py.
VDB Benchmark - Enhanced Vector Loader
Overview
The load_vdb.py script loads synthetic vectors into a Milvus vector database for benchmarking purposes. This enhanced version introduces CPU and memory optimizations while preserving backward compatibility with the original implementation.
Architecture Changes
High-Level Architecture
Component Details
1. Memory Management Utilities (New)
AdaptiveBatchController
Dynamically adjusts batch sizes based on real-time memory pressure.
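A minimal sketch of how such a controller could behave, using the get_batch_size() and force_scale_down() method names from this document. The constructor parameters, the 0.85 threshold, the halving factor, and the injectable memory_probe callable are all assumptions made for illustration, not the PR's actual implementation.

```python
class AdaptiveBatchController:
    """Illustrative sketch only; parameter names and defaults are assumptions."""

    def __init__(self, initial_batch_size=10_000, min_batch_size=100,
                 memory_threshold=0.85, scale_factor=0.5, memory_probe=None):
        self.batch_size = initial_batch_size
        self.min_batch_size = min_batch_size
        self.memory_threshold = memory_threshold  # fraction of RAM in use
        self.scale_factor = scale_factor          # multiplier applied when shrinking
        # memory_probe returns the fraction of RAM in use (0.0 to 1.0),
        # or None when no measurement is available.
        self._probe = memory_probe or (lambda: None)

    def _shrink(self):
        # Never go below the configured floor.
        self.batch_size = max(self.min_batch_size,
                              int(self.batch_size * self.scale_factor))

    def get_batch_size(self):
        """Return the current batch size, shrinking it first under memory pressure."""
        usage = self._probe()
        if usage is not None and usage > self.memory_threshold:
            self._shrink()
        return self.batch_size

    def force_scale_down(self):
        """Shrink unconditionally, e.g. after an insertion error."""
        self._shrink()
        return self.batch_size
```

Passing the probe in as a callable keeps the controller testable without real memory pressure; in practice the probe would wrap whatever monitoring the script uses.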
Key Methods:
get_batch_size(): Returns the current batch size, adjusting it if the memory threshold is exceeded
force_scale_down(): Emergency reduction after insertion errors
DiskBackedBuffer
Memory-mapped file buffer for datasets exceeding available RAM.
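A memory-mapped buffer of this kind can be sketched with np.memmap. The class below borrows the DiskBackedBuffer name, but its constructor arguments, its write() and iter_batches() methods, and its on-disk format (raw row-major float32 with no header) are assumptions for illustration; the actual file layout used by the PR is not reproduced here.

```python
import os
import tempfile

import numpy as np


class DiskBackedBuffer:
    """Illustrative sketch: vectors live in a temp file, not in RAM."""

    def __init__(self, num_vectors, dim, temp_dir=None, dtype=np.float32):
        # Reserve a temporary file; temp_dir=None uses the system default.
        fd, self.path = tempfile.mkstemp(suffix=".bin", dir=temp_dir)
        os.close(fd)
        self.shape = (num_vectors, dim)
        # mode="w+" creates (or overwrites) the file at the full size.
        self._data = np.memmap(self.path, dtype=dtype, mode="w+",
                               shape=self.shape)

    def write(self, start, vectors):
        """Copy a block of vectors into rows [start, start + len(vectors))."""
        self._data[start:start + len(vectors)] = vectors

    def iter_batches(self, batch_size):
        """Yield (start_row, view) pairs; pages are loaded by the OS on access."""
        for start in range(0, self.shape[0], batch_size):
            yield start, self._data[start:start + batch_size]

    def close(self):
        """Flush pending writes, drop the mapping, and delete the temp file."""
        self._data.flush()
        del self._data
        os.remove(self.path)
```

Because iter_batches() yields views into the mapping rather than copies, only the pages touched by the current batch need to be resident at any one time.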
File Layout:
2. Vector Generation Engine (Enhanced)
Original vs Enhanced Comparison
Original | Enhanced
np.random.random() | np.random.default_rng()
list (via .tolist()) | np.ndarray (float32)
float16 | float32
Seeded Generation Flow
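The seeded path can be illustrated with a short sketch built on np.random.default_rng(), which yields float32 arrays directly without a .tolist() conversion. The generate_vectors function and its chunk_index parameter are hypothetical; deriving an independent stream per chunk from the pair (seed, chunk_index) is one reasonable way to make chunked generation reproducible, not necessarily what the PR does.

```python
import numpy as np


def generate_vectors(num_vectors, dim, seed=None, chunk_index=0):
    """Return a (num_vectors, dim) float32 array of uniform values in [0, 1).

    With a seed, the same (seed, chunk_index) pair always yields the same
    vectors; with seed=None, fresh OS entropy is used.
    """
    # A sequence of ints is a valid seed for default_rng (assumed scheme).
    entropy = None if seed is None else [seed, chunk_index]
    rng = np.random.default_rng(entropy)
    # Generating float32 directly avoids a float64 intermediate and a cast.
    return rng.random((num_vectors, dim), dtype=np.float32)


if __name__ == "__main__":
    first = generate_vectors(1000, 128, seed=42)
    again = generate_vectors(1000, 128, seed=42)
    print(first.dtype, first.shape, np.array_equal(first, again))
```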
3. Execution Modes
Mode Selection Logic
Standard Mode (Default)
Preserves original chunked approach with added memory optimizations.
Adaptive Mode
Memory-pressure-aware execution with automatic batch scaling.
Disk-Backed Mode
Two-phase approach for billion-scale datasets on memory-constrained systems.
4. Memory Optimization Strategies
Garbage Collection Points
Memory Monitoring Integration
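As a rough illustration of how optional monitoring and explicit collection might fit together, the sketch below treats psutil as an optional import and returns None when it is unavailable. The helper names memory_usage_fraction and release_batch are hypothetical, and where the script actually places its collection points is not shown here.

```python
import gc

try:
    import psutil  # optional dependency
except ImportError:
    psutil = None


def memory_usage_fraction():
    """Return system RAM usage as a fraction (0.0 to 1.0), or None without psutil."""
    if psutil is None:
        return None
    return psutil.virtual_memory().percent / 100.0


def release_batch(batch_refs):
    """Drop references to an inserted batch, collect garbage, and re-measure.

    batch_refs is a list holding the batch objects; clearing it removes the
    last references so gc.collect() can reclaim cyclic garbage promptly.
    """
    batch_refs.clear()
    gc.collect()
    return memory_usage_fraction()
```

Returning None rather than raising lets callers fall back to fixed batch sizes when monitoring is not installed.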
5. Data Flow Comparison
Original Flow
Enhanced Flow
New CLI Arguments
--seed
--adaptive
--memory-budget (4G, 512M)
--disk-backed
--temp-dir
Performance Characteristics
Memory Footprint Estimates
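The raw size of a dense vector batch follows from count times dimension times bytes per element. The helper below is a hypothetical illustration of that arithmetic; the vector counts and dimensions in the usage lines are example values, not figures from the PR, and the estimate excludes Python and client overhead.

```python
import numpy as np


def estimate_bytes(num_vectors, dim, dtype=np.float32):
    """Raw payload size of a dense (num_vectors, dim) array of the given dtype."""
    return num_vectors * dim * np.dtype(dtype).itemsize


if __name__ == "__main__":
    # Example values only: one million 1536-dimensional float32 vectors.
    size = estimate_bytes(1_000_000, 1536)
    print(f"{size / 1e9:.3f} GB")
```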
Backward Compatibility
All original arguments and behaviors are preserved:
The script automatically uses standard mode when no optimization flags are specified.
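The fallback to standard mode can be sketched with argparse using the new flag names listed in this document. Several details are assumptions: the precedence of --disk-backed over --adaptive when both are given, the binary (1024-based) interpretation of suffixes such as 4G and 512M, and the helper names parse_memory_budget, build_parser, and select_mode.

```python
import argparse

# Assumed binary multipliers for size suffixes.
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_memory_budget(text):
    """Convert strings such as '4G' or '512M' to a byte count."""
    text = text.strip().upper()
    if text and text[-1] in _UNITS:
        return int(float(text[:-1]) * _UNITS[text[-1]])
    return int(text)  # plain number of bytes


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of the new options")
    parser.add_argument("--seed", type=int, default=None)
    parser.add_argument("--adaptive", action="store_true")
    parser.add_argument("--memory-budget", type=parse_memory_budget, default=None)
    parser.add_argument("--disk-backed", action="store_true")
    parser.add_argument("--temp-dir", default=None)
    return parser


def select_mode(args):
    """Pick an execution mode; no optimization flags means standard mode."""
    if args.disk_backed:  # assumed to take precedence over --adaptive
        return "disk-backed"
    if args.adaptive:
        return "adaptive"
    return "standard"


if __name__ == "__main__":
    print(select_mode(build_parser().parse_args()))
```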
Dependencies
Required
numpy
pymilvus
pyyaml (via config_loader)
Optional
psutil - Enables memory monitoring and adaptive scaling
# Install optional dependency
pip install psutil
Error Handling
Summary Report
Enhanced summary output includes: