perf: Optimize CUDA graph batch size selection and padding #56
Open
louiswang524 wants to merge 1 commit into sgl-project:main from
Conversation
Improve CUDA graph efficiency by reducing padding waste and using binary search for batch size lookups.

Key improvements:

1. Fine-grained batch size coverage:
   - Small batches (1-7): capture every size for zero padding waste
   - Medium batches (8-32): step by 4 instead of 8
   - Large batches (32+): keep step of 8 for memory efficiency
2. Binary search optimization:
   - Replace linear search with bisect.bisect_left for O(log n) lookup
   - Cleaner code with proper edge-case handling

Performance impact:

- Reduces average padding waste from 17.3% to 7.3% (a 9.9-point improvement)
- Particularly beneficial for common small batch sizes (3, 5, 7, 9, 11)
- Trade-off: ~7 additional graphs (~700MB memory for max_bs=160)

Examples:

- Batch size 3: 25% waste -> 0% waste (perfect fit)
- Batch size 5: 37.5% waste -> 0% waste (perfect fit)
- Batch size 9: 43.8% waste -> 25% waste
- Batch size 11: 31.2% waste -> 8.3% waste

The memory overhead is acceptable for modern GPUs (>40GB VRAM), and the improved batch packing efficiency results in better GPU utilization.
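A minimal sketch of the two ideas above, assuming the tiered capture schedule described (every size for 1-7, step 4 for 8-32, step 8 beyond). The function names `capture_batch_sizes` and `padded_batch_size` are illustrative, not the actual names in the PR:

```python
import bisect


def capture_batch_sizes(max_bs=160):
    """Hypothetical fine-grained capture schedule:
    every size for 1-7, step 4 for 8-32, step 8 above 32."""
    sizes = list(range(1, 8))                # 1-7: capture every size
    sizes += list(range(8, 33, 4))           # 8-32: step 4
    sizes += list(range(40, max_bs + 1, 8))  # 32+: step 8
    return sizes


def padded_batch_size(bs, sizes):
    """Binary search for the smallest captured size >= bs,
    replacing a linear scan with an O(log n) lookup."""
    i = bisect.bisect_left(sizes, bs)
    if i == len(sizes):
        raise ValueError(f"batch size {bs} exceeds max captured size")
    return sizes[i]
```

With this schedule, a batch of 3 hits a captured graph exactly (zero padding), while a batch of 9 pads to 12, i.e. 3/12 = 25% waste, matching the examples above.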