[QDP] add batch kernel support #700

guan404ming · 2025-12-08T11:50:36Z

Purpose of PR

Implemented batch encoding for amplitude encoding in the GPU kernel, optimizing memory usage and performance.
Refactored data handling to streamline the process of reading from Parquet files and encoding to GPU.
Updated the benchmark script to allow selection of frameworks for testing (Mahout, PennyLane, Qiskit).
Added verification functionality to compare quantum states across different frameworks.
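
For reviewers less familiar with amplitude encoding, here is a minimal CPU-side sketch of what a batch encoder computes: each sample is L2-normalized and zero-padded to 2^n amplitudes. It is not the GPU kernel from this PR; the function name `amplitude_encode_batch` and the error handling are assumptions made for illustration.

```python
import math


def amplitude_encode_batch(samples, num_qubits):
    """L2-normalize each sample and zero-pad it to 2**num_qubits amplitudes.

    Hypothetical helper for illustration only; not part of the QDP API.
    """
    dim = 1 << num_qubits
    states = []
    for row in samples:
        if len(row) > dim:
            raise ValueError(f"sample has {len(row)} features, max is {dim}")
        norm = math.sqrt(sum(x * x for x in row))
        if norm == 0.0:
            raise ValueError("cannot encode an all-zero sample")
        # Normalized amplitudes followed by zero padding up to the state size.
        states.append([x / norm for x in row] + [0.0] * (dim - len(row)))
    return states


batch = amplitude_encode_batch([[3.0, 4.0], [1.0, 1.0, 1.0, 1.0]], num_qubits=2)
```

Each row of `batch` has unit L2 norm, which is the property the cross-framework verification relies on.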

Related Issues or PRs

Related to [QDP] [Benchmark] Add a End-to-End benchmark to compare with speed. #695

Changes Made

Breaking Changes

Yes
No

Checklist

Added or updated unit tests for all changes
Added or updated documentation for all changes
Successfully built and ran all unit tests or manual tests locally
PR title follows "MAHOUT-XXX: Brief Description" format (if related to an issue)
Code follows ASF guidelines

guan404ming · 2025-12-08T11:52:12Z

(qdp-python) titan% python3 benchmark/benchmark_e2e_final.py --frameworks all
Generating 200 samples of 16 qubits...
  Generated 200 samples
  Parquet file size: 100.28 MB

======================================================================
E2E BENCHMARK: 16 Qubits, 200 Samples
======================================================================

[PennyLane] Full Pipeline (Disk -> GPU)...
  IO Time: 0.3730 s
/home/gmchiu/Documents/GitHub/mahout/qdp/benchmark/benchmark_e2e_final.py:207: UserWarning: Casting complex values to real discards the imaginary part (Triggered internally at ../aten/src/ATen/native/Copy.cpp:301.)
  state_gpu = state_cpu.to("cuda", dtype=torch.float32)
  Total Time: 0.4661 s

[Qiskit] Full Pipeline (Disk -> GPU)...
  IO Time: 0.3553 s
    Processed 60/64 vectors...
  Total Time: 40.3154 s

[Mahout] Full Pipeline (Disk -> GPU)...
  Parquet->GPU (IO+Encode): 0.2692 s
  DLPack conversion: 0.0001 s
  Reshape & convert: 0.0026 s
  Total Time: 0.2743 s

======================================================================
E2E LATENCY (Lower is Better)
Samples: 200, Qubits: 16
======================================================================
Mahout           0.2743 s
PennyLane        0.4661 s
Qiskit          40.3154 s
----------------------------------------------------------------------
Speedup vs PennyLane:       1.70x
Speedup vs Qiskit:        146.96x

======================================================================
VERIFICATION (Mahout vs PennyLane)
======================================================================
Max Probability Difference: 1.02e-18
Max Amplitude Difference:   7.55e-17
>> SUCCESS: Quantum States Match!

======================================================================
VERIFICATION (Mahout vs Qiskit)
======================================================================
Max Probability Difference: 4.94e-12
Max Amplitude Difference:   2.33e-10
>> SUCCESS: Quantum States Match!

======================================================================
VERIFICATION (PennyLane vs Qiskit)
======================================================================
Max Probability Difference: 4.94e-12
Max Amplitude Difference:   2.33e-10
>> SUCCESS: Quantum States Match!
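
The two verification metrics printed above could be computed roughly as follows. This is a sketch under assumptions, not the code in `benchmark_e2e_final.py`: the helper names and the tolerance `tol` are hypothetical, and the element-wise comparison assumes both frameworks return states with the same global phase.

```python
def max_differences(state_a, state_b):
    """Return (max probability difference, max amplitude difference)."""
    if len(state_a) != len(state_b):
        raise ValueError("state vectors must have the same length")
    pairs = list(zip(state_a, state_b))
    # Amplitude difference: magnitude of the element-wise complex difference.
    max_amp = max(abs(a - b) for a, b in pairs)
    # Probability difference: compare |amplitude|^2 per basis state.
    max_prob = max(abs(abs(a) ** 2 - abs(b) ** 2) for a, b in pairs)
    return max_prob, max_amp


def states_match(state_a, state_b, tol=1e-6):
    """True when both metrics are within `tol` (an assumed threshold)."""
    max_prob, max_amp = max_differences(state_a, state_b)
    return max_prob <= tol and max_amp <= tol


reference = [0.6 + 0j, 0.8 + 0j]
candidate = [0.6 + 0j, 0.8 + 1e-9j]
ok = states_match(reference, candidate)
```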

guan404ming · 2025-12-08T12:46:01Z

Points that need improvement, with possible solutions

Single large memory allocation -> adopt a paged state vector, which requires kernel changes.
read_parquet_batch is blocking -> introduce double-buffered async I/O to overlap CPU reads with GPU compute and improve throughput.
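
A rough sketch of the double-buffering idea, assuming the reader can be expressed as a Python iterator: a background thread reads ahead into a bounded queue while the consumer processes the previous batch. `prefetch` and `fake_read_batches` are hypothetical names; the latter only stands in for `read_parquet_batch`, and the `sum` call stands in for the GPU encode step.

```python
import queue
import threading


def prefetch(batches, depth=1):
    """Yield items from `batches` while a background thread reads ahead.

    With depth=1, one batch waits in the queue while another is processed.
    Errors raised by the reader are not propagated in this sketch.
    """
    buf = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def reader():
        try:
            for batch in batches:
                buf.put(batch)  # blocks when the queue is full
        finally:
            buf.put(done)

    threading.Thread(target=reader, daemon=True).start()
    while True:
        item = buf.get()
        if item is done:
            return
        yield item


def fake_read_batches():
    # Stand-in for a blocking reader such as read_parquet_batch.
    for i in range(3):
        yield [i] * 4


encoded = [sum(batch) for batch in prefetch(fake_read_batches())]
```

The bounded queue caps how far the reader can run ahead, so host memory use stays proportional to `depth` rather than to the file size.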

400Ping

Overall LGTM

400Ping · 2025-12-08T15:00:27Z

Points that need improvement, with possible solutions

Single large memory allocation -> adopt a paged state vector, which requires kernel changes.

read_parquet_batch is blocking -> introduce double-buffered async I/O to overlap CPU reads with GPU compute and improve throughput.

I could help with this if you need.

rich7420 · 2025-12-09T09:15:34Z

LGTM!
Here's my result:

======================================================================
E2E BENCHMARK: 16 Qubits, 2000 Samples
======================================================================

[PennyLane] Full Pipeline (Disk -> GPU)...
  IO Time: 1.5408 s
/home/rich-wsl/mahout/qdp/qdp-python/../benchmark/benchmark_e2e_final.py:207: UserWarning: Casting complex values to real discards the imaginary part (Triggered internally at ../aten/src/ATen/native/Copy.cpp:301.)
  state_gpu = state_cpu.to("cuda", dtype=torch.float32)
  Total Time: 2.6046 s

[Qiskit] Full Pipeline (Disk -> GPU)...
  IO Time: 1.3602 s
    Processed 10/16 vectors...
  Total Time: 314.1786 s

[Mahout] Full Pipeline (Disk -> GPU)...
  Parquet->GPU (IO+Encode): 1.5916 s
  DLPack conversion: 0.0002 s
  Reshape & convert: 0.0870 s
  Total Time: 1.7024 s

======================================================================
E2E LATENCY (Lower is Better)
Samples: 2000, Qubits: 16
======================================================================
Mahout           1.7024 s
PennyLane        2.6046 s
Qiskit         314.1786 s
----------------------------------------------------------------------
Speedup vs PennyLane:       1.53x
Speedup vs Qiskit:        184.56x

======================================================================
VERIFICATION (Mahout vs PennyLane)
======================================================================
Max Probability Difference: 1.21e-18
Max Amplitude Difference:   8.93e-17
>> SUCCESS: Quantum States Match!

======================================================================
VERIFICATION (Mahout vs Qiskit)
======================================================================
Max Probability Difference: 4.96e-12
Max Amplitude Difference:   2.33e-10
>> SUCCESS: Quantum States Match!

======================================================================
VERIFICATION (PennyLane vs Qiskit)
======================================================================
Max Probability Difference: 4.96e-12
Max Amplitude Difference:   2.33e-10
>> SUCCESS: Quantum States Match!

rich7420 · 2025-12-09T09:17:52Z

Single large memory allocation -> adopt a paged state vector, which requires kernel changes.

read_parquet_batch is blocking -> introduce double-buffered async I/O to overlap CPU reads with GPU compute and improve throughput.

I think this part should use an iterator-style approach, so it fits any amount of RAM and prevents OOM at the same time.
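
To make the iterator idea concrete, here is a minimal sketch assuming rows can be consumed lazily: only `batch_size` rows are materialized at a time, so peak memory depends on the batch size rather than on the dataset size. `iter_batches` is a hypothetical helper, not an existing QDP function.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Lazily yield lists of at most `batch_size` rows from any iterable."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


chunks = list(iter_batches(range(5), batch_size=2))
```

In a real pipeline the caller would pick `batch_size` from the available host or GPU memory and encode each chunk before requesting the next one.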

guan404ming · 2025-12-09T11:03:26Z

I could help with this if you need.

Thanks for the kind words! I’ve already implemented some related parts, but really appreciate your willingness to help. Once I wrap things up, I’d be happy to have you review them.

guan404ming · 2025-12-09T11:04:12Z

I think this part should use an iterator-style approach, so it fits any amount of RAM and prevents OOM at the same time.

I'm not really familiar with this part. Maybe you could help with it, thanks!

guan404ming · 2025-12-09T11:05:27Z

Merged. Feel free to open a PR to refine this one. Thanks for all the reviews!

* [QDP] Add batch encoding support * Refactor batch pre-processing

[QDP] Add batch encoding support

947def4

guan404ming changed the base branch from main to dev-qdp December 8, 2025 11:50

Refactor batch pre-processing

b219d2a

400Ping approved these changes Dec 8, 2025

rich7420 approved these changes Dec 9, 2025

guan404ming merged commit ce4b7ca into apache:dev-qdp Dec 9, 2025
2 checks passed

guan404ming deleted the optimized-batch-kernel branch December 9, 2025 11:05

guan404ming added a commit to guan404ming/mahout that referenced this pull request Dec 11, 2025

[QDP] add batch kernel support (apache#700)

ef4aa62

* [QDP] Add batch encoding support * Refactor batch pre-processing
