@guan404ming guan404ming commented Dec 8, 2025

Purpose of PR

  • Implemented batch encoding for amplitude encoding in the GPU kernel, optimizing memory usage and performance.
  • Refactored data handling to streamline the process of reading from Parquet files and encoding to GPU.
  • Updated the benchmark script to allow selection of frameworks for testing (Mahout, PennyLane, Qiskit).
  • Added verification functionality to compare quantum states across different frameworks.
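
For readers new to the term, amplitude encoding turns a real-valued vector into the amplitudes of a quantum state by L2-normalizing it, and the batch form does this for many samples at once. The following is a minimal CPU sketch in plain Python, not the GPU kernel from this PR; the function name `amplitude_encode_batch` and the zero-padding of short samples are assumptions made for illustration.

```python
import math


def amplitude_encode_batch(batch, num_qubits):
    """L2-normalize each sample into a state vector of length 2**num_qubits.

    Hypothetical CPU reference for illustration only; not the PR's kernel.
    """
    dim = 1 << num_qubits  # a state vector over n qubits has 2**n amplitudes
    states = []
    for sample in batch:
        if len(sample) > dim:
            raise ValueError("sample does not fit in 2**num_qubits amplitudes")
        # Assumption: samples shorter than dim are zero-padded.
        padded = list(sample) + [0.0] * (dim - len(sample))
        norm = math.sqrt(sum(x * x for x in padded))
        if norm == 0.0:
            raise ValueError("cannot encode an all-zero sample")
        states.append([x / norm for x in padded])
    return states


states = amplitude_encode_batch([[3.0, 4.0], [1.0, 1.0, 1.0, 1.0]], num_qubits=2)
```

After normalization, the squared amplitudes of every encoded sample sum to 1, which is the property the verification step relies on.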

Related Issues or PRs

Changes Made

  • Bug fix
  • New feature
  • Refactoring
  • Documentation
  • Test
  • CI/CD pipeline
  • Other

Breaking Changes

  • Yes
  • No

Checklist

  • Added or updated unit tests for all changes
  • Added or updated documentation for all changes
  • Successfully built and ran all unit tests or manual tests locally
  • PR title follows "MAHOUT-XXX: Brief Description" format (if related to an issue)
  • Code follows ASF guidelines

@guan404ming guan404ming changed the base branch from main to dev-qdp December 8, 2025 11:50
@guan404ming
Member Author

(qdp-python) titan% python3 benchmark/benchmark_e2e_final.py --frameworks all
Generating 200 samples of 16 qubits...
  Generated 200 samples
  Parquet file size: 100.28 MB

======================================================================
E2E BENCHMARK: 16 Qubits, 200 Samples
======================================================================

[PennyLane] Full Pipeline (Disk -> GPU)...
  IO Time: 0.3730 s
/home/gmchiu/Documents/GitHub/mahout/qdp/benchmark/benchmark_e2e_final.py:207: UserWarning: Casting complex values to real discards the imaginary part (Triggered internally at ../aten/src/ATen/native/Copy.cpp:301.)
  state_gpu = state_cpu.to("cuda", dtype=torch.float32)
  Total Time: 0.4661 s

[Qiskit] Full Pipeline (Disk -> GPU)...
  IO Time: 0.3553 s
    Processed 60/64 vectors...
  Total Time: 40.3154 s

[Mahout] Full Pipeline (Disk -> GPU)...
  Parquet->GPU (IO+Encode): 0.2692 s
  DLPack conversion: 0.0001 s
  Reshape & convert: 0.0026 s
  Total Time: 0.2743 s

======================================================================
E2E LATENCY (Lower is Better)
Samples: 200, Qubits: 16
======================================================================
Mahout           0.2743 s
PennyLane        0.4661 s
Qiskit          40.3154 s
----------------------------------------------------------------------
Speedup vs PennyLane:       1.70x
Speedup vs Qiskit:        146.96x

======================================================================
VERIFICATION (Mahout vs PennyLane)
======================================================================
Max Probability Difference: 1.02e-18
Max Amplitude Difference:   7.55e-17
>> SUCCESS: Quantum States Match!

======================================================================
VERIFICATION (Mahout vs Qiskit)
======================================================================
Max Probability Difference: 4.94e-12
Max Amplitude Difference:   2.33e-10
>> SUCCESS: Quantum States Match!

======================================================================
VERIFICATION (PennyLane vs Qiskit)
======================================================================
Max Probability Difference: 4.94e-12
Max Amplitude Difference:   2.33e-10
>> SUCCESS: Quantum States Match!

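The two verification figures in the log can be read as element-wise comparisons between state vectors. The sketch below shows one plausible way to compute them in plain Python. The exact definitions used by `benchmark_e2e_final.py` are not shown in this thread, so the function names, the tolerance `tol`, and the absence of any global-phase handling are all assumptions.

```python
def max_amplitude_diff(a, b):
    """Largest absolute difference between corresponding amplitudes."""
    return max(abs(x - y) for x, y in zip(a, b))


def max_probability_diff(a, b):
    """Largest absolute difference between corresponding probabilities |amp|**2."""
    return max(abs(abs(x) ** 2 - abs(y) ** 2) for x, y in zip(a, b))


def states_match(a, b, tol=1e-6):
    """Return True when both metrics stay within tol (tol is an assumed value)."""
    if len(a) != len(b):
        raise ValueError("state vectors must have the same length")
    return max_amplitude_diff(a, b) <= tol and max_probability_diff(a, b) <= tol


reference = [0.6 + 0j, 0.8 + 0j]
candidate = [0.6 + 0j, 0.8 + 1e-9 + 0j]  # tiny numerical deviation
swapped = [0.8 + 0j, 0.6 + 0j]           # a genuinely different state

same = states_match(reference, candidate)
different = states_match(reference, swapped)
```

Comparing raw amplitudes treats two states that differ only by a global phase as different, so a stricter check would need to align phases first.
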
@guan404ming
Member Author

Points to improve and possible solutions

  • Single Large Memory Allocation -> adopt a paged state vector requiring kernel changes
  • read_parquet_batch blocks -> introduce double-buffered async I/O to overlap CPU read with GPU compute and improve throughput.
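
The second point can be approximated in Python with a bounded background prefetch: a reader thread loads the next batch while the consumer works on the current one. This is only a sketch of the idea using the standard library, not the design proposed for `read_parquet_batch`; the name `prefetch` and the `depth` parameter are hypothetical.

```python
import queue
import threading


def prefetch(batches, depth=1):
    """Yield items from `batches` while a background thread reads ahead.

    `depth` bounds how many finished batches may wait in the queue, so the
    reader cannot run arbitrarily far ahead of the consumer.
    """
    q = queue.Queue(maxsize=depth)

    def reader():
        try:
            for batch in batches:
                q.put(("item", batch))  # blocks while the queue is full
            q.put(("done", None))
        except Exception as exc:        # forward read errors to the consumer
            q.put(("error", exc))

    # Daemon thread: it will not keep the process alive if the consumer stops early.
    threading.Thread(target=reader, daemon=True).start()

    while True:
        kind, payload = q.get()
        if kind == "done":
            return
        if kind == "error":
            raise payload
        yield payload


def fake_reader(n):
    """Stand-in for a blocking batch read (hypothetical)."""
    for i in range(n):
        yield [i] * 4


totals = [sum(batch) for batch in prefetch(fake_reader(3))]
```

With `depth=1`, at most one batch waits in the queue while another is being consumed, and the reader may already hold the following one. The overlap only pays off when the read releases the GIL, as file I/O typically does.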

@400Ping 400Ping left a comment

Overall LGTM

@400Ping

400Ping commented Dec 8, 2025

Points to improve and possible solutions

  • Single Large Memory Allocation -> adopt a paged state vector requiring kernel changes
  • read_parquet_batch blocks -> introduce double-buffered async I/O to overlap CPU read with GPU compute and improve throughput.

I could help with this if you need.

@rich7420
Contributor

rich7420 commented Dec 9, 2025

LGTM!
Here are my results:

======================================================================
E2E BENCHMARK: 16 Qubits, 2000 Samples
======================================================================

[PennyLane] Full Pipeline (Disk -> GPU)...
  IO Time: 1.5408 s
/home/rich-wsl/mahout/qdp/qdp-python/../benchmark/benchmark_e2e_final.py:207: UserWarning: Casting complex values to real discards the imaginary part (Triggered internally at ../aten/src/ATen/native/Copy.cpp:301.)
  state_gpu = state_cpu.to("cuda", dtype=torch.float32)
  Total Time: 2.6046 s

[Qiskit] Full Pipeline (Disk -> GPU)...
  IO Time: 1.3602 s
    Processed 10/16 vectors...
  Total Time: 314.1786 s

[Mahout] Full Pipeline (Disk -> GPU)...
  Parquet->GPU (IO+Encode): 1.5916 s
  DLPack conversion: 0.0002 s
  Reshape & convert: 0.0870 s
  Total Time: 1.7024 s

======================================================================
E2E LATENCY (Lower is Better)
Samples: 2000, Qubits: 16
======================================================================
Mahout           1.7024 s
PennyLane        2.6046 s
Qiskit         314.1786 s
----------------------------------------------------------------------
Speedup vs PennyLane:       1.53x
Speedup vs Qiskit:        184.56x

======================================================================
VERIFICATION (Mahout vs PennyLane)
======================================================================
Max Probability Difference: 1.21e-18
Max Amplitude Difference:   8.93e-17
>> SUCCESS: Quantum States Match!

======================================================================
VERIFICATION (Mahout vs Qiskit)
======================================================================
Max Probability Difference: 4.96e-12
Max Amplitude Difference:   2.33e-10
>> SUCCESS: Quantum States Match!

======================================================================
VERIFICATION (PennyLane vs Qiskit)
======================================================================
Max Probability Difference: 4.96e-12
Max Amplitude Difference:   2.33e-10
>> SUCCESS: Quantum States Match!

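The speedup lines in both logs are the ratio of a framework's total time to Mahout's total time. As a quick cross-check, the sketch below recomputes them from the totals reported in this thread; the helper name `speedup` is hypothetical. Because the printed totals are rounded to four decimals, the recomputed values can differ from the logged ones in the last digit.

```python
def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


# Total times in seconds, copied from the two benchmark logs above.
runs = {
    "16 qubits, 200 samples": {
        "Mahout": 0.2743, "PennyLane": 0.4661, "Qiskit": 40.3154,
    },
    "16 qubits, 2000 samples": {
        "Mahout": 1.7024, "PennyLane": 2.6046, "Qiskit": 314.1786,
    },
}

for label, totals_s in runs.items():
    for other in ("PennyLane", "Qiskit"):
        ratio = speedup(totals_s[other], totals_s["Mahout"])
        print(f"{label}: {ratio:.2f}x vs {other}")
```
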
@rich7420
Contributor

rich7420 commented Dec 9, 2025

  • Single Large Memory Allocation -> adopt a paged state vector requiring kernel changes
  • read_parquet_batch blocks -> introduce double-buffered async I/O to overlap CPU read with GPU compute and improve throughput.

I think this part should use an iterator-based approach, so that it fits any amount of RAM and prevents OOM at the same time.
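
A minimal sketch of that idea, assuming the samples can be consumed as a stream: a generator yields fixed-size batches so that only one batch is resident at a time, whatever the total dataset size. The name `iter_batches` here is a hypothetical helper in plain Python, not the QDP API.

```python
from itertools import islice


def iter_batches(samples, batch_size):
    """Yield lists of up to `batch_size` items from any iterable.

    Only the current batch is materialized, so peak memory is governed by
    `batch_size` rather than by the total number of samples.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(samples)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


batches = list(iter_batches(range(5), batch_size=2))
```

For Parquet input specifically, PyArrow's `ParquetFile.iter_batches` offers the same pattern at the file level, which might be a natural building block here.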

@guan404ming
Member Author

I could help with this if you need.

Thanks for the kind words! I’ve already implemented some related parts, but really appreciate your willingness to help. Once I wrap things up, I’d be happy to have you review them.

@guan404ming
Member Author

guan404ming commented Dec 9, 2025

I think this part should be like Iterator way to fit every size of RAM and prevent OOM at same time.

I am not really familiar with this part. Maybe you could help with it? Thanks!

@guan404ming guan404ming merged commit ce4b7ca into apache:dev-qdp Dec 9, 2025
2 checks passed
@guan404ming guan404ming deleted the optimized-batch-kernel branch December 9, 2025 11:05
@guan404ming
Member Author

Merged. Feel free to open a PR to refine this one. Thanks for all the reviews!

guan404ming added a commit to guan404ming/mahout that referenced this pull request Dec 11, 2025
* [QDP] Add batch encoding support

* Refactor batch pre-processing