
Conversation


Copilot AI commented Nov 20, 2025

Enables seamless conversion of memory-mapped Paper arrays to device tensors for deep learning workflows. Minimizes memory usage through zero-copy conversion where possible.

Changes

Core API (paper/numpy_api.py)

  • ndarray.to_torch(device=None) - Converts to PyTorch tensor with optional device specification
    • Zero-copy for writable memmaps on CPU
    • Handles read-only memmaps with minimal copy overhead
    • Supports CUDA device transfer
  • ndarray.to_tensorflow() - Converts to TensorFlow tensor via buffer protocol
  • Both methods compute lazy arrays automatically before conversion
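
A condensed sketch of the conversion path described above (the is_lazy flag and the materialization step are illustrative shorthand for the internal logic, not the exact code; only to_torch/to_tensorflow match the actual API):

import numpy as np

def to_torch(self, device=None):
    import torch  # optional dependency; ImportError surfaces here if missing

    arr = self.compute() if self.is_lazy else self  # materialize lazy graphs once
    np_view = np.asarray(arr)
    if not np_view.flags.writeable:
        np_view = np.array(np_view)  # read-only memmap: single minimal copy

    tensor = torch.from_numpy(np_view)  # zero-copy: shares the mapped buffer
    if device is not None:
        tensor = tensor.to(device)  # CUDA transfer copies off the buffer
    return tensor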

Testing (tests/test_tensor_conversion.py)

  • 17 test cases covering materialized/lazy arrays, CPU/GPU devices, error handling
  • Validates zero-copy behavior and computation correctness

Documentation

  • Demo showing complete out-of-core workflow (examples/tensor_conversion_demo.py)
  • Updated README with API reference and usage patterns

Usage

from paper import numpy_api as pnp

# Load large array with memory mapping
arr = pnp.load("matrix.dat", shape=(10000, 10000), dtype=np.float32)

# Build a lazy computation and execute it
result = (arr * 2).compute()

# Convert efficiently to device tensors
torch_cpu = result.to_torch()              # zero-copy for writable arrays
torch_gpu = result.to_torch(device='cuda') # copy to GPU
tf_tensor = result.to_tensorflow()          # buffer protocol conversion
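
To check the zero-copy claim from user code (not part of the Paper API), compare data pointers; this assumes the result object exposes its buffer through np.asarray:

import numpy as np

np_view = np.asarray(result)
assert torch_cpu.data_ptr() == np_view.ctypes.data  # same underlying buffer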

Notes

  • PyTorch and TensorFlow are optional dependencies; a missing package raises a clear ImportError (see the sketch after this list)
  • Zero-copy optimization applies to writable memory-mapped arrays on CPU
  • Lazy arrays are materialized once during conversion
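
The graceful ImportError from the first note can be implemented with a small guard like this (a sketch; the message wording is illustrative):

def _require_torch():
    try:
        import torch
    except ImportError as exc:
        raise ImportError(
            "to_torch() requires PyTorch; install it with `pip install torch`"
        ) from exc
    return torch
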
Original prompt

This section details the original issue you should resolve.

<issue_title>direct conversion of out-of-core arrays to device tensors, support for DLPack or memory mapping</issue_title>
<issue_description>Direct Conversion of Out-of-Core Arrays to Device Tensors

Allow users to seamlessly convert large arrays managed out-of-core (on disk, not in RAM) into tensors compatible with PyTorch, TensorFlow, JAX, or other frameworks that run on GPU or other devices.

The conversion should avoid unnecessary data copies, enabling direct access by deep learning libraries for maximum performance.

Key Goals:

  1. Minimize memory usage: Only the relevant chunk/page/batch should be loaded into device memory.
  2. Zero-copy or memory-mapped access: Whenever possible, avoid copying data by sharing memory pointers (especially for Numpy/PyTorch/CUDF integrations).

API Example:

from paper import numpy_api as pnp
import numpy as np

# Load a large out-of-core array (no data read yet)
arr = pnp.load("large_matrix.dat", shape=(10000, 10000), dtype=np.float32)

# Build your computation graph (lazy, nothing loaded)
c = arr * 2 + 4

# Execute the computation plan – now you get a result object
result = c.compute()

# Convert result to PyTorch and TensorFlow tensors
torch_tensor = result.to_torch()      # Efficient, memory-mapped conversion
tf_tensor = result.to_tensorflow()    # Efficient, memory-mapped conversion

Memory mapping: Map disk data into RAM (using mmap), then wrap as tensor without physical copy.
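
A minimal standalone illustration of that idea, independent of the Paper API: np.memmap maps the file into the address space, and torch.from_numpy wraps the mapping without a physical copy.

import numpy as np
import torch

mm = np.memmap("large_matrix.dat", dtype=np.float32, mode="r+",
               shape=(10000, 10000))
t = torch.from_numpy(mm)               # pages fault in lazily on access
assert t.data_ptr() == mm.ctypes.data  # no copy: same underlying buffer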

Tests
Add unit tests</issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 3 commits November 20, 2025 01:44
Co-authored-by: j143 <53068787+j143@users.noreply.github.com>
Copilot AI changed the title [WIP] Add direct conversion of out-of-core arrays to device tensors Add direct conversion of out-of-core arrays to PyTorch and TensorFlow tensors Nov 20, 2025
Copilot AI requested a review from j143 November 20, 2025 01:50
@j143 (Owner) left a comment


Solid implementation overall with thoughtful design. However, before merging, I'd like the following critical questions addressed:

API DESIGN & STABILITY

  1. Public API Status: Should to_torch() and to_tensorflow() be marked experimental in docstrings? First-cut APIs often need breaking changes.

  2. Zero-Copy Preconditions: Document EXACTLY when zero-copy is guaranteed:

    • Writable memmap on CPU: ✓
    • Read-only memmap: falls back to copy
    • Non-contiguous/sliced arrays: unclear
    • Alignment, endianness requirements?
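
For example, an explicit precondition helper would make the contract testable; a sketch (names illustrative, and whether contiguity is actually required is part of what needs documenting):

def _can_zero_copy(arr):
    # conditions one would expect spelled out: writable buffer,
    # contiguity (or documented stride support), native endianness
    return (arr.flags.writeable
            and arr.flags.c_contiguous
            and arr.dtype.isnative)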

PYTORCH-SPECIFIC CONCERNS

  1. CUDA Safety (line 231): cpu_tensor.to(device) can fail silently if CUDA is unavailable. Need error handling BEFORE the .to() call (see the sketch after this list).

  2. Memmap Tensor Lifetime: What happens if underlying file is closed/resized while tensor is in use? Document constraints for training loops.

  3. Gradient Flow: Test that zero-copy tensors support requires_grad=True. Current tests don't verify this.

  4. Non-Contiguous Arrays: Add tests for transposed/sliced arrays to verify torch.from_numpy() semantics.
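
For point 1, a guard along these lines before the transfer would surface a clear error (a sketch; the message wording is illustrative):

import torch

def _check_device(device):
    dev = torch.device(device)  # raises on invalid device strings
    if dev.type == "cuda" and not torch.cuda.is_available():
        raise RuntimeError(f"device {device!r} requested but CUDA is not available")
    return dev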

TENSORFLOW CONCERNS

  1. Buffer Protocol: Which TF API is used? Test both eager and graph execution modes.

  2. Device Placement: Always CPU? Should match PyTorch with device parameter?

  3. Dtype Coverage: Tests only check float32. Need float16, int64, bool, complex.
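
For point 3, the kind of parametrized test I have in mind (pytest; tf.convert_to_tensor stands in for whatever API the implementation actually uses, which is the open question in point 1):

import numpy as np
import pytest

@pytest.mark.parametrize("dtype", ["float16", "float32", "int64", "bool", "complex64"])
def test_tensorflow_dtype_roundtrip(dtype):
    tf = pytest.importorskip("tensorflow")
    src = np.ones((2, 3), dtype=dtype)
    tensor = tf.convert_to_tensor(src)  # presumed conversion path
    assert tensor.dtype == tf.as_dtype(np.dtype(dtype))
    np.testing.assert_array_equal(tensor.numpy(), src)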

TESTING GAPS

  1. Lazy Array Behavior: Do you cache the materialized result or recompute on subsequent calls? Test calling .to_torch() twice on the same lazy array (see the sketch after this list).

  2. Interleaved Operations: No test for .compute() -> .to_torch() sequences.

  3. Error Modes: Invalid device specs, undersized tensors, unsupported dtypes.
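
For point 1, the caching check could look like this (lazy_arr stands in for any lazy Paper array; which assertion is right depends on the contract you choose):

def test_to_torch_twice_on_same_lazy_array(lazy_arr):
    t1 = lazy_arr.to_torch()
    t2 = lazy_arr.to_torch()
    # caching implies shared storage; recomputation implies distinct buffers.
    # Whichever is intended, the test should pin it down explicitly.
    assert t1.data_ptr() == t2.data_ptr()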

DOCUMENTATION

  1. Failure Examples: Add README example showing error when GPU unavailable.

  2. Copy Semantics: Clearly document when zero-copy vs. actual copy happens.

This is good foundational work with solid error handling. These questions are mainly about clarifying API contracts before this becomes stable.
