@bbopen bbopen commented Jan 21, 2026

Summary

Fixes Arrow encoding failure for multi-dimensional ndarrays (e.g., torch tensors) by implementing a flatten-on-encode, reshape-on-decode strategy.

Problem

PyArrow's pa.array() accepts only 1D arrays. Multi-dimensional inputs, such as 2D torch tensors, failed with:

pyarrow.lib.ArrowInvalid: only handle 1-dimensional arrays

This caused the codec-suite CI to fail.

Solution

Python Side

  • Flatten multi-dimensional arrays before pa.array()
  • Include original shape in envelope for JS-side reconstruction

JavaScript Side

  • Add reshapeArray() utility for reconstructing multi-dim arrays
  • Update ndarray decoder to reshape after Arrow decode when shape.length > 1
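
For readers who want a concrete picture of the reshape step, here is a minimal sketch of a row-major reshape. It is not the code from this PR: only the name reshapeArray() comes from the description above, while the signature, the `Nested` type, and the error message are assumptions made for illustration.

```typescript
// Sketch only: the PR's reshapeArray() in src/utils/codec.ts may differ.
type Nested<T> = T | Nested<T>[];

function reshapeArray<T>(flat: ArrayLike<T>, shape: readonly number[]): Nested<T>[] {
  // The shape must account for every element of the flat buffer.
  const expected = shape.reduce((acc, dim) => acc * dim, 1);
  if (expected !== flat.length) {
    throw new Error(
      `cannot reshape ${flat.length} elements into shape [${shape.join(", ")}]`
    );
  }
  // 0-D and 1-D data needs no nesting.
  if (shape.length <= 1) {
    return Array.from(flat);
  }
  // Row-major (C order): the last dimension varies fastest.
  const build = (dim: number, offset: number, stride: number): Nested<T>[] => {
    const size = shape[dim] ?? 0;
    const childStride = size === 0 ? 0 : stride / size;
    const out: Nested<T>[] = [];
    for (let i = 0; i < size; i++) {
      const start = offset + i * childStride;
      out.push(
        dim === shape.length - 1 ? (flat[start] as T) : build(dim + 1, start, childStride)
      );
    }
    return out;
  };
  return build(0, 0, expected);
}

const matrix = reshapeArray([1, 2, 3, 4], [2, 2]); // [[1, 2], [3, 4]]
```

Validating the element count up front means a shape that disagrees with the decoded buffer fails loudly instead of producing a silently truncated array.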

Why This Approach

| Approach | Pros | Cons |
|---|---|---|
| Flatten + Reshape | Works today, maintains Arrow efficiency | Extra reshape step |
| FixedShapeTensorArray | Arrow-native | JS doesn't support it |
| JSON fallback | Simple | ~10x larger payloads |

Test Results

✓ serializes scipy sparse matrices
✓ serializes torch tensors        # Was failing, now passes with Arrow
✓ serializes sklearn estimators
✓ handles 1D arrays (no reshape needed)
✓ handles 3D arrays
✓ handles 3D torch tensors with Arrow encoding
✓ handles single-element arrays
✓ handles single-element multi-dimensional arrays
✓ preserves dtype for float arrays
✓ handles 4D tensors (image-like batches)

Test Plan

  • Torch tensor test passes with Arrow encoding (no JSON fallback)
  • 2D tensor [[1,2],[3,4]] round-trips correctly with shape [2,2]
  • 1D arrays pass through without reshape
  • 3D arrays (2x3x4) flatten and reshape correctly
  • 4D tensors (image-like 2x2x2x2) handled correctly
  • Single-element and single-element multi-dim arrays work
  • Float dtypes preserved
  • Full test suite passes (1232 tests)
  • Build succeeds
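
The round-trip property behind these checks can be modelled in pure TypeScript. The sketch below is an illustration under stated assumptions, not the project's code: `NdarrayEnvelope`, its field names, `encodeNested()` and `decodeEnvelope()` are hypothetical, and a plain number array stands in for the Arrow-encoded 1D buffer that the Python side actually produces.

```typescript
// Hypothetical envelope; field names are illustrative, not the real wire format.
interface NdarrayEnvelope {
  shape: number[];
  dtype: string;
  data: number[]; // stands in for the Arrow-encoded 1D buffer
}

type NestedNumbers = number | NestedNumbers[];

// Stand-in for the Python side: record the shape, then flatten in row-major order.
function encodeNested(value: NestedNumbers[], dtype: string): NdarrayEnvelope {
  const shape: number[] = [];
  let probe: NestedNumbers | undefined = value;
  while (Array.isArray(probe)) {
    shape.push(probe.length);
    probe = probe[0];
  }
  const data: number[] = [];
  const walk = (node: NestedNumbers): void => {
    if (Array.isArray(node)) node.forEach((child) => walk(child));
    else data.push(node);
  };
  walk(value);
  return { shape, dtype, data };
}

// Stand-in for the JS side: rebuild the nesting from the innermost dimension outward.
function decodeEnvelope(env: NdarrayEnvelope): NestedNumbers[] {
  const expected = env.shape.reduce((acc, dim) => acc * dim, 1);
  if (expected !== env.data.length) {
    throw new Error("shape does not match the number of elements");
  }
  let level: NestedNumbers[] = env.data.slice();
  for (let d = env.shape.length - 1; d > 0; d--) {
    const size = env.shape[d] ?? 1;
    const next: NestedNumbers[] = [];
    for (let i = 0; i < level.length; i += size) {
      next.push(level.slice(i, i + size));
    }
    level = next;
  }
  return level;
}

const envelope2x2 = encodeNested([[1, 2], [3, 4]], "int64");
// envelope2x2.shape is [2, 2] and envelope2x2.data is [1, 2, 3, 4]
```

Decoding `envelope2x2` should give back the original nested value, which is the invariant the 2D, 3D, and 4D cases above exercise.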

Closes #161

🤖 Generated with Claude Code

PyArrow's pa.array() only handles 1D arrays, causing torch tensor
serialization to fail. This implements a flatten-on-encode,
reshape-on-decode strategy that maintains Arrow's binary efficiency.

## Changes

### Python Side (runtime/python_bridge.py)
- Flatten multi-dimensional arrays before pa.array()
- Include original shape in envelope for JS-side reconstruction
- Add dtype to envelope for completeness

### JavaScript Side (src/utils/codec.ts)
- Add reshapeArray() utility for reconstructing multi-dim arrays
- Update ndarray decoder to reshape Arrow-decoded data when shape.length > 1
- Handle both sync and async Arrow decode paths
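
As a rough illustration of how one reshape step can serve both decode paths, the sketch below applies the same post-processing whether the decoder returns a value or a promise. `finishDecode()`, `nestRowMajor()`, and `MaybePromise` are hypothetical names, not identifiers from src/utils/codec.ts, and the flat input stands in for whatever the Arrow decoder yields.

```typescript
// Hypothetical helpers; the actual decoder wiring in codec.ts may differ.
type MaybePromise<T> = T | Promise<T>;

// Rebuild nesting by chunking from the innermost dimension outward (row-major).
function nestRowMajor(flat: ArrayLike<number>, shape: readonly number[]): unknown[] {
  const expected = shape.reduce((acc, dim) => acc * dim, 1);
  if (expected !== flat.length) {
    throw new Error("shape does not match the number of decoded elements");
  }
  let level: unknown[] = Array.from(flat);
  for (let d = shape.length - 1; d > 0; d--) {
    const size = shape[d] ?? 1;
    const next: unknown[] = [];
    for (let i = 0; i < level.length; i += size) {
      next.push(level.slice(i, i + size));
    }
    level = next;
  }
  return level;
}

// Apply the reshape once, whether the decoder answered synchronously or not.
function finishDecode(
  decoded: MaybePromise<ArrayLike<number>>,
  shape: readonly number[]
): MaybePromise<unknown[]> {
  const finish = (flat: ArrayLike<number>): unknown[] =>
    shape.length > 1 ? nestRowMajor(flat, shape) : Array.from(flat);
  return decoded instanceof Promise ? decoded.then(finish) : finish(decoded);
}

const syncResult = finishDecode(new Int32Array([1, 2, 3, 4]), [2, 2]);
const asyncResult = finishDecode(Promise.resolve(new Int32Array([1, 2, 3, 4])), [2, 2]);
```

Keeping the reshape in a single closure means the sync and async paths cannot drift apart, and callers on the sync path never pay for an extra promise.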

### Tests
- Restore Arrow encoding for torch tensor test (remove JSON fallback)
- Verify 2D tensor [[1,2],[3,4]] round-trips correctly

## Why This Approach

- Arrow IPC payloads are ~10x smaller than JSON for numeric data
- arrow-js doesn't support FixedShapeTensorArray yet (open issue since 2020)
- Flatten+reshape is forward-compatible with future tensor support
- See: apache/arrow-js#115

Closes #161

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@coderabbitai
coderabbitai bot commented Jan 21, 2026

📥 Commits

Reviewing files that changed from the base of the PR and between 2909293 and 3723f05.

📒 Files selected for processing (3)
  • runtime/python_bridge.py
  • src/utils/codec.ts
  • test/runtime_codec_scientific.test.ts

Add 7 new tests for ndarray flatten+reshape functionality:
- 1D arrays (no reshape needed)
- 3D arrays via numpy
- 3D torch tensors with Arrow encoding
- Single-element arrays
- Single-element multi-dimensional arrays
- Float dtype preservation
- 4D tensors (image-like batches)

These tests verify the Arrow encoding path where multi-dimensional
arrays are flattened on Python side and reshaped on JS side.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@bbopen bbopen left a comment

lgtm

@bbopen bbopen merged commit ac735a5 into main Jan 21, 2026
19 of 20 checks passed
@bbopen bbopen deleted the fix/arrow-multidim-ndarray branch January 21, 2026 19:30
Development

Successfully merging this pull request may close these issues.

Arrow encoding fails for multi-dimensional ndarrays (torch tensors)
