Arrow encoding fails for multi-dimensional ndarrays (torch tensors) #161

@bbopen

Problem

serialize_ndarray() in runtime/python_bridge.py calls pa.array(obj), which only handles 1-dimensional arrays. Multi-dimensional arrays (e.g., 2D torch tensors) fail with:

pyarrow.lib.ArrowInvalid: only handle 1-dimensional arrays

This causes the codec-suite CI to fail on the torch tensor serialization test.

Root Cause

  • PyArrow's pa.array() is designed for columnar (1D) data
  • PyArrow has FixedShapeTensorArray for multi-dimensional data, but arrow-js doesn't support it yet (the upstream request has been open since 2020)
  • The test previously passed through the JSON fallback; the ADR-002 migration removed that fallback so the test now exercises Arrow encoding directly

Solution: Flatten + Reshape

Implement a flatten-on-encode, reshape-on-decode strategy:

Python Side (runtime/python_bridge.py)

def serialize_ndarray(obj):
    # Flatten multi-dimensional arrays for Arrow compatibility
    # pa.array() only handles 1D arrays
    flat = obj.flatten() if obj.ndim > 1 else obj
    arr = pa.array(flat)
    # ... Arrow IPC serialization ...
    return {
        '__tywrap__': 'ndarray',
        'codecVersion': CODEC_VERSION,
        'encoding': 'arrow',
        'b64': b64,
        'shape': list(obj.shape),  # Original shape for reconstruction
        'dtype': str(obj.dtype),
    }
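
For the decoder side, it may help to pin down the envelope shape explicitly. The following is a minimal sketch, not the actual tywrap types: the names NdarrayEnvelope and isNdarrayEnvelope are hypothetical, and the type of codecVersion is an assumption because CODEC_VERSION is not shown in this issue. The field names mirror the Python dict above.

```typescript
// Hypothetical type for the envelope produced by serialize_ndarray().
interface NdarrayEnvelope {
  __tywrap__: 'ndarray';
  codecVersion: number | string; // assumption: the real type of CODEC_VERSION is not shown
  encoding: 'arrow';
  b64: string;     // base64-encoded Arrow IPC payload holding the flattened values
  shape: number[]; // original shape, used to reshape after decoding
  dtype: string;
}

// Runtime check that a decoded JSON value has the fields the reshape step relies on.
function isNdarrayEnvelope(value: unknown): value is NdarrayEnvelope {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  const shape = v['shape'];
  return (
    v['__tywrap__'] === 'ndarray' &&
    v['encoding'] === 'arrow' &&
    typeof v['b64'] === 'string' &&
    typeof v['dtype'] === 'string' &&
    Array.isArray(shape) &&
    // Every dimension must be a non-negative integer.
    shape.every((d) => typeof d === 'number' && d >= 0 && Math.floor(d) === d)
  );
}
```

Validating shape before reshaping keeps a malformed envelope from producing a silently wrong nested array.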

JavaScript Side (src/runtime/safe-codec.ts)

// In ndarray decoder, after Arrow decode:
if (envelope.shape && envelope.shape.length > 1) {
  // Reshape flat array back to original dimensions
  data = reshapeArray(data, envelope.shape);
}
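
reshapeArray is not defined anywhere in this issue, so here is one possible sketch of it. It assumes the flat data is in row-major (C) order, which is what flatten() produces by default, and that the decoded values are plain numbers; a 64-bit integer column decoded to BigInt values would need a wider element type. The NestedArray type is a hypothetical helper name.

```typescript
// Nested JS arrays of numbers, e.g. [[1, 2], [3, 4]].
type NestedArray = number | NestedArray[];

// Rebuild a nested array from row-major flat data and the original shape.
function reshapeArray(flat: ArrayLike<number>, shape: readonly number[]): NestedArray[] {
  if (shape.length === 0) {
    throw new RangeError('shape must have at least one dimension');
  }
  const total = shape.reduce((acc, dim) => acc * dim, 1);
  if (total !== flat.length) {
    throw new RangeError(
      `cannot reshape ${flat.length} elements into shape [${shape.join(', ')}]`
    );
  }

  // strides[axis] = number of flat elements spanned by one step along that axis.
  const strides = new Array<number>(shape.length);
  let stride = 1;
  for (let axis = shape.length - 1; axis >= 0; axis--) {
    strides[axis] = stride;
    stride *= shape[axis];
  }

  const build = (axis: number, offset: number): NestedArray[] => {
    const out: NestedArray[] = [];
    for (let i = 0; i < shape[axis]; i++) {
      const start = offset + i * strides[axis];
      // The innermost axis holds values; outer axes hold sub-arrays.
      out.push(axis === shape.length - 1 ? flat[start] : build(axis + 1, start));
    }
    return out;
  };
  return build(0, 0);
}
```

Throwing on an element-count mismatch surfaces a corrupted or mismatched shape field at decode time rather than returning a truncated result.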

Benefits

  1. Maintains Arrow efficiency - Binary encoding is ~10x smaller than JSON for numeric data
  2. Works with current arrow-js - No dependency on unimplemented tensor support
  3. Forward-compatible - Can switch to FixedShapeTensorArray when arrow-js supports it
  4. Zero-copy potential - Arrow IPC supports memory-mapped reads

Future Consideration

Consider contributing FixedShapeTensorArray support upstream to apache/arrow-js.

Test Case

// 2D torch tensor that previously failed
const result = await bridge.call('torch', 'tensor', [[[1, 2], [3, 4]]]);
expect(result.shape).toEqual([2, 2]);
expect(result.data).toEqual([[1, 2], [3, 4]]);

Affected Files

  • runtime/python_bridge.py - Add flatten logic to serialize_ndarray()
  • src/runtime/safe-codec.ts - Add reshape logic to ndarray decoder
  • test/runtime_codec_scientific.test.ts - Remove JSON fallback workaround
