
Conversation

Contributor

Copilot AI commented Oct 28, 2025

Provides a high-level API for out-of-core matrix operations that wraps existing NumPy/SciPy operations rather than reimplementing mathematical kernels. The focus is orchestration: lazy evaluation, block-wise processing, buffer management, and streaming.

New API

OOCMatrix - NumPy-like wrapper for out-of-core operations:

  • Block-wise operations: blockwise_apply(), blockwise_reduce(), iterate_blocks()
  • Matrix operations: matmul() with custom operation support, lazy operators (+, *, @)
  • Statistics: sum(), mean(), std(), min(), max() computed block-wise
  • Integrates with existing Plan infrastructure for lazy evaluation and operator fusion

Example

from paper import OOCMatrix
import numpy as np

# Load large matrices (doesn't load full data into RAM)
A = OOCMatrix('fileA.bin', shape=(10_000_000, 1000))
B = OOCMatrix('fileB.bin', shape=(1000, 1000))

# Matrix multiply using NumPy dot for in-block ops
C = A.matmul(B, op=np.dot)

# Stream results block-by-block
for block, (r, c) in C.iterate_blocks():
    process(block)  # each block is a plain NumPy array

# Statistics computed without full load
mean = A.mean()
std = A.std()

# Apply transformations using existing NumPy ops
A_norm = A.blockwise_apply(lambda x: (x - mean) / std)

# Lazy evaluation with automatic fusion
result = (A + B) * 2  # Builds plan, fuses ops
materialized = result.compute('output.bin')  # Executes
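
Since the statistics above are built-ins, here is a hedged sketch of the same kind of reduction written directly against iterate_blocks(), using only calls documented above; this variant accumulates exact partials even when edge blocks are smaller:

# Hypothetical alternative to A.mean(): accumulate per-block partials.
total, count = 0.0, 0
for block, (r, c) in A.iterate_blocks():
    total += block.sum()   # in-block math stays plain NumPy
    count += block.size
exact_mean = total / count  # exact regardless of block sizes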

Implementation

  • paper/operators.py: OOCMatrix class (380 lines)
  • tests/test_operators.py: 16 tests covering API surface
  • examples_oocmatrix.py: 6 examples demonstrating usage patterns
  • OPERATORS.md: API reference and design rationale
  • README.md: Quick start guide

All mathematical operations delegate to NumPy/SciPy; the framework handles only block orchestration, lazy DAG building, buffer management, and scheduling. The sketch below illustrates that split.
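
A minimal sketch of a block-wise apply loop, assuming the input is any NumPy-sliceable on-disk array such as a memmap (this is not the actual paper/operators.py code):

import numpy as np

def blockwise_apply_sketch(arr, fn, block_rows=4096):
    # The framework's job: slice, load, and stream blocks.
    # The caller's fn does all of the mathematics.
    for start in range(0, arr.shape[0], block_rows):
        block = np.asarray(arr[start:start + block_rows])  # one block in RAM
        yield fn(block)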

Original prompt

This section contains the original issue this pull request resolves.

<issue_title>operators support to reach full ML algorithm capabilities - scaffold</issue_title>
<issue_description>We should not reimplement every mathematical operation from scratch to make the OOC framework useful and production-ready. The real gap is how operations are orchestrated; out-of-core frameworks primarily need to:

Wrap existing NumPy/SciPy/Pandas operations with lazy evaluation, block-wise processing, and streaming constructs.

Chunk large datasets and use existing, optimized libraries for the computation inside each chunk/block (a minimal sketch follows this list).

Intercept user code at the API level to provide out-of-core support transparently; for example, provide a NumPy-like API that does not load the full matrix/array into RAM, but instead dynamically loads blocks and applies operations as needed.
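
A minimal, self-contained illustration of the chunking idea (file name, dtype, and block size are assumptions made for this sketch):

import numpy as np

# Stream a column sum over a large on-disk matrix: only one row block
# is resident in RAM at a time, and NumPy does all the in-block math.
mm = np.memmap('fileA.bin', dtype=np.float32, mode='r',
               shape=(10_000_000, 1000))
col_sums = np.zeros(1000, dtype=np.float64)
for start in range(0, mm.shape[0], 100_000):
    block = np.asarray(mm[start:start + 100_000])
    col_sums += block.sum(axis=0)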

How an Ideal Implementation Looks

Matrix multiplication (@), element-wise operations, reductions (sum, mean), slicing, filtering, and custom function applications can utilize existing backends (NumPy, SciPy, CuPy).

The OOC framework focuses on providing smart block loading, buffer management, iteration control, compression, eviction, and prefetching.
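
As one concrete piece of that, a hedged sketch of buffer management as an LRU block cache (load_block is a hypothetical loader callback, not part of any existing API):

from collections import OrderedDict

class BlockCache:
    """Keep at most `capacity` blocks in RAM; evict the least
    recently used block when a new one must be loaded."""
    def __init__(self, load_block, capacity=8):
        self._load = load_block
        self._cap = capacity
        self._blocks = OrderedDict()

    def get(self, idx):
        if idx in self._blocks:
            self._blocks.move_to_end(idx)         # mark as recently used
        else:
            if len(self._blocks) >= self._cap:
                self._blocks.popitem(last=False)  # evict LRU block
            self._blocks[idx] = self._load(idx)
        return self._blocks[idx]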

Only a small amount of custom logic must be written, for block-wise orchestration, lazy DAG building, and scheduling, while the math kernels stay fully optimized in existing libraries.

Example: Block-wise Wrapper (No Kernel Rewrite Needed)

from ooc import OOCMatrix  # Your wrapper
import numpy as np

# User applies existing NumPy/SciPy operations on each block automatically
A = OOCMatrix('fileA.h5', shape=(10_000_000, 1000))
B = OOCMatrix('fileB.h5', shape=(1000, 1000))

# This triggers block-wise multiplication, but each block operation is plain NumPy
def matmul_blocks(A_block, B_block):
    return np.dot(A_block, B_block)

# API exposes big operations -- no kernel rewrite required!
C = A.matmul(B, op=matmul_blocks)

for block, idx in C.iterate_blocks():
    # downstream systems can consume each result block
    process(block)

# Other ops, e.g., sum, mean, normalization:
mean = A.blockwise_reduce(np.mean)  # mean of per-block means; exact only for equal-sized blocks
A_normalized = A.blockwise_apply(lambda x: (x - mean) / np.std(x))  # note: std here is per block

Where custom logic is sometimes needed: advanced DAG optimization (operator fusion, prescient eviction, smart scheduling) does require custom Python code, but basic matrix, linear-algebra, and statistical computations do not.
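
A hedged sketch of the fusion idea (names are illustrative, not taken from the implementation): collapse a chain of element-wise steps into one function so each block is read from disk once rather than once per step:

def fuse(ops):
    # Collapse [f, g, h] into a single per-block function.
    def fused(block):
        for op in ops:
            block = op(block)
        return block
    return fused

# e.g., normalize then scale in one pass over each block:
# fused = fuse([lambda x: (x - mean) / std, lambda x: 2 * x])
# for block, idx in A.iterate_blocks():
#     consume(fused(block))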

Proven Practice

Successful frameworks (Dask, Vaex, RAPIDS) leverage existing libraries for in-block ops and focus on orchestration, DAG construction, scheduling, and buffer management.

Conclusion

We do not need to reimplement fundamental matrix and array ops if we architect the OOC framework to reuse mature libraries for in-core computations. The main engineering lift is building the block orchestration, I/O scheduling, and memory management, which leverages and extends, rather than replaces, existing scientific Python code.</issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 3 commits October 28, 2025 02:37
Co-authored-by: j143 <53068787+j143@users.noreply.github.com>
Copilot AI changed the title [WIP] Add operators support for ML algorithm capabilities Add OOCMatrix API for out-of-core operations with NumPy/SciPy backend orchestration Oct 28, 2025
Copilot AI requested a review from j143 October 28, 2025 02:45
@j143 j143 closed this Nov 20, 2025
