
Conversation

Contributor

Copilot AI commented Oct 31, 2025

The framework had infrastructure benchmarking (matrix multiplication timing) but lacked end-to-end ML problem-solving with actionable results. This PR adds a complete gene expression classification workflow that implements the ML algorithms with Paper's out-of-core operators instead of external ML libraries, demonstrating that Paper can handle ML workloads directly.

Changes

Paper ML Module (paper_ml.py) - NEW

  • Linear regression and logistic regression implemented using only Paper's operators
  • Gradient descent training using matrix operations: @ (matmul), .T (transpose), * (scalar mult), + (add), - (sub); see the sketch after this list
  • No external ML libraries for algorithms (scikit-learn only for utilities like train_test_split)
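As a concrete illustration of the gradient descent bullet above, here is a minimal sketch of that training loop, written against plain NumPy arrays for readability. The function names and signatures are illustrative, not paper_ml.py's actual API; a Paper array exposing the same @, .T, *, +, and - operators could be dropped in instead. Only the sigmoid nonlinearity falls outside the listed operators.

```python
import numpy as np

def sigmoid(z):
    # elementwise logistic function (the one step outside the listed operators)
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, iterations=20, learning_rate=0.01):
    """Batch gradient descent built from @, .T, scalar *, +, and - alone."""
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    for _ in range(iterations):
        predictions = sigmoid(X @ weights)            # matmul
        error = predictions - y                       # the new '-' operator
        gradient = (1.0 / n_samples) * (X.T @ error)  # .T, matmul, scalar *
        weights = weights - learning_rate * gradient  # '-' again
    return weights
```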

New Operator Added (paper/numpy_api.py)

  • Subtraction operator (-) implemented as A - B = A + (-1 * B)
  • Enables gradient descent: weights = weights - learning_rate * gradient (see the sketch below)
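A sketch of how the subtraction operator can be derived from the existing add and scalar-multiply operators. PaperArray here is an illustrative in-memory stand-in, not the real class in paper/numpy_api.py:

```python
import numpy as np

class PaperArray:
    """Illustrative stand-in for Paper's array wrapper, not the real class."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def __add__(self, other):
        # existing elementwise add (out-of-core in Paper; in-memory here)
        return PaperArray(self.data + other.data)

    def __rmul__(self, scalar):
        # existing scalar multiply
        return PaperArray(scalar * self.data)

    def __sub__(self, other):
        # A - B expressed as A + (-1 * B): reuses existing operators,
        # so no new out-of-core kernel is required
        return self + (-1 * other)
```

Because A - B is rewritten in terms of operators the framework already provides, an update like weights = weights - learning_rate * gradient goes through the same add and scalar-multiply paths with no new chunking logic.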

Core ML Pipeline (ml_classification.py)

  • Disease/control binary classification using Paper's logistic regression implementation
  • Data loading via Paper or Dask frameworks
  • Evaluation metrics: accuracy, ROC AUC (an evaluation sketch follows this list)
  • Benchmarks Paper's ML workload capability, not just data loading
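A sketch of the pipeline's evaluation step, reusing sigmoid and train_logistic_regression from the sketch above. The synthetic data generation is illustrative only (ml_classification.py loads real files); scikit-learn appears solely for the split and the metrics:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
X = rng.lognormal(size=(500, 50))              # stand-in expression matrix
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize features
y = (X @ rng.normal(size=50) > 0).astype(int)  # synthetic disease/control labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

weights = train_logistic_regression(X_train, y_train,
                                    iterations=200, learning_rate=0.1)
probabilities = sigmoid(X_test @ weights)

print(f"accuracy: {accuracy_score(y_test, probabilities >= 0.5):.3f}")
print(f"ROC AUC:  {roc_auc_score(y_test, probabilities):.3f}")
```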

Example & Documentation

  • examples/ml_classification_example.py - Standalone demonstration
  • ML_TASK.md - Complete workflow documentation updated to emphasize Paper operators
  • Extended demo_real_dataset.py with ML step

Testing

  • 10 new tests covering Paper's ML implementations (label generation, training, evaluation); a representative test is sketched below
  • All 84 tests passing
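A representative test in that lean spirit (illustrative, not copied from the suite), again reusing the training sketch above:

```python
import numpy as np

def test_gradient_descent_fits_separable_data():
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 5))
    y = (X @ np.ones(5) > 0).astype(int)   # linearly separable labels
    w = train_logistic_regression(X, y, iterations=100, learning_rate=0.1)
    predictions = sigmoid(X @ w) >= 0.5
    assert (predictions == y).mean() > 0.9  # illustrative threshold
```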

Usage

# Run classification with Paper's operators
python ml_classification.py \
  --data-path gene_expression.dat \
  --shape 5000 5000 \
  --iterations 20 \
  --learning-rate 0.01

# Compare frameworks on ML task (both use Paper operators)
python ml_classification.py \
  --data-path gene_expression.dat \
  --shape 5000 5000 \
  --hdf5-path data.hdf5 \
  --compare
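For orientation, the flag handling these commands imply might look like the following argparse sketch; the actual parser in ml_classification.py may differ in names and defaults.

```python
import argparse

parser = argparse.ArgumentParser(description="Gene expression classification")
parser.add_argument("--data-path", required=True)
parser.add_argument("--shape", nargs=2, type=int, metavar=("ROWS", "COLS"))
parser.add_argument("--iterations", type=int, default=20)
parser.add_argument("--learning-rate", type=float, default=0.01)
parser.add_argument("--hdf5-path", help="HDF5 copy of the data, for --compare")
parser.add_argument("--compare", action="store_true",
                    help="run the same ML task under both frameworks")
args = parser.parse_args()
```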

Output includes accuracy, ROC AUC, and timing, demonstrating Paper's ML computation capability and solution quality.

What This Achieves

  • ML algorithms run on Paper's operators, not external libraries
  • Benchmarks Paper's ML workload: matrix ops for gradient descent
  • Demonstrates out-of-core ML capability: Paper handles the ML computation
  • Minimal framework changes: only the subtraction operator added

This implementation shows that Paper is more than a data-loading framework: it can perform ML computations using its own out-of-core operators.

Dependencies

scikit-learn is used for utilities only (train_test_split, metrics); dask[array] and psutil are used for benchmarking (both already required by the existing benchmark scripts). The ML algorithms themselves are implemented with Paper's operators.

Original prompt

This section describes the original issue you should resolve.

<issue_title>test with an ML task</issue_title>
<issue_description>Current Capabilities

  • Dataset Integration: Adds realistic gene expression datasets (synthetic but structurally identical to real RNA-seq data) with biological characteristics [gene modules, log-normal distribution, etc.].[1]
  • Benchmarking: Provides a direct performance comparison (matrix multiplication and I/O) of Paper vs. Dask using identical real datasets, with multi-run stats output for time, memory, and CPU usage.
  • Data Prep Pipeline: Supports generation and conversion from HDF5, NumPy, CSV, etc., and robust validation. Utilities and CLI interfaces included.[1]
  • Testing/Demo: Has comprehensive data integrity/shape/content tests and an end-to-end demo script to walk through generation, validation, benchmarking, and result reporting.

What's Missing to Solve a Kaggle-like Problem?

  • The current integration focuses on infrastructure benchmarking: It shows how quickly and efficiently each framework multiplies large matrices representing gene expression, but does not yet provide end-to-end problem-solving for:
    • Prediction: Classifying samples (e.g., disease vs. control), regression, clustering, or any biological analysis typical in Kaggle competitions.
    • Actionable Output: No scoring against a ground truth, display of accuracy, leaderboard, etc.

Concrete Next Steps/Recommendations

  1. Choose a Standard ML Task
    • Implement an example problem scenario using the real dataset, e.g.:
      • Gene expression classification: Disease prediction from expression vectors (supervised ML)
      • Clustering: Discover gene modules/groups
      • Dimensionality reduction (PCA/SVD)
    • Identify a public (or simulated) target variable to predict, so you can report metrics like accuracy, ROC AUC, or cluster purity. Load a small real Kaggle dataset (for testing) if feasible.
  2. Actionable Workflow Implementation
    • Extend the demo to run an actual ML pipeline:
      • Data prep (already present)
      • Fit a model (scikit-learn, XGBoost, Dask-ML, or similar)
      • Evaluate and display performance compared to Dask (or other frameworks)
      • Report runtime and solution quality — “not just computation.”
  3. Scenario Documentation
    • Provide a worked example or notebook with:
      • Detailed workflow description (“Solve disease classification using gene expression data”)
      • Steps: prepare, train, score, compare (Paper vs Dask)
      • Actionable results: “Paper model scored 90% accuracy in 80s vs Dask 85% in 90s”

Summary Table: Required vs Present

| Capability | Implemented in PR | Still Needed |
| --- | --- | --- |
| Matrix computation benchmarking | Yes | |
| Realistic data integration | Yes | |
| ML problem scenario | No | Needs classification/regression example |
| Actionable results (accuracy etc.) | No | Must report solution quality |
| Comparison with Dask on problem | Partial | Complete with real use-case metrics |

<agent_instructions>use gene expression classification only! and do the task. choose minimal framework changes and lean testing.</agent_instructions>

Comments on the Issue (you are @copilot in this section)



Footnotes

  1. https://github.com/j143/ooc/pull/41/files

Copilot AI and others added 3 commits October 31, 2025 13:48
Co-authored-by: j143 <53068787+j143@users.noreply.github.com>
Copilot AI changed the title [WIP] Add testing capabilities for ML task integration Add gene expression classification ML task with actionable metrics Oct 31, 2025
Copilot AI requested a review from j143 October 31, 2025 14:01
Owner

j143 commented Oct 31, 2025

@copilot - right now we are using ML from another framework, but we need to test the ML workload as well. Choose the lean option to implement (the missing operators) and run the algorithm: pick linear regression or similar low-hanging fruit, so that as much of the ML workload as possible runs on this ooc for benchmarking.

Our goal isn't to replace specialized libraries like PyTorch, NumPy, or SciPy, but to supplement them with simple changes!

Contributor Author

Copilot AI commented Oct 31, 2025

> @copilot - right now, we are using ML from another framework. but we need to test the ML workload also, choose the lean option to implement (missing operators) and run the algorithm -- choose linear r...

Understood. I'll implement linear regression directly using Paper's operators (matrix multiplication, addition, scalar multiplication, transpose) instead of using scikit-learn. This will benchmark Paper's ML workload capabilities.

I'll add the missing subtraction operator and implement gradient descent-based linear regression as a lean solution. This keeps changes minimal while demonstrating ML workload on Paper's out-of-core framework.

Copilot AI changed the title Add gene expression classification ML task with actionable metrics Implement ML workload using Paper's operators with gene expression classification Oct 31, 2025