
Conversation

Contributor

Copilot AI commented Oct 31, 2025

The framework had infrastructure benchmarking (matrix multiplication timing) but lacked end-to-end ML problem-solving with actionable results. This PR adds a complete gene expression classification workflow that implements the ML algorithms with Paper's out-of-core operators instead of external ML libraries, demonstrating that Paper can handle ML workloads directly.

Changes

Paper ML Module (paper_ml.py) - NEW

  • Linear regression and logistic regression implemented using only Paper's operators
  • Gradient descent training using matrix operations: @ (matmul), .T (transpose), * (scalar mult), + (add), - (sub); see the sketch after this list
  • No external ML libraries for algorithms (scikit-learn only for utilities like train_test_split)
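As a concrete illustration of the gradient descent bullet above, here is a minimal sketch of that training loop, written against plain NumPy arrays for readability. The function names and signatures are illustrative, not paper_ml.py's actual API; a Paper array exposing the same @, .T, *, +, and - operators could be dropped in instead. Only the sigmoid nonlinearity falls outside the listed operators.

```python
import numpy as np

def sigmoid(z):
    # elementwise logistic function (the one step outside the listed operators)
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, iterations=20, learning_rate=0.01):
    """Batch gradient descent built from @, .T, scalar *, +, and - alone."""
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    for _ in range(iterations):
        predictions = sigmoid(X @ weights)            # matmul
        error = predictions - y                       # the new '-' operator
        gradient = (1.0 / n_samples) * (X.T @ error)  # .T, matmul, scalar *
        weights = weights - learning_rate * gradient  # '-' again
    return weights
```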

New Operator Added (paper/numpy_api.py)

  • Subtraction operator (-) implemented as A - B = A + (-1 * B)
  • Enables gradient descent: weights = weights - learning_rate * gradient (see the sketch below)
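A sketch of how the subtraction operator can be derived from the existing add and scalar-multiply operators. PaperArray here is an illustrative in-memory stand-in, not the real class in paper/numpy_api.py:

```python
import numpy as np

class PaperArray:
    """Illustrative stand-in for Paper's array wrapper, not the real class."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def __add__(self, other):
        # existing elementwise add (out-of-core in Paper; in-memory here)
        return PaperArray(self.data + other.data)

    def __rmul__(self, scalar):
        # existing scalar multiply
        return PaperArray(scalar * self.data)

    def __sub__(self, other):
        # A - B expressed as A + (-1 * B): reuses existing operators,
        # so no new out-of-core kernel is required
        return self + (-1 * other)
```

Because A - B is rewritten in terms of operators the framework already provides, an update like weights = weights - learning_rate * gradient goes through the same add and scalar-multiply paths with no new chunking logic.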

Core ML Pipeline (ml_classification.py)

  • Disease/control binary classification using Paper's logistic regression implementation
  • Data loading via Paper or Dask frameworks
  • Evaluation metrics: accuracy, ROC AUC (an evaluation sketch follows this list)
  • Benchmarks Paper's ML workload capability, not just data loading
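A sketch of the pipeline's evaluation step, reusing sigmoid and train_logistic_regression from the sketch above. The synthetic data generation is illustrative only (ml_classification.py loads real files); scikit-learn appears solely for the split and the metrics:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
X = rng.lognormal(size=(500, 50))              # stand-in expression matrix
X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize features
y = (X @ rng.normal(size=50) > 0).astype(int)  # synthetic disease/control labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

weights = train_logistic_regression(X_train, y_train,
                                    iterations=200, learning_rate=0.1)
probabilities = sigmoid(X_test @ weights)

print(f"accuracy: {accuracy_score(y_test, probabilities >= 0.5):.3f}")
print(f"ROC AUC:  {roc_auc_score(y_test, probabilities):.3f}")
```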

Example & Documentation

  • examples/ml_classification_example.py - Standalone demonstration
  • ML_TASK.md - Complete workflow documentation updated to emphasize Paper operators
  • Extended demo_real_dataset.py with ML step

Testing

  • 10 new tests covering Paper's ML implementations (label generation, training, evaluation); a representative test is sketched below
  • All 84 tests passing
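A representative test in that lean spirit (illustrative, not copied from the suite), again reusing the training sketch above:

```python
import numpy as np

def test_gradient_descent_fits_separable_data():
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 5))
    y = (X @ np.ones(5) > 0).astype(int)   # linearly separable labels
    w = train_logistic_regression(X, y, iterations=100, learning_rate=0.1)
    predictions = sigmoid(X @ w) >= 0.5
    assert (predictions == y).mean() > 0.9  # illustrative threshold
```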

Usage

# Run classification with Paper's operators
python ml_classification.py \
  --data-path gene_expression.dat \
  --shape 5000 5000 \
  --iterations 20 \
  --learning-rate 0.01

# Compare frameworks on ML task (both use Paper operators)
python ml_classification.py \
  --data-path gene_expression.dat \
  --shape 5000 5000 \
  --hdf5-path data.hdf5 \
  --compare
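For orientation, the flag handling these commands imply might look like the following argparse sketch; the actual parser in ml_classification.py may differ in names and defaults.

```python
import argparse

parser = argparse.ArgumentParser(description="Gene expression classification")
parser.add_argument("--data-path", required=True)
parser.add_argument("--shape", nargs=2, type=int, metavar=("ROWS", "COLS"))
parser.add_argument("--iterations", type=int, default=20)
parser.add_argument("--learning-rate", type=float, default=0.01)
parser.add_argument("--hdf5-path", help="HDF5 copy of the data, for --compare")
parser.add_argument("--compare", action="store_true",
                    help="run the same ML task under both frameworks")
args = parser.parse_args()
```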

Output includes accuracy, ROC AUC, and timing, demonstrating Paper's ML computation capability and solution quality.

What This Achieves

  • ML algorithms run on Paper's operators, not external libraries
  • Benchmarks Paper's ML workload: matrix ops for gradient descent
  • Demonstrates out-of-core ML capability: Paper handles the ML computation
  • Minimal framework changes: only the subtraction operator added

This implementation shows that Paper is more than a data-loading framework: it can perform ML computations using its own out-of-core operators.

Dependencies

scikit-learn is used for utilities only (train_test_split, metrics); dask[array] and psutil are used for benchmarking (both already required by the existing benchmark scripts). The ML algorithms themselves are implemented with Paper's operators.

Original prompt

This section describes the original issue you should resolve.

<issue_title>test with an ML task</issue_title>
<issue_description>Current Capabilities

  • Dataset Integration: Adds realistic gene expression datasets (synthetic but structurally identical to real RNA-seq data) with biological characteristics [gene modules, log-normal distribution, etc.].[1]
  • Benchmarking: Provides a direct performance comparison (matrix multiplication and I/O) of Paper vs. Dask using identical real datasets, with multi-run stats output for time, memory, and CPU usage.
  • Data Prep Pipeline: Supports generation and conversion from HDF5, NumPy, CSV, etc., and robust validation. Utilities and CLI interfaces included.[1]
  • Testing/Demo: Has comprehensive data integrity/shape/content tests and an end-to-end demo script to walk through generation, validation, benchmarking, and result reporting.

What's Missing to Solve a Kaggle-like Problem?

  • The current integration focuses on infrastructure benchmarking: It shows how quickly and efficiently each framework multiplies large matrices representing gene expression, but does not yet provide end-to-end problem-solving for:
    • Prediction: Classifying samples (e.g., disease vs. control), regression, clustering, or any biological analysis typical in Kaggle competitions.
    • Actionable Output: No scoring against a ground truth, display of accuracy, leaderboard, etc.

Concrete Next Steps/Recommendations

  1. Choose a Standard ML Task
    • Implement an example problem scenario using the real dataset, e.g.:
      • Gene expression classification: Disease prediction from expression vectors (supervised ML)
      • Clustering: Discover gene modules/groups
      • Dimensionality reduction (PCA/SVD)
    • Identify a public (or simulated) target variable to predict, so you can report metrics like accuracy, ROC AUC, or cluster purity. Load a small real Kaggle dataset (for testing) if feasible.
  2. Actionable Workflow Implementation
    • Extend the demo to run an actual ML pipeline:
      • Data prep (already present)
      • Fit a model (scikit-learn, XGBoost, Dask-ML, or similar)
      • Evaluate and display performance compared to Dask (or other frameworks)
      • Report runtime and solution quality — “not just computation.”
  3. Scenario Documentation
    • Provide a worked example or notebook with:
      • Detailed workflow description (“Solve disease classification using gene expression data”)
      • Steps: prepare, train, score, compare (Paper vs Dask)
      • Actionable results: “Paper model scored 90% accuracy in 80s vs Dask 85% in 90s”

Summary Table: Required vs Present

| Capability | Implemented in PR | Still Needed |
| --- | --- | --- |
| Matrix computation benchmarking | Yes | |
| Realistic data integration | Yes | |
| ML problem scenario | No | Needs classification/regression example |
| Actionable results (accuracy etc.) | No | Must report solution quality |
| Comparison with Dask on problem | Partial | Complete with real use-case metrics |

<agent_instructions>use gene expression classification only! and do the task. choose minimal framework changes and lean testing.</agent_instructions>

Comments on the Issue (you are @copilot in this section)



Footnotes

  1. https://github.com/j143/ooc/pull/41/files

Copilot AI and others added 3 commits October 31, 2025 13:48
Co-authored-by: j143 <53068787+j143@users.noreply.github.com>
Copilot AI changed the title [WIP] Add testing capabilities for ML task integration Add gene expression classification ML task with actionable metrics Oct 31, 2025
Copilot AI requested a review from j143 October 31, 2025 14:01
Owner

j143 commented Oct 31, 2025

@copilot - right now we are using ML from another framework, but we need to test the ML workload as well. Choose the lean option to implement (the missing operators) and run the algorithm: pick linear regression or similar low-hanging fruit, so that as much of the ML workload as possible runs on this ooc for benchmarking.

Our goal isn't to replace specialized libraries like PyTorch, NumPy, or SciPy, but to supplement them with simple changes!

Contributor Author

Copilot AI commented Oct 31, 2025

> @copilot - right now, we are using ML from another framework. but we need to test the ML workload also, choose the lean option to implement (missing operators) and run the algorithm -- choose linear r...

Understood. I'll implement linear regression directly using Paper's operators (matrix multiplication, addition, scalar multiplication, transpose) instead of using scikit-learn. This will benchmark Paper's ML workload capabilities.

I'll add the missing subtraction operator and implement gradient descent-based linear regression as a lean solution. This keeps changes minimal while demonstrating ML workload on Paper's out-of-core framework.

Copilot AI changed the title Add gene expression classification ML task with actionable metrics Implement ML workload using Paper's operators with gene expression classification Oct 31, 2025