Add two regression datasets: California Housing and Diabetes#39

Open
Shu-Wan wants to merge 1 commit into main from feat/cb+-regression-datasets

Conversation

@Shu-Wan
Member

@Shu-Wan Shu-Wan commented Feb 6, 2026

Summary

This PR adds two classic regression datasets from sklearn to CausalBench for demo and testing purposes.

Datasets Added

1. California Housing Dataset

  • Samples: 20,640
  • Columns: 9 (8 features: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude; target: MedHouseVal)
  • Task: Regression - predict median house values in California districts
  • Source: sklearn.datasets.fetch_california_housing

2. Diabetes Dataset

  • Samples: 442
  • Columns: 11 (10 features: age, sex, bmi, bp, s1-s6; target: target)
  • Task: Regression - predict disease progression from physiological variables
  • Source: sklearn.datasets.load_diabetes

Changes

Dataset Files

Each dataset includes:

  • CSV data file with all features and target
  • config.yaml configuration file following CausalBench schema
  • download_data.py script to regenerate data from sklearn
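For readers who want a feel for what a regeneration script of this kind might look like, here is a minimal sketch for the Diabetes dataset. It is not the `download_data.py` from this PR: the function names `write_dataset_csv` and `regenerate_diabetes_csv` are hypothetical, and the column order (features first, then `target`) is an assumption based on the column list above. Only `sklearn.datasets.load_diabetes` and its `data`, `target`, and `feature_names` fields come from scikit-learn itself.

```python
import csv


def write_dataset_csv(csv_path, feature_names, data, target, target_name="target"):
    """Write feature rows plus a target column to a CSV file with a header row."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(list(feature_names) + [target_name])
        for row, value in zip(data, target):
            # Convert to plain floats so the output does not depend on array types.
            writer.writerow([float(v) for v in row] + [float(value)])


def regenerate_diabetes_csv(csv_path="diabetes_data.csv"):
    """Hypothetical entry point: rebuild the Diabetes CSV from scikit-learn."""
    # Imported lazily so write_dataset_csv stays usable without scikit-learn.
    from sklearn.datasets import load_diabetes

    bunch = load_diabetes()
    write_dataset_csv(csv_path, bunch.feature_names, bunch.data, bunch.target)
```

Calling `regenerate_diabetes_csv()` would write 442 data rows and 11 columns, provided scikit-learn is installed. The same helper could serve California Housing by swapping in `fetch_california_housing`, which downloads the data on first use.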

Deliverables

  • ✅ Two regression datasets in causalbench-asu/tests/data/
  • ✅ Compressed .zip files for each dataset
  • ✅ Updated README.md with dataset information
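As a rough illustration of how the per-dataset `.zip` files could be produced, the sketch below archives every file in a dataset directory. It is not the repository's `zip_files.py`: the function name `zip_dataset` is hypothetical, and storing entries relative to the dataset directory (with no top-level folder inside the archive) is an assumption about the expected layout.

```python
import zipfile
from pathlib import Path


def zip_dataset(dataset_dir, zip_path):
    """Archive all files under dataset_dir into zip_path and return the zip path."""
    dataset_dir = Path(dataset_dir)
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Sorted traversal keeps the entry order stable between runs.
        for file_path in sorted(dataset_dir.rglob("*")):
            if file_path.is_file():
                # Assumption: entries are stored relative to the dataset directory.
                arcname = file_path.relative_to(dataset_dir).as_posix()
                archive.write(file_path, arcname=arcname)
    return zip_path
```

For example, `zip_dataset("causalbench-asu/tests/data/diabetes", "causalbench-asu/tests/data/diabetes.zip")` would match the paths listed in this PR. The archive should be written outside the directory being zipped so it does not try to include itself.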

Design Decisions

  • No causal adjacency matrices: These are only required for causal discovery tasks, not regression tasks
  • Classic sklearn datasets: Well-defined, documented, appropriate size for demos
  • Standalone regression tasks: Configured explicitly as regression tasks in descriptions

Testing

All datasets successfully load through the CausalBench framework:

✅ California Housing Dataset: PASSED
✅ Diabetes Dataset: PASSED

Total: 2/2 tests passed

@kapkic
Collaborator

kapkic commented Feb 6, 2026

LGTM

@Shu-Wan Shu-Wan marked this pull request as ready for review February 6, 2026 18:15
Copilot AI review requested due to automatic review settings February 6, 2026 18:15
Copilot AI left a comment

Pull request overview

This PR adds two sklearn-sourced regression datasets (California Housing and Diabetes) to the CausalBench test data bundle for demos/testing, along with lightweight download scripts and README documentation updates.

Changes:

  • Added california_housing dataset config + regeneration script (and accompanying data/zip artifacts).
  • Added diabetes dataset config + CSV + regeneration script (and accompanying zip artifact).
  • Updated README dataset table; minor formatting cleanup in zip_files.py.

Reviewed changes

Copilot reviewed 7 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| causalbench-asu/tests/zip_files.py | Minor formatting / quoting updates for zip utility. |
| causalbench-asu/tests/data/diabetes/download_data.py | Script to regenerate the Diabetes CSV from sklearn. |
| causalbench-asu/tests/data/diabetes/diabetes_data.csv | Added Diabetes dataset CSV. |
| causalbench-asu/tests/data/diabetes/config.yaml | Added dataset config for Diabetes. |
| causalbench-asu/tests/data/diabetes.zip | Added packaged dataset zip. |
| causalbench-asu/tests/data/california_housing/download_data.py | Script to regenerate the California Housing CSV from sklearn. |
| causalbench-asu/tests/data/california_housing/config.yaml | Added dataset config for California Housing. |
| README.md | Updated dataset list table to include the new datasets + minor whitespace cleanup. |

Comment on lines +16 to +56
  type: continuous
  data: decimal
sex:
  header: sex
  type: continuous
  data: decimal
bmi:
  header: bmi
  type: continuous
  data: decimal
bp:
  header: bp
  type: continuous
  data: decimal
s1:
  header: s1
  type: continuous
  data: decimal
s2:
  header: s2
  type: continuous
  data: decimal
s3:
  header: s3
  type: continuous
  data: decimal
s4:
  header: s4
  type: continuous
  data: decimal
s5:
  header: s5
  type: continuous
  data: decimal
s6:
  header: s6
  type: continuous
  data: decimal
target:
  header: target
  type: continuous
Copilot AI Feb 6, 2026

The column type values are set to continuous, but other dataset configs in this repo use values like ratio/nominal (e.g., tests/data/panama/config.yaml). If the dataset loader only recognizes the existing enum values, continuous will fail schema validation or parsing. Please align these column type fields with the accepted values used elsewhere (e.g., use ratio for numeric continuous variables) or update the loader/schema to explicitly support continuous.

Suggested change
  type: ratio
  data: decimal
sex:
  header: sex
  type: ratio
  data: decimal
bmi:
  header: bmi
  type: ratio
  data: decimal
bp:
  header: bp
  type: ratio
  data: decimal
s1:
  header: s1
  type: ratio
  data: decimal
s2:
  header: s2
  type: ratio
  data: decimal
s3:
  header: s3
  type: ratio
  data: decimal
s4:
  header: s4
  type: ratio
  data: decimal
s5:
  header: s5
  type: ratio
  data: decimal
s6:
  header: s6
  type: ratio
  data: decimal
target:
  header: target
  type: ratio
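
The concern raised in this comment can be checked mechanically. The sketch below scans a config's text for `type:` lines whose value falls outside an allowed set. The allowed values (`ratio`, `nominal`) are taken only from the examples named in the comment, not from a confirmed CausalBench schema, and the function name `find_unsupported_types` is hypothetical. It uses a plain line scan rather than a YAML parser, so it is a quick lint, not a validator.

```python
import re

# Assumption: these are the values mentioned in the review comment; the real
# schema may accept others, or may not define a column "type" at all.
ASSUMED_ALLOWED_TYPES = {"ratio", "nominal"}

# Matches lines such as "  type: continuous" and captures the value.
TYPE_LINE = re.compile(r"^\s*type:\s*(\S+)\s*$")


def find_unsupported_types(config_text, allowed=ASSUMED_ALLOWED_TYPES):
    """Return (line_number, value) pairs for type values outside the allowed set."""
    problems = []
    for line_number, line in enumerate(config_text.splitlines(), start=1):
        match = TYPE_LINE.match(line)
        if match and match.group(1) not in allowed:
            problems.append((line_number, match.group(1)))
    return problems


sample = """\
age:
  header: age
  type: continuous
  data: decimal
sex:
  header: sex
  type: ratio
  data: decimal
"""
print(find_unsupported_types(sample))  # [(3, 'continuous')]
```

If the schema turns out to have no `type` field, the same scan with an empty `allowed` set would list every line that needs to be removed.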

Collaborator

Actually, this is important. @prat-man @Shu-Wan I don't recall that we have any "type" field for datasets. We may need to remove these fields. Please confirm.

Comment on lines +16 to +48
  type: continuous
  data: decimal
HouseAge:
  header: HouseAge
  type: continuous
  data: decimal
AveRooms:
  header: AveRooms
  type: continuous
  data: decimal
AveBedrms:
  header: AveBedrms
  type: continuous
  data: decimal
Population:
  header: Population
  type: continuous
  data: decimal
AveOccup:
  header: AveOccup
  type: continuous
  data: decimal
Latitude:
  header: Latitude
  type: continuous
  data: decimal
Longitude:
  header: Longitude
  type: continuous
  data: decimal
MedHouseVal:
  header: MedHouseVal
  type: continuous
Copilot AI Feb 6, 2026

The column type fields use continuous, which is inconsistent with existing dataset configs that use values like ratio/nominal (see tests/data/panama/config.yaml, tests/data/abalone/config.yaml). If continuous isn't a recognized schema value, this dataset config may not load. Please switch these to the supported type value (likely ratio for these numeric columns) or extend the schema/loader to accept continuous.

Suggested change
  type: ratio
  data: decimal
HouseAge:
  header: HouseAge
  type: ratio
  data: decimal
AveRooms:
  header: AveRooms
  type: ratio
  data: decimal
AveBedrms:
  header: AveBedrms
  type: ratio
  data: decimal
Population:
  header: Population
  type: ratio
  data: decimal
AveOccup:
  header: AveOccup
  type: ratio
  data: decimal
Latitude:
  header: Latitude
  type: ratio
  data: decimal
Longitude:
  header: Longitude
  type: ratio
  data: decimal
MedHouseVal:
  header: MedHouseVal
  type: ratio

Comment on lines 60 to 70
## Support Benchmark Context
CausalBench is structured to support different machine learning tasks and dataset types. With user contribution, the supported context will be expanded, currently (as of 8/12/25), these models and tasks are provided.

| Dataset | File | Description |
|-----------------------|----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
| Abalone | data, static graph | |
| Adult | data, static graph | |
| Sachs | data, static graph | |
| California Housing | data | Regression dataset from sklearn with 20,640 samples predicting median house values in California districts |
| Diabetes | data | Regression dataset from sklearn with 442 samples predicting disease progression from physiological variables |
| NetSim | data, static graph | Brain FMRI scan<br/> - 28 simulations <br/> - Each has different DGPs, num of nodes (5, 50), num of observations (50 to 5000), 1400 datasets in total |
Copilot AI Feb 6, 2026

PR description mentions a new test_regression_datasets.py test script, but it doesn't appear to be included in this change set (and isn't present under causalbench-asu/tests/). If automated coverage for loading these datasets is intended, please add the test file (or update the PR description to match what’s actually being delivered).
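
If a test file is added, one lightweight starting point is a shape check on the generated CSVs. The sketch below is not the missing `test_regression_datasets.py`: the helper name `csv_shape` is hypothetical, it assumes each CSV has a single header row, and it only inspects the file on disk rather than loading the dataset through the CausalBench framework.

```python
import csv


def csv_shape(csv_path):
    """Return (data_row_count, column_names) for a CSV with one header row."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        row_count = sum(1 for _ in reader)
    return row_count, header


def assert_csv_shape(csv_path, expected_rows, expected_columns):
    """Raise AssertionError if the CSV does not have the expected shape."""
    row_count, header = csv_shape(csv_path)
    assert row_count == expected_rows, f"expected {expected_rows} rows, got {row_count}"
    assert len(header) == expected_columns, (
        f"expected {expected_columns} columns, got {len(header)}"
    )
```

Using the figures from the PR description, the Diabetes check would be `assert_csv_shape("causalbench-asu/tests/data/diabetes/diabetes_data.csv", 442, 11)`, and the California Housing CSV would be expected to have 20,640 rows and 9 columns.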
