Add two regression datasets: California Housing and Diabetes by Shu-Wan · Pull Request #39 · CausalBenchOrg/CausalBench

Shu-Wan · 2026-02-06T01:31:37Z

Summary

This PR adds two classic regression datasets from sklearn to CausalBench for demo and testing purposes.

Datasets Added

1. California Housing Dataset

Samples: 20,640
Columns: 9 (8 features: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude; plus the target MedHouseVal)
Task: Regression - predict median house values in California districts
Source: sklearn.datasets.fetch_california_housing

2. Diabetes Dataset

Samples: 442
Columns: 11 (10 features: age, sex, bmi, bp, s1-s6; plus the target column, named target)
Task: Regression - predict disease progression from physiological variables
Source: sklearn.datasets.load_diabetes

Changes

Dataset Files

Each dataset includes:

CSV data file with all features and target
config.yaml configuration file following CausalBench schema
download_data.py script to regenerate data from sklearn
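
The PR's download_data.py itself is not shown in this thread. As a minimal sketch of what such a regeneration script could look like for the Diabetes dataset, assuming the CSV holds the ten sklearn feature columns followed by a `target` column: the helper name `write_dataset_csv`, the function `main`, and the default output file name are illustrative choices, while `sklearn.datasets.load_diabetes` and its `feature_names`, `data`, and `target` fields are the real sklearn API.

```python
import csv


def write_dataset_csv(feature_names, data, target, out_file, target_name="target"):
    """Write one CSV row per sample: the feature values followed by the target.

    Returns the number of data rows written (the header is not counted).
    """
    writer = csv.writer(out_file)
    writer.writerow(list(feature_names) + [target_name])
    count = 0
    for row, y in zip(data, target):
        # float() turns NumPy scalars into plain Python floats before writing.
        writer.writerow([float(v) for v in row] + [float(y)])
        count += 1
    return count


def main(out_path="diabetes_data.csv"):
    # Imported lazily so the helper above can be used without scikit-learn.
    from sklearn.datasets import load_diabetes

    bunch = load_diabetes()
    with open(out_path, "w", newline="") as f:
        n = write_dataset_csv(bunch.feature_names, bunch.data, bunch.target, f)
    print(f"wrote {n} rows to {out_path}")
```

Calling `main()` needs scikit-learn installed. The same helper would work for California Housing with `fetch_california_housing` and `target_name="MedHouseVal"`, although that loader downloads the data on first use.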

Deliverables

✅ Two regression datasets in causalbench-asu/tests/data/
✅ Compressed .zip files for each dataset
✅ Updated README.md with dataset information
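
The repository's zip_files.py is not reproduced in this thread, so the following is only a sketch of how a dataset folder could be packaged into a sibling .zip using the standard library. The function name `zip_dataset` and the choice to store files at the archive root (rather than under a folder named after the dataset) are assumptions, not the project's confirmed layout.

```python
import zipfile
from pathlib import Path


def zip_dataset(dataset_dir, zip_path=None):
    """Pack every file under dataset_dir into a zip archive and return its path.

    By default the archive is written next to the folder, e.g.
    tests/data/diabetes -> tests/data/diabetes.zip.
    """
    dataset_dir = Path(dataset_dir)
    zip_path = Path(zip_path) if zip_path else dataset_dir.with_suffix(".zip")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(dataset_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the dataset folder (assumed layout).
                zf.write(path, arcname=path.relative_to(dataset_dir).as_posix())
    return zip_path
```

For example, `zip_dataset("causalbench-asu/tests/data/diabetes")` would produce `causalbench-asu/tests/data/diabetes.zip`, which matches the artifact path listed in the review summary.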

Design Decisions

No causal adjacency matrices: These are only required for causal discovery tasks, not regression tasks
Classic sklearn datasets: Well-defined, documented, appropriate size for demos
Standalone regression tasks: Configured explicitly as regression tasks in descriptions

Testing

All datasets successfully load through the CausalBench framework:

✅ California Housing Dataset: PASSED
✅ Diabetes Dataset: PASSED

Total: 2/2 tests passed

kapkic · 2026-02-06T15:36:05Z

LGTM

Copilot

Pull request overview

This PR adds two sklearn-sourced regression datasets (California Housing and Diabetes) to the CausalBench test data bundle for demos/testing, along with lightweight download scripts and README documentation updates.

Changes:

Added california_housing dataset config + regeneration script (and accompanying data/zip artifacts).
Added diabetes dataset config + CSV + regeneration script (and accompanying zip artifact).
Updated README dataset table; minor formatting cleanup in zip_files.py.

Reviewed changes

Copilot reviewed 7 out of 10 changed files in this pull request and generated 3 comments.

File	Description
causalbench-asu/tests/zip_files.py	Minor formatting / quoting updates for zip utility.
causalbench-asu/tests/data/diabetes/download_data.py	Script to regenerate the Diabetes CSV from sklearn.
causalbench-asu/tests/data/diabetes/diabetes_data.csv	Added Diabetes dataset CSV.
causalbench-asu/tests/data/diabetes/config.yaml	Added dataset config for Diabetes.
causalbench-asu/tests/data/diabetes.zip	Added packaged dataset zip.
causalbench-asu/tests/data/california_housing/download_data.py	Script to regenerate the California Housing CSV from sklearn.
causalbench-asu/tests/data/california_housing/config.yaml	Added dataset config for California Housing.
README.md	Updated dataset list table to include the new datasets + minor whitespace cleanup.

Copilot · 2026-02-06T18:20:20Z

causalbench-asu/tests/data/diabetes/config.yaml

+                type: continuous
+                data: decimal
+            sex:
+                header: sex
+                type: continuous
+                data: decimal
+            bmi:
+                header: bmi
+                type: continuous
+                data: decimal
+            bp:
+                header: bp
+                type: continuous
+                data: decimal
+            s1:
+                header: s1
+                type: continuous
+                data: decimal
+            s2:
+                header: s2
+                type: continuous
+                data: decimal
+            s3:
+                header: s3
+                type: continuous
+                data: decimal
+            s4:
+                header: s4
+                type: continuous
+                data: decimal
+            s5:
+                header: s5
+                type: continuous
+                data: decimal
+            s6:
+                header: s6
+                type: continuous
+                data: decimal
+            target:
+                header: target
+                type: continuous


The column type values are set to continuous, but other dataset configs in this repo use values like ratio/nominal (e.g., tests/data/panama/config.yaml). If the dataset loader only recognizes the existing enum values, continuous will fail schema validation or parsing. Please align these column type fields with the accepted values used elsewhere (e.g., use ratio for numeric continuous variables) or update the loader/schema to explicitly support continuous.

Suggested change

-                type: continuous
-                data: decimal
-            sex:
-                header: sex
-                type: continuous
-                data: decimal
-            bmi:
-                header: bmi
-                type: continuous
-                data: decimal
-            bp:
-                header: bp
-                type: continuous
-                data: decimal
-            s1:
-                header: s1
-                type: continuous
-                data: decimal
-            s2:
-                header: s2
-                type: continuous
-                data: decimal
-            s3:
-                header: s3
-                type: continuous
-                data: decimal
-            s4:
-                header: s4
-                type: continuous
-                data: decimal
-            s5:
-                header: s5
-                type: continuous
-                data: decimal
-            s6:
-                header: s6
-                type: continuous
-                data: decimal
-            target:
-                header: target
-                type: continuous
+                type: ratio
+                data: decimal
+            sex:
+                header: sex
+                type: ratio
+                data: decimal
+            bmi:
+                header: bmi
+                type: ratio
+                data: decimal
+            bp:
+                header: bp
+                type: ratio
+                data: decimal
+            s1:
+                header: s1
+                type: ratio
+                data: decimal
+            s2:
+                header: s2
+                type: ratio
+                data: decimal
+            s3:
+                header: s3
+                type: ratio
+                data: decimal
+            s4:
+                header: s4
+                type: ratio
+                data: decimal
+            s5:
+                header: s5
+                type: ratio
+                data: decimal
+            s6:
+                header: s6
+                type: ratio
+                data: decimal
+            target:
+                header: target
+                type: ratio

kapkic · 2026-02-08

Actually, this is important. @prat-man @Shu-Wan I don't recall that we have any "type" field for datasets. We may need to remove them. Please confirm.
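
Whether the loader validates `type` at all is left open in this thread. As a hedged illustration of the concern, the sketch below scans an already-parsed `columns` mapping and reports any `type` value outside an allowed set. The allowed set `{"ratio", "nominal"}` is an assumption taken only from the values the review names, and `find_unsupported_types` is a hypothetical helper, not part of CausalBench.

```python
# Assumption: only the values named in the review are accepted by the loader.
ALLOWED_TYPES = frozenset({"ratio", "nominal"})


def find_unsupported_types(columns, allowed=ALLOWED_TYPES):
    """Return {column_name: type_value} for columns whose type is not allowed.

    Columns without a "type" key are ignored rather than reported.
    """
    return {
        name: spec["type"]
        for name, spec in columns.items()
        if "type" in spec and spec["type"] not in allowed
    }


# Two entries shaped like the config hunk above, one with each type value.
columns = {
    "bmi": {"header": "bmi", "type": "continuous", "data": "decimal"},
    "bp": {"header": "bp", "type": "ratio", "data": "decimal"},
}
print(find_unsupported_types(columns))  # {'bmi': 'continuous'}
```

If the maintainers confirm that datasets carry no `type` field, the same scan could instead be used to list the columns from which the key should be removed.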

Copilot · 2026-02-06T18:20:21Z

causalbench-asu/tests/data/california_housing/config.yaml

+                type: continuous
+                data: decimal
+            HouseAge:
+                header: HouseAge
+                type: continuous
+                data: decimal
+            AveRooms:
+                header: AveRooms
+                type: continuous
+                data: decimal
+            AveBedrms:
+                header: AveBedrms
+                type: continuous
+                data: decimal
+            Population:
+                header: Population
+                type: continuous
+                data: decimal
+            AveOccup:
+                header: AveOccup
+                type: continuous
+                data: decimal
+            Latitude:
+                header: Latitude
+                type: continuous
+                data: decimal
+            Longitude:
+                header: Longitude
+                type: continuous
+                data: decimal
+            MedHouseVal:
+                header: MedHouseVal
+                type: continuous


The column type fields use continuous, which is inconsistent with existing dataset configs that use values like ratio/nominal (see tests/data/panama/config.yaml, tests/data/abalone/config.yaml). If continuous isn't a recognized schema value, this dataset config may not load. Please switch these to the supported type value (likely ratio for these numeric columns) or extend the schema/loader to accept continuous.

Suggested change

-                type: continuous
-                data: decimal
-            HouseAge:
-                header: HouseAge
-                type: continuous
-                data: decimal
-            AveRooms:
-                header: AveRooms
-                type: continuous
-                data: decimal
-            AveBedrms:
-                header: AveBedrms
-                type: continuous
-                data: decimal
-            Population:
-                header: Population
-                type: continuous
-                data: decimal
-            AveOccup:
-                header: AveOccup
-                type: continuous
-                data: decimal
-            Latitude:
-                header: Latitude
-                type: continuous
-                data: decimal
-            Longitude:
-                header: Longitude
-                type: continuous
-                data: decimal
-            MedHouseVal:
-                header: MedHouseVal
-                type: continuous
+                type: ratio
+                data: decimal
+            HouseAge:
+                header: HouseAge
+                type: ratio
+                data: decimal
+            AveRooms:
+                header: AveRooms
+                type: ratio
+                data: decimal
+            AveBedrms:
+                header: AveBedrms
+                type: ratio
+                data: decimal
+            Population:
+                header: Population
+                type: ratio
+                data: decimal
+            AveOccup:
+                header: AveOccup
+                type: ratio
+                data: decimal
+            Latitude:
+                header: Latitude
+                type: ratio
+                data: decimal
+            Longitude:
+                header: Longitude
+                type: ratio
+                data: decimal
+            MedHouseVal:
+                header: MedHouseVal
+                type: ratio

Copilot · 2026-02-06T18:20:21Z

README.md

 ## Support Benchmark Context
-CausalBench is structured to support different machine learning tasks and dataset types. With user contribution, the supported context will be expanded, currently (as of 8/12/25), these models and tasks are provided.    
+CausalBench is structured to support different machine learning tasks and dataset types. With user contribution, the supported context will be expanded, currently (as of 8/12/25), these models and tasks are provided.

 | Dataset               | File                 | Description                                                                                                                                           |
 |-----------------------|----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
 | Abalone               | data, static graph   |                                                                                                                                                       |
 | Adult                 | data, static graph   |                                                                                                                                                       |
 | Sachs                 | data, static graph   |                                                                                                                                                       |
+| California Housing    | data                 | Regression dataset from sklearn with 20,640 samples predicting median house values in California districts                                            |
+| Diabetes              | data                 | Regression dataset from sklearn with 442 samples predicting disease progression from physiological variables                                          |
 | NetSim                | data, static graph   | Brain FMRI scan<br/> - 28 simulations <br/> - Each has different DGPs, num of nodes (5, 50), num of observations (50 to 5000), 1400 datasets in total |


PR description mentions a new test_regression_datasets.py test script, but it doesn't appear to be included in this change set (and isn't present under causalbench-asu/tests/). If automated coverage for loading these datasets is intended, please add the test file (or update the PR description to match what’s actually being delivered).
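
Since the test file is not part of the change set, its contents are unknown. One possible shape for such a check, sketched here without the CausalBench loading API (which this thread does not show), is to compare each CSV's header and row count against the figures in the PR description. The helper `check_dataset_csv` and the `EXPECTED` table are hypothetical; the row counts and column names are the ones stated in the PR.

```python
import csv

# Row counts and column names as stated in the PR description.
EXPECTED = {
    "diabetes": {
        "rows": 442,
        "columns": {"age", "sex", "bmi", "bp",
                    "s1", "s2", "s3", "s4", "s5", "s6", "target"},
    },
    "california_housing": {
        "rows": 20640,
        "columns": {"MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population",
                    "AveOccup", "Latitude", "Longitude", "MedHouseVal"},
    },
}


def check_dataset_csv(file_obj, expected_columns, expected_rows):
    """Return a list of problems; an empty list means the CSV matches."""
    reader = csv.reader(file_obj)
    header = next(reader, [])
    problems = []
    # Column order is not assumed, so the header is compared as a set.
    mismatch = set(header) ^ set(expected_columns)
    if mismatch:
        problems.append(f"columns differ: {sorted(mismatch)}")
    n_rows = sum(1 for row in reader if row)  # skip blank lines
    if n_rows != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {n_rows}")
    return problems


# Usage against the file added in this PR:
# with open("causalbench-asu/tests/data/diabetes/diabetes_data.csv", newline="") as f:
#     spec = EXPECTED["diabetes"]
#     assert check_dataset_csv(f, spec["columns"], spec["rows"]) == []
```

A check like this only covers the CSV contents; confirming that the config loads through the framework would still need the project's own loader.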

add two regression datasets

0c91ed3

Shu-Wan assigned prat-man and kapkic Feb 6, 2026

Shu-Wan marked this pull request as ready for review February 6, 2026 18:15

Copilot AI review requested due to automatic review settings February 6, 2026 18:15

Copilot started reviewing on behalf of Shu-Wan February 6, 2026 18:15

Copilot AI reviewed Feb 6, 2026

Shu-Wan wants to merge 1 commit into main from feat/cb+-regression-datasets