Add two regression datasets: California Housing and Diabetes #39
LGTM
Pull request overview
This PR adds two sklearn-sourced regression datasets (California Housing and Diabetes) to the CausalBench test data bundle for demos/testing, along with lightweight download scripts and README documentation updates.
Changes:
- Added `california_housing` dataset config + regeneration script (and accompanying data/zip artifacts).
- Added `diabetes` dataset config + CSV + regeneration script (and accompanying zip artifact).
- Updated README dataset table; minor formatting cleanup in `zip_files.py`.
Reviewed changes
Copilot reviewed 7 out of 10 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| causalbench-asu/tests/zip_files.py | Minor formatting / quoting updates for zip utility. |
| causalbench-asu/tests/data/diabetes/download_data.py | Script to regenerate the Diabetes CSV from sklearn. |
| causalbench-asu/tests/data/diabetes/diabetes_data.csv | Added Diabetes dataset CSV. |
| causalbench-asu/tests/data/diabetes/config.yaml | Added dataset config for Diabetes. |
| causalbench-asu/tests/data/diabetes.zip | Added packaged dataset zip. |
| causalbench-asu/tests/data/california_housing/download_data.py | Script to regenerate the California Housing CSV from sklearn. |
| causalbench-asu/tests/data/california_housing/config.yaml | Added dataset config for California Housing. |
| README.md | Updated dataset list table to include the new datasets + minor whitespace cleanup. |
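For orientation, a regeneration script of the kind described for `diabetes/download_data.py` could look roughly like the sketch below. This is not the script from the PR: the function name `regenerate_diabetes_csv` is hypothetical, and it assumes scikit-learn and pandas are installed and that the default (scaled) feature values are the ones wanted.

```python
from pathlib import Path


def regenerate_diabetes_csv(out_path: Path) -> tuple[int, int]:
    """Write the sklearn Diabetes dataset to a CSV and return its (rows, columns) shape."""
    # Imported lazily so this module can be imported without scikit-learn installed.
    from sklearn.datasets import load_diabetes

    # as_frame=True yields pandas objects; .frame holds the ten features plus a "target" column.
    frame = load_diabetes(as_frame=True).frame
    frame.to_csv(out_path, index=False)
    return frame.shape
```

Calling `regenerate_diabetes_csv(Path("diabetes_data.csv"))` writes the file next to the script. A California Housing variant would presumably use `fetch_california_housing(as_frame=True)` in the same way; unlike `load_diabetes`, that call downloads the data on first use.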
```yaml
  type: continuous
  data: decimal
sex:
  header: sex
  type: continuous
  data: decimal
bmi:
  header: bmi
  type: continuous
  data: decimal
bp:
  header: bp
  type: continuous
  data: decimal
s1:
  header: s1
  type: continuous
  data: decimal
s2:
  header: s2
  type: continuous
  data: decimal
s3:
  header: s3
  type: continuous
  data: decimal
s4:
  header: s4
  type: continuous
  data: decimal
s5:
  header: s5
  type: continuous
  data: decimal
s6:
  header: s6
  type: continuous
  data: decimal
target:
  header: target
  type: continuous
```
The column type values are set to continuous, but other dataset configs in this repo use values like ratio/nominal (e.g., tests/data/panama/config.yaml). If the dataset loader only recognizes the existing enum values, continuous will fail schema validation or parsing. Please align these column type fields with the accepted values used elsewhere (e.g., use ratio for numeric continuous variables) or update the loader/schema to explicitly support continuous.
Suggested change:

```yaml
  type: ratio
  data: decimal
sex:
  header: sex
  type: ratio
  data: decimal
bmi:
  header: bmi
  type: ratio
  data: decimal
bp:
  header: bp
  type: ratio
  data: decimal
s1:
  header: s1
  type: ratio
  data: decimal
s2:
  header: s2
  type: ratio
  data: decimal
s3:
  header: s3
  type: ratio
  data: decimal
s4:
  header: s4
  type: ratio
  data: decimal
s5:
  header: s5
  type: ratio
  data: decimal
s6:
  header: s6
  type: ratio
  data: decimal
target:
  header: target
  type: ratio
```
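The concern in this comment can be checked mechanically before merging. The sketch below is only an illustration: `find_unsupported_types` and `ASSUMED_ALLOWED_TYPES` are hypothetical names, and the allowed set is an assumption (only `ratio` and `nominal` are confirmed by the existing configs cited above), so it should be replaced with whatever the CausalBench schema actually accepts.

```python
# Assumed set of accepted column types; only "ratio" and "nominal" are
# confirmed by the existing configs mentioned in the review.
ASSUMED_ALLOWED_TYPES = {"nominal", "ordinal", "interval", "ratio"}


def find_unsupported_types(columns: dict, allowed=ASSUMED_ALLOWED_TYPES) -> dict:
    """Return {column_name: type} for every column whose type is not in `allowed`."""
    return {
        name: spec.get("type")
        for name, spec in columns.items()
        if spec.get("type") not in allowed
    }


# Columns as they would look after parsing the YAML mapping into dicts.
columns = {
    "bmi": {"header": "bmi", "type": "continuous", "data": "decimal"},
    "bp": {"header": "bp", "type": "ratio", "data": "decimal"},
}

print(find_unsupported_types(columns))  # {'bmi': 'continuous'}
```

An empty result would indicate that every column uses a type from the assumed set; it says nothing about whether the real loader agrees with that set.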
```yaml
  type: continuous
  data: decimal
HouseAge:
  header: HouseAge
  type: continuous
  data: decimal
AveRooms:
  header: AveRooms
  type: continuous
  data: decimal
AveBedrms:
  header: AveBedrms
  type: continuous
  data: decimal
Population:
  header: Population
  type: continuous
  data: decimal
AveOccup:
  header: AveOccup
  type: continuous
  data: decimal
Latitude:
  header: Latitude
  type: continuous
  data: decimal
Longitude:
  header: Longitude
  type: continuous
  data: decimal
MedHouseVal:
  header: MedHouseVal
  type: continuous
```
The column type fields use continuous, which is inconsistent with existing dataset configs that use values like ratio/nominal (see tests/data/panama/config.yaml, tests/data/abalone/config.yaml). If continuous isn't a recognized schema value, this dataset config may not load. Please switch these to the supported type value (likely ratio for these numeric columns) or extend the schema/loader to accept continuous.
Suggested change:

```yaml
  type: ratio
  data: decimal
HouseAge:
  header: HouseAge
  type: ratio
  data: decimal
AveRooms:
  header: AveRooms
  type: ratio
  data: decimal
AveBedrms:
  header: AveBedrms
  type: ratio
  data: decimal
Population:
  header: Population
  type: ratio
  data: decimal
AveOccup:
  header: AveOccup
  type: ratio
  data: decimal
Latitude:
  header: Latitude
  type: ratio
  data: decimal
Longitude:
  header: Longitude
  type: ratio
  data: decimal
MedHouseVal:
  header: MedHouseVal
  type: ratio
```
```markdown
## Support Benchmark Context
CausalBench is structured to support different machine learning tasks and dataset types. With user contribution, the supported context will be expanded, currently (as of 8/12/25), these models and tasks are provided.

| Dataset               | File                 | Description                                                                                                                                           |
|-----------------------|----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
| Abalone               | data, static graph   |                                                                                                                                                       |
| Adult                 | data, static graph   |                                                                                                                                                       |
| Sachs                 | data, static graph   |                                                                                                                                                       |
| California Housing    | data                 | Regression dataset from sklearn with 20,640 samples predicting median house values in California districts                                            |
| Diabetes              | data                 | Regression dataset from sklearn with 442 samples predicting disease progression from physiological variables                                          |
| NetSim                | data, static graph   | Brain FMRI scan<br/> - 28 simulations <br/> - Each has different DGPs, num of nodes (5, 50), num of observations (50 to 5000), 1400 datasets in total |
```
PR description mentions a new test_regression_datasets.py test script, but it doesn't appear to be included in this change set (and isn't present under causalbench-asu/tests/). If automated coverage for loading these datasets is intended, please add the test file (or update the PR description to match what’s actually being delivered).
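If the missing test is added, a minimal version might only confirm that each regenerated CSV has the sample count advertised in the README table. The sketch below uses the standard library only and is not the `test_regression_datasets.py` referred to in the PR description; `csv_shape` and `check_dataset` are hypothetical helper names, and it assumes the target is the last column, as in the configs under review.

```python
import csv
from pathlib import Path


def csv_shape(path: Path) -> tuple[int, list[str]]:
    """Return (number of data rows, header names) for a CSV file with a header row."""
    with path.open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        row_count = sum(1 for _ in reader)
    return row_count, header


def check_dataset(path: Path, expected_rows: int, expected_target: str) -> None:
    """Assert that the CSV has the expected sample count and ends with the target column."""
    rows, header = csv_shape(path)
    assert rows == expected_rows, f"{path.name}: expected {expected_rows} rows, found {rows}"
    assert header[-1] == expected_target, f"{path.name}: last column is {header[-1]!r}"
```

For the Diabetes data this would be called as `check_dataset(Path("causalbench-asu/tests/data/diabetes/diabetes_data.csv"), 442, "target")`; the California Housing check would use 20,640 rows and `MedHouseVal`. A fuller test would also load each dataset through the CausalBench framework itself.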
Summary
This PR adds two classic regression datasets from sklearn to CausalBench for demo and testing purposes.
Datasets Added
1. California Housing Dataset
2. Diabetes Dataset
Changes
Dataset Files
Each dataset includes:
- `config.yaml` configuration file following CausalBench schema
- `download_data.py` script to regenerate data from sklearn
Deliverables
- `.zip` files for each dataset under `causalbench-asu/tests/data/`
Design Decisions
Testing
All datasets successfully load through the CausalBench framework: