@jpolsgrove

Add COVID-RED Dataset, Detection/Prediction Tasks, and Example

Summary

This PR adds support for the COVID-RED (Remote Early Detection of SARS-CoV-2 infections) dataset to PyHealth, including:

  • A wearable device dataset loader (COVIDREDDataset)
  • Classification task functions (covidred_detection_fn, covidred_prediction_fn)
  • A runnable usage example (covidred_example.py)

This provides a clinically relevant wearable device dataset for PyHealth users and supports reproducible research in early infectious disease detection using consumer wearables.

Features

1. COVIDREDDataset

  • Loads wearable device data (heart rate, steps, sleep) from the COVID-RED study
  • Returns a unified time-series format consistent with PyHealth signal datasets
  • Supports multiple data splits: split="train" | "test" | "all"
  • Configurable sliding-window length via the window_days parameter
  • Two task modes:
    • Detection: Classify COVID-19 positive vs. negative during the illness period
    • Prediction: Predict COVID-19 onset 1-14 days before symptoms appear (pre-symptomatic detection)
  • Automatic train/test split with reproducible random seed
  • Feature extraction from multivariate time series:
    • Resting heart rate statistics (mean, std, min, max)
    • Activity metrics (total steps, mean hourly steps)
    • Sleep metrics (duration, efficiency)
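The windowed feature extraction listed above could be sketched roughly as follows. This is a minimal, dependency-free illustration and not the code in this PR: the function name `window_features`, its arguments, and the feature keys are hypothetical, and "mean hourly steps" is assumed here to mean total steps divided by the number of hours in the window.

```python
from statistics import mean, pstdev


def window_features(resting_hr, steps, sleep_hours, sleep_eff, window_days=7):
    """Return one feature dict per full sliding window of `window_days` days.

    All inputs are equal-length sequences with one value per day.
    (Hypothetical helper; names and feature keys are assumptions.)
    """
    n_days = len(resting_hr)
    if not (len(steps) == len(sleep_hours) == len(sleep_eff) == n_days):
        raise ValueError("all daily series must have the same length")

    features = []
    # Slide the window one day at a time; skip incomplete trailing windows.
    for start in range(n_days - window_days + 1):
        end = start + window_days
        hr = resting_hr[start:end]
        st = steps[start:end]
        features.append({
            "hr_mean": mean(hr),
            "hr_std": pstdev(hr),
            "hr_min": min(hr),
            "hr_max": max(hr),
            "steps_total": sum(st),
            # Assumption: hourly mean = window total / (days * 24 hours).
            "steps_mean_hourly": sum(st) / (window_days * 24),
            "sleep_duration_mean": mean(sleep_hours[start:end]),
            "sleep_efficiency_mean": mean(sleep_eff[start:end]),
        })
    return features
```

With 8 days of data and window_days=7, this yields two overlapping windows.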

2. Task Functions

covidred_detection_fn
Maps dataset samples into the PyHealth task format for COVID-19 detection:

{
    "patient_id": str,
    "visit_id": str,
    "signal": Tensor(n_features × window_days),
    "label": int(0 or 1),
    "metadata": dict
}
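To make the schema above concrete, here is a small sketch of a helper that assembles and validates one sample in that shape. The name `make_task_sample` and the visit_id naming scheme are hypothetical, not part of this PR. The signal is kept as nested lists (n_features rows by window_days columns) so the sketch has no dependencies; the actual loader is described as returning a Tensor.

```python
def make_task_sample(patient_id, window_index, signal, label, metadata=None):
    """Build one task-format sample dict (hypothetical helper).

    `signal` is a list of n_features rows, each with window_days values.
    """
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    if not signal or any(len(row) != len(signal[0]) for row in signal):
        raise ValueError("signal must be a non-empty rectangular 2D list")

    return {
        "patient_id": str(patient_id),
        # Assumption: one "visit" per window, named after patient and index.
        "visit_id": f"{patient_id}_w{window_index}",
        "signal": [list(row) for row in signal],
        "label": int(label),
        "metadata": dict(metadata or {}),
    }
```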

covidred_prediction_fn
Maps dataset samples for early COVID-19 prediction (pre-symptomatic detection):

  • Identifies patterns 1-14 days before symptom onset
  • Critical for early intervention and transmission reduction
  • Same output format as detection task
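One plausible way to label windows for the prediction task is to mark a window positive when symptom onset falls 1-14 days after the window ends. The sketch below assumes exactly that rule; `presymptomatic_label` and its parameters are hypothetical names, and the PR's actual labeling logic may differ.

```python
from datetime import date


def presymptomatic_label(window_end, symptom_onset, min_lead=1, max_lead=14):
    """Return 1 if onset is min_lead..max_lead days after the window ends.

    `symptom_onset` is None for participants with no recorded onset.
    (Hypothetical rule; the lead-time bounds mirror the 1-14 day range.)
    """
    if symptom_onset is None:
        return 0
    lead_days = (symptom_onset - window_end).days
    return int(min_lead <= lead_days <= max_lead)
```

For example, a window ending on 2021-01-01 with onset on 2021-01-08 has a 7-day lead and is labeled positive, while an onset 15 days out is labeled negative.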

covidred_multiclass_fn (optional extension)
Extends to multiclass severity classification:

  • 0: COVID-19 negative
  • 1: Mild (recovered at home, no assistance)
  • 2: Moderate (recovered at home with assistance)
  • 3: Severe (hospitalized)
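The four severity classes above could be encoded as follows. This is only a sketch: the `Severity` enum, the `severity_label` function, and the boolean inputs are hypothetical, and how the real dataset records assistance or hospitalization is not specified here.

```python
from enum import IntEnum


class Severity(IntEnum):
    NEGATIVE = 0   # COVID-19 negative
    MILD = 1       # recovered at home, no assistance
    MODERATE = 2   # recovered at home with assistance
    SEVERE = 3     # hospitalized


def severity_label(covid_positive, needed_assistance=False, hospitalized=False):
    """Map (hypothetical) outcome flags to a severity class."""
    if not covid_positive:
        return Severity.NEGATIVE
    if hospitalized:
        return Severity.SEVERE
    if needed_assistance:
        return Severity.MODERATE
    return Severity.MILD
```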

3. Example Script

  • Demonstrates complete pipeline: loading → task definition → LSTM classifier training
  • Implements bidirectional LSTM with attention to temporal patterns
  • Includes proper evaluation metrics (accuracy, precision, recall, F1, AUC)
  • Handles class imbalance with weighted loss
  • Saves best model based on F1-score
  • Serves as a minimal reproducible example for users
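As an illustration of the weighted-loss idea, class weights are often derived from inverse class frequency. The sketch below shows one common formula (n_samples / (n_classes * class_count)); whether the example script uses this exact weighting is an assumption, and `inverse_frequency_weights` is a hypothetical name.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class: weight} so that rarer classes get larger weights.

    Uses n_samples / (n_classes * class_count), a common "balanced" scheme.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}
```

With 90 negative and 10 positive samples, the positive class is weighted 9 times more heavily than the negative class. The resulting values could then be passed, in class order, as the weight argument of a weighted loss.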

Dataset Details

Dataset: COVID-RED - Remote Early Detection of SARS-CoV-2 infections
Source: Utrecht University, Netherlands
DOI: 10.34894/FW9PO7
URL: https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/FW9PO7

Data characteristics:

  • Wearable device measurements (Fitbit, Apple Watch, Garmin)
  • Daily aggregated metrics: heart rate, steps, sleep
  • COVID-19 test results and symptom onset dates
  • Longitudinal data across pandemic period
  • Focus on pre-symptomatic and asymptomatic detection

Clinical significance:

  • Early detection 1-14 days before symptom onset
  • Enables early intervention and isolation
  • Reduces community transmission
  • Demonstrates utility of consumer wearables for public health surveillance

Tests

Basic verification performed:

  • Dataset loads correctly from CSV files
  • Train/test split works as expected (70/30 split)
  • Both detection and prediction task functions output PyHealth-compliant dictionaries
  • Example script runs end-to-end (CPU/GPU tested)
  • Feature extraction handles missing values appropriately
  • Label distribution matches expected class imbalance
  • LSTM model architecture validated with sample data

Note on Dataset Download

The COVID-RED dataset must be manually downloaded from DataverseNL.

Users must:

  1. Visit: https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/FW9PO7
  2. Download required files:
    • heart_rate.csv - Daily resting heart rate measurements
    • steps.csv - Daily step counts
    • sleep.csv - Daily sleep duration and efficiency
    • labels.csv - COVID-19 test results and symptom dates
  3. Place files in a directory (e.g., /data/covidred/)
  4. Initialize dataset:
    from pyhealth.datasets import COVIDREDDataset
    dataset = COVIDREDDataset(root="/data/covidred/", split="train", task="prediction")

Usage Example

from pyhealth.datasets import COVIDREDDataset
from pyhealth.tasks import covidred_prediction_fn
from torch.utils.data import DataLoader

# Load dataset for early COVID-19 prediction
dataset = COVIDREDDataset(
    root="/path/to/covidred",
    split="train",
    window_days=7,
    task="prediction"
)

# Apply task function
samples = [covidred_prediction_fn(dataset[i]) for i in range(len(dataset))]

# Create dataloader
dataloader = DataLoader(samples, batch_size=32, shuffle=True)

# Train your model
for batch in dataloader:
    signals = batch["signal"]  # Shape: (batch_size, n_features, window_days)
    labels = batch["label"]    # Shape: (batch_size,)
    # ... training code

Files Changed

This PR adds three new files to PyHealth:

  1. pyhealth/datasets/covidred.py - Dataset loader class
  2. pyhealth/tasks/covidred.py - Task functions for COVID-19 detection/prediction
  3. examples/covidred_example.py - Complete usage example with LSTM classifier

Citation

If you use this dataset implementation, please cite the original COVID-RED study:

@data{FW9PO7_2021,
  author = {Olthof, A.W. and Schut, A. and van Beijnum, B.F. and others},
  publisher = {DataverseNL},
  title = {{Remote Early Detection of SARS-CoV-2 infections (COVID-RED)}},
  year = {2021},
  version = {V1},
  doi = {10.34894/FW9PO7},
  url = {https://doi.org/10.34894/FW9PO7}
}
