Repository to experiment with the NeuralForecast library for predicting glucose levels.
🎉 GlucoseML Training Support: Train on multiple subjects with automatic episode segmentation! Load GlucoseML parquet datasets directly with --glucoseml flag. Handles discontinuous time series by automatically splitting into episodes. See docs/GLUCOSEML_TRAINING.md for details.
🎉 GlucoseML Dataset Integration: Automated pipeline to download, preprocess, and integrate 5 open-access glucose monitoring datasets (BIG IDEAS, Shanghai T1DM/T2DM, UCHTT1DM, CGMacros) from the GlucoseML repository. See docs/GLUCOSEML_USAGE.md for details.
- Versioned training runs: Each training run gets a unique ID and isolated directory structure
- Configuration persistence: Training configuration automatically saved with each run
- Run management: List and select from multiple training runs for inference
- Exogenous variables support: All 24 default models support exogenous variables (fast insulin, long insulin, carbs, flow_amount)
- Dual training mode: Each model trained twice - univariate and with exogenous variables (48 configurations total)
- YAML configuration: Define training parameters in a YAML file for reproducibility
- Iterative training: Train models one-by-one with progress tracking
- Crash resilience: If one model fails, training continues with the next
- Incremental results: Metrics and plots saved after each model completes
- Automatic visualization: Generate prediction plots for each model
- Comprehensive metrics: MAE, MSE, RMSE, MAPE calculated for each model
- Individual model saving: Each model saved separately for easy deployment
- Detailed logging: Step numbers, progress tracking, and error logs
- Wide default selection: 24 diverse models with exogenous support trained by default (48 total configurations)
- Structured logging: Using Eliot for detailed execution tracking
- Model loading: Load and use any trained model for predictions
- Auto-detect exogenous: Automatically loads exogenous variables when using models trained with the _exog suffix
- Cherry-pick mode: Automatically select best or random sequences for evaluation
- Multi-model comparison: Compare predictions from multiple models simultaneously
- Comparison plots: Generate professional visualizations comparing all models
- Flexible prediction: Predict on specific sequences or entire datasets
- Batch processing: Run inference with multiple models in parallel
- Model selection: Choose specific models or use all trained models
- Dual backends: Choose between matplotlib (legacy) or plotly (enhanced, default)
- Interactive plots: Plotly plots support zooming, panning, and hover tooltips
- All time points visible: Display every time point on x-axis without skipping
- Vertical labels: Space-efficient vertical tick labels prevent overlapping
- Dual output formats: Generate both interactive HTML and static PNG files
- Model comparison: Compare multiple models in a single interactive plot
- Interactive dashboards: Multi-sequence dashboards with subplots
- Customizable: Control tick angles, plot dimensions, and styling
Install dependencies:
uv sync
Windows Users: This project uses Ray (via neuralforecast), which has specific Windows considerations. See docs/WINDOWS_SETUP.md for detailed Windows setup instructions, including:
- Ray configuration for Windows
- Environment variables setup
- Common issues and solutions
- WSL2 setup (recommended for best performance)
Quick Windows Setup (Python 3.12):
# If you encounter Ray issues, install Windows-specific wheel first
pip install https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp312-cp312-win_amd64.whl
# Set Ray temporary directory to avoid long path issues
$env:RAY_TMPDIR = "$HOME\ray_tmp"
$env:RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER = "1"
# Install remaining dependencies
uv sync
For more details, see the Windows Setup Guide.
Train on open-access datasets with a single command:
# Download and preprocess all 5 datasets to parquet
uv run glucoseml pipeline --datasets all
# Train on the integrated data
uv run train --data-file data/input/glucoseml/BIG_IDEAS.parquet --max-steps 2000
For detailed documentation, see docs/GLUCOSEML_USAGE.md.
Available datasets:
- BIG IDEAs (PhysioNet): 16 subjects, Dexcom, 5-min sampling
- ShanghaiT1DM/T2DM (Figshare): Type 1/2 diabetes, FreeStyle Libre, 15-min sampling
- UCHTT1DM (GitHub): Type 1 diabetes, mixed sensors, 5-min sampling
- CGMacros (PhysioNet): Dexcom/Libre with macronutrients, 5/15-min sampling
All datasets include exogenous covariates (insulin, carbs, heart rate, etc.) where available.
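To get a feel for the integrated data, you can peek at one of the preprocessed parquet files. A minimal sketch with pandas; the exact column names are an assumption based on the NeuralForecast conventions used elsewhere in this README, so inspect the file to confirm:
# Peek at a preprocessed GlucoseML parquet file (column names assumed)
import pandas as pd

df = pd.read_parquet("data/input/glucoseml/BIG_IDEAS.parquet")
print(df.columns.tolist())        # e.g. unique_id, ds, y, plus covariates
print(df["unique_id"].nunique())  # number of subjects/series
print(df.head())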
- Train models (automatically creates a versioned run):
uv run train
- Or train on GlucoseML datasets (multiple subjects):
uv run train train --glucoseml
- Run inference (automatically uses latest training run):
uv run predict --cherry-pick
- List all training runs:
uv run list-runs
- Use a specific training run (optional):
uv run predict --run-id run_20241021_130000 --cherry-pick
Each training session creates a unique run with its own directory:
- Structure: data/output/runs/<run_id>/
- Run ID: Automatically generated timestamp (e.g., run_20241021_130601) or custom via --run-id
- Contents: models/, config.yaml, metrics.csv, cv_results_*.csv
List all training runs:
uv run list-runs
Output example:
📊 Available Training Runs (2):
Run ID Models Horizon Steps Status
================================================================================
run_20241021_143000 12 12 1000 📋 config 🤖 models
run_20241021_120000 8 12 500 📋 config 🤖 models
💡 Use --run-id to specify a run for inference
Example: predict --run-id run_20241021_143000
To see all available models (models with 🔗 support exogenous variables):
uv run list-models
Train on multiple subjects from GlucoseML parquet files with automatic episode segmentation:
# Train on all GlucoseML datasets
uv run train train --glucoseml
# Train on specific datasets
uv run train train --glucoseml --datasets 'BIG_IDEAS,ShanghaiT1DM'
# Custom episode parameters
uv run train train --glucoseml --max-gap-minutes 90 --min-episode-length 72
# With specific models
uv run train train --glucoseml --models 'NHITS,LSTM,TFT' --max-steps 2000
Key Features:
- ✅ Auto episode detection: Splits discontinuous data by time gaps
- ✅ Multiple subjects: Handles 100+ subjects from 5 datasets
- ✅ Configurable gaps: Set max gap threshold (default: 60 min)
- ✅ Episode filtering: Removes too-short episodes
- ✅ Schema alignment: Aligns different dataset formats
- ✅ Exogenous support: Includes carbs, insulin, glucose_rate
Parameters:
- --glucoseml: Enable GlucoseML data loading
- --datasets: Comma-separated dataset names (optional)
- --max-gap-minutes: Time gap threshold (default: 60)
- --min-episode-length: Min points per episode (default: 48)
📖 Complete guide: docs/GLUCOSEML_TRAINING.md
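Conceptually, the automatic episode detection configured by the parameters above works like the following sketch (illustrative only, not the repo's actual implementation; it assumes a NeuralForecast-style DataFrame with unique_id and ds columns):
# Split each subject's series wherever consecutive samples are more than
# max_gap_minutes apart, then drop episodes shorter than min_episode_length.
import pandas as pd

def split_into_episodes(df: pd.DataFrame,
                        max_gap_minutes: int = 60,
                        min_episode_length: int = 48) -> pd.DataFrame:
    episodes = []
    for uid, g in df.sort_values("ds").groupby("unique_id"):
        # True where the gap to the previous sample exceeds the threshold
        new_episode = g["ds"].diff() > pd.Timedelta(minutes=max_gap_minutes)
        for ep_id, chunk in g.groupby(new_episode.cumsum()):
            if len(chunk) >= min_episode_length:  # drop too-short episodes
                episodes.append(chunk.assign(unique_id=f"{uid}_ep{ep_id}"))
    return pd.concat(episodes, ignore_index=True)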
You can warm-start univariate training from a previously trained model (fine-tuning). This is useful for pretraining on GlucoseML data and then adapting to your Livia CSV data.
- Warm start works only for univariate runs (no exogenous). Exogenous variants are intentionally not warm-started due to covariate mismatches across datasets.
- The pretrained model's h (horizon) and input_size must match the current run.
- You can point to a prior run by ID or directly to a model directory.
Step 1: Pretrain on GlucoseML (example with CGMacros)
# Optional: download and preprocess datasets
uv run glucoseml pipeline --datasets CGMacros
# Train on CGMacros with chosen models and hyperparameters
uv run train --glucoseml --datasets CGMacros \
--models "NHITS,NBEATSx,TimeXer,MLP,DilatedRNN" \
--horizon 12 --input-size 48 --max-steps 2000
# Note the printed run ID, e.g., run_20251027_192814
Step 2: Fine-tune on Livia CSV using that run as initialization
uv run train --data-file data/input/livia_glucose.csv \
--models "NHITS,NBEATSx,TimeXer,MLP,DilatedRNN" \
--horizon 12 --input-size 48 \
--init-run-id run_20251027_192814
Alternative: Provide a direct path instead of a run ID
# You can pass a run root (the tool resolves models/<model_name>)
uv run train --data-file data/input/livia_glucose.csv \
--models "NHITS,NBEATSx,TimeXer,MLP,DilatedRNN" \
--horizon 12 --input-size 48 \
--init-model /home/antonkulaga/sources/glucose-neuralforecast/data/output/runs/run_20251029_025206
# Or pass a specific model directory (for a single model)
uv run train --data-file data/input/livia_glucose.csv \
--models "NHITS" \
--horizon 12 --input-size 48 \
--init-model /home/antonkulaga/sources/glucose-neuralforecast/data/output/runs/run_20251029_025206/models/NHITS
Behavior details:
- Per-model resolution: for each model in the current run, the trainer searches the specified run/model directory and warm-starts only if a compatible checkpoint is found. Otherwise that model trains from scratch.
- Exogenous variants (e.g., NHITS_exog) are ignored for warm start by design; only base univariate models attempt warm start.
- If the model class or hyperparameters do not match, warm start is skipped automatically (see the sketch below).
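A hypothetical helper sketching the compatibility rule above (this is not the repo's actual code, and the checkpoint-config keys are assumptions):
# Warm-start only base univariate models whose class, h, and input_size
# match the current run; anything else trains from scratch.
def is_warm_start_compatible(ckpt_cfg: dict, model_name: str,
                             horizon: int, input_size: int) -> bool:
    return (
        not model_name.endswith("_exog")              # univariate models only
        and ckpt_cfg.get("model") == model_name       # same model class
        and ckpt_cfg.get("h") == horizon              # horizons must match
        and ckpt_cfg.get("input_size") == input_size  # lookback must match
    )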
Generate a default configuration file:
uv run generate-config --output train_config.yaml
Edit the generated train_config.yaml file to customize training parameters, then train:
uv run train-from-config --config train_config.yaml
Example train_config.yaml:
run_id: null # Auto-generated timestamp if null (e.g., run_20241021_130601)
data_file: null # Uses default data/input/livia_glucose.csv
output_dir: null # Uses default data/output
horizon: 12
input_size: 48
max_steps: 1000
models:
- NBEATS
- NHITS
- LSTM
- GRU
- TCN
n_windows: 3
test_size: null
log_file: null
plotting_backend: plotly # 'plotly' (default) or 'matplotlib'
The project now includes enhanced plotly-based visualization with better tick display and interactivity.
Key Features:
- ✅ All time points visible on x-axis (no skipping)
- ✅ Vertical labels to save space
- ✅ Interactive zooming and panning
- ✅ Hover tooltips with exact values
- ✅ Both HTML (interactive) and PNG (static) outputs
Using plotly backend (default):
# Train with plotly visualization
uv run train --plotting-backend plotly --models "LSTM,NHITS"
# Or set in config file
plotting_backend: plotly
Customization in Python:
from glucose_neuralforecast.plotting_plotly import plot_predictions_plotly
plot_predictions_plotly(
df=df,
cv_df=predictions,
model_name="LSTM",
output_path=output_dir,
show_all_ticks=True, # Show every time point
tickangle=-90, # Vertical labels (-90°)
height=600,
width=1400,
)
For detailed documentation, see docs/PLOTLY_VISUALIZATION.md.
Train the 24 default models (automatically creates a versioned run):
uv run train
Train with a custom run ID:
uv run train --run-id my_experiment_v1
Train specific models:
uv run train --models "NBEATS,NHITS,LSTM" --max-steps 2000
This will train models iteratively, showing progress and saving results after each model completes.
Each training run is saved in data/output/runs/<run_id>/ with its own config.yaml.
Note: Models that support exogenous variables will be automatically trained twice:
- Once as univariate (glucose only), e.g., LSTM
- Once with exogenous variables (insulin, carbs, flow_amount), e.g., LSTM_exog
Train specific models:
uv run train --models "NBEATS,NHITS,LSTM,GRU,MLP"
MLP-based models:
- NBEATS - Neural Basis Expansion Analysis
- NBEATSx - NBEATS with exogenous variables
- NHITS - Neural Hierarchical Interpolation for Time Series
- MLP - Multi-Layer Perceptron
- MLPMultivariate - Multivariate MLP
RNN-based models:
- LSTM - Long Short-Term Memory
- GRU - Gated Recurrent Unit
- RNN - Recurrent Neural Network
- DilatedRNN - Dilated Recurrent Neural Network
CNN-based models:
- TCN - Temporal Convolutional Network
- BiTCN - Bidirectional Temporal Convolutional Network
Linear models:
- DLinear - Decomposition Linear
- NLinear - Normalization Linear
Transformer-based models:
- VanillaTransformer - Standard Transformer
- Informer - Beyond Efficient Transformer for Long Sequence Time-Series Forecasting
- Autoformer - Autoformer with Auto-Correlation
- FEDformer - Frequency Enhanced Decomposed Transformer
- PatchTST - Patch Time Series Transformer
- iTransformer - Inverted Transformer
Specialized models:
- TFT - Temporal Fusion Transformer
- DeepAR - Deep Autoregressive
- DeepNPTS - Deep Non-Parametric Time Series
- TiDE - Time-series Dense Encoder
- HINT - Hierarchical Interpolation Network for Time series
GNN and advanced models:
- StemGNN - Spectral Temporal Graph Neural Network
- SOFTS - Series-cOre Fused Time Series forecaster
Recent/advanced architectures:
- TimesNet - TimesNet with Period Detection
- TimeLLM - Time Series with Large Language Models
- TimeMixer - Time Series Mixing
- TimeXer - Transformer for forecasting with exogenous variables
- TSMixer - Time Series Mixer
- TSMixerx - TSMixer with exogenous variables
KAN models:
- KAN - Kolmogorov-Arnold Networks
- RMoK - Reversible Mixture of KANs
train \
--data-file data/input/livia_glucose.csv \
--output-dir data/output \
--horizon 12 \
--input-size 48 \
--max-steps 1000 \
--models "NBEATS,NHITS,LSTM" \
--n-windows 3 \
--test-size 100
Example with transformer models:
train --models "PatchTST,iTransformer,Autoformer" --max-steps 500
Example with multiple model types:
train --models "NBEATS,LSTM,TCN,DLinear,TFT" --horizon 24 --input-size 96
- --data-file, -d: Path to glucose CSV file (default: data/input/livia_glucose.csv)
- --output-dir, -o: Directory to save outputs (default: data/output)
- --horizon, -h: Forecast horizon in time steps (default: 12)
- --input-size, -i: Number of historical time steps to use (default: 48)
- --max-steps, -s: Maximum training steps per model (default: 1000)
- --models, -m: Comma-separated list of models to train
- --n-windows, -n: Number of cross-validation windows (default: 3)
- --test-size, -t: Size of test set (overrides n-windows if provided)
- --log-file, -l: Path to log file (default: data/output/training.log)
After training, the following files are generated:
data/output/
├── models/ # Individual model directories
│ ├── NBEATS/
│ │ ├── configuration.pkl
│ │ ├── dataset.pkl
│ │ └── NBEATS_0.ckpt
│ ├── NHITS/
│ ├── LSTM/
│ └── ... (one directory per trained model)
├── plots/ # Prediction visualizations (using utilsforecast)
│ ├── NBEATS/
│ │ ├── sequence_0.png
│ │ ├── sequence_1.png
│ │ └── sequence_2.png
│ ├── NHITS/
│ └── ... (plots for each model)
├── metrics.csv # Model performance metrics (updated after each model)
├── cv_results_NBEATS.csv # Cross-validation results per model
├── cv_results_NHITS.csv
├── ... (one CV file per model)
├── error_ModelName.txt # Error logs for failed models (if any)
├── training_summary.txt # Final summary report
└── training.log # Structured eliot logs
Results are saved after each model completes:
- ✅ If one model fails, you still have results from successful models
- ✅ Metrics CSV is updated incrementally
- ✅ Each model's CV results and plots are saved immediately
- ✅ Progress is visible with step numbers (Step 3/12, etc.)
The following metrics are calculated for each model:
- MAE (Mean Absolute Error): Average absolute difference between predictions and actual values
- MSE (Mean Squared Error): Average squared difference between predictions and actual values
- RMSE (Root Mean Squared Error): Square root of MSE
- MAPE (Mean Absolute Percentage Error): Average absolute percentage error
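As a quick worked example of these four metrics on toy values (plain NumPy; the project may compute them via its own evaluation code):
import numpy as np

y_true = np.array([110.0, 115.0, 120.0, 118.0])  # actual glucose (mg/dL)
y_pred = np.array([112.0, 113.0, 124.0, 117.0])  # model predictions

mae = np.mean(np.abs(y_true - y_pred))                    # 2.25
mse = np.mean((y_true - y_pred) ** 2)                     # 6.25
rmse = np.sqrt(mse)                                       # 2.5
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # ~1.93%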
Input CSV should contain:
- sequence_id: Identifier for time series sequences
- Timestamp (YYYY-MM-DDThh:mm:ss): Timestamp column
- Event Type: Type of event (filtered to 'EGV' for glucose values)
- Glucose Value (mg/dL): Target glucose measurements
- Fast-Acting Insulin Value (u): Fast-acting insulin doses (optional, for exogenous models)
- Long-Acting Insulin Value (u): Long-acting insulin doses (optional, for exogenous models)
- Carb Value (grams): Carbohydrate intake (optional, for exogenous models)
- flow_amount: Flow amount (optional, for exogenous models)
The data is automatically converted to NeuralForecast format:
Univariate format (base models):
- unique_id: Sequence identifier
- ds: Datetime column
- y: Target values (glucose levels)
Multivariate format (exogenous models):
- unique_id: Sequence identifier
- ds: Datetime column
- y: Target values (glucose levels)
- fast_insulin: Fast-acting insulin (filled with 0 for missing values)
- long_insulin: Long-acting insulin (filled with 0 for missing values)
- carbs: Carbohydrate intake (filled with 0 for missing values)
- flow_amount: Flow amount (filled with 0 for missing values)
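As a rough illustration of this conversion (a minimal sketch; the column names come from the input schema above, but the repo's actual loader may differ in details):
# Convert the raw CSV to NeuralForecast's unique_id/ds/y format
import pandas as pd

raw = pd.read_csv("data/input/livia_glucose.csv")
egv = raw[raw["Event Type"] == "EGV"]  # keep glucose readings only

df = pd.DataFrame({
    "unique_id": egv["sequence_id"],
    "ds": pd.to_datetime(egv["Timestamp (YYYY-MM-DDThh:mm:ss)"]),
    "y": egv["Glucose Value (mg/dL)"],
})
# For exogenous models, covariates are added with missing values set to 0:
df["carbs"] = egv["Carb Value (grams)"].fillna(0)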
The training uses NeuralForecast's cross-validation functionality:
- Data is automatically split into training and test sets
- Models are evaluated on multiple windows for robust performance assessment
- Step size is set to the forecast horizon for non-overlapping predictions
- Each model's predictions are saved for detailed analysis
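A minimal sketch of this setup against the public NeuralForecast API (the model choice, freq, and training parameters are illustrative; df is a unique_id/ds/y frame like the one built in the conversion sketch above):
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS

horizon = 12
nf = NeuralForecast(
    models=[NHITS(h=horizon, input_size=48, max_steps=1000)],
    freq="5min",
)
# step_size equal to the horizon yields non-overlapping prediction windows
cv_df = nf.cross_validation(df=df, n_windows=3, step_size=horizon)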
After training models, you can use them for prediction and comparison.
See which models have been trained:
uv run list-trained
Automatically select the sequence with the best MAE and compare all trained models:
uv run predict --run-id run_20241021_130601 --cherry-pick
This will:
- Load models from the specified training run
- Find the sequence (unique_id) with the best average MAE across all models
- Run inference with all trained models on that sequence
- Save predictions to CSV files
- Generate a comparison plot showing all model predictions
Select a random sequence for comparison:
uv run predict --run-id run_20241021_130601 --cherry-pick --random
Predict on a specific sequence by unique_id:
uv run predict --run-id run_20241021_130601 --unique-id "3"
Use only specific models for prediction:
uv run predict --run-id run_20241021_130601 --cherry-pick --models "NBEATS,NHITS,LSTM"
Skip plot generation (useful for large-scale predictions):
uv run predict --run-id run_20241021_130601 --cherry-pick --no-plot
Generate the comparison plot but skip saving prediction CSVs:
uv run predict --run-id run_20241021_130601 --cherry-pick --no-save-predictions
- --data-file, -d: Path to glucose CSV file (default: data/input/livia_glucose.csv)
- --output-dir, -o: Base output directory (default: data/output)
- --run-id, -r: Training run ID to use (e.g., run_20241021_130601); if not provided, uses legacy mode
- --models, -m: Comma-separated list of models to use for inference
- --cherry-pick, -c: Enable cherry-pick mode to select a specific sequence
- --best/--random: In cherry-pick mode, select the best-MAE sequence or a random one
- --unique-id, -u: Specific unique_id to predict (overrides cherry-pick)
- --seed, -s: Random seed for reproducibility with random selection
- --save-predictions/--no-save-predictions: Save predictions to CSV (default: True)
- --plot/--no-plot: Generate comparison plot (default: True)
Inference generates the following outputs within the run directory:
data/output/runs/<run_id>/
├── config.yaml # Training configuration
├── models/ # Trained model checkpoints
├── metrics.csv # Evaluation metrics
├── cv_results_*.csv # Cross-validation results
├── predictions/ # Prediction CSV files
│ ├── predictions_NBEATS.csv
│ ├── predictions_NHITS.csv
│ ├── predictions_LSTM.csv
│ └── ... (one file per model)
└── plots/
└── comparison/ # Model comparison plots
└── comparison_<unique_id>.png
Example 1: Quick comparison of best sequence from specific run
# List available training runs
uv run list-runs
# Find best sequence and compare all models from a specific run
uv run predict --run-id run_20241021_130601 --cherry-pick
Example 2: Compare specific models on best sequence
# Compare only NBEATS, LSTM, and TCN on best sequence
uv run predict --run-id run_20241021_130601 --cherry-pick --models "NBEATS,LSTM,TCN"
Example 3: Reproducible random selection
# Randomly select sequence with fixed seed
uv run predict --run-id run_20241021_130601 --cherry-pick --random --seed 42
Example 4: Predict on specific patient data
# Predict on specific unique_id
uv run predict --run-id run_20241021_130601 --unique-id "3" --models "NHITS,LSTM,GRU"
Example 5: Batch predictions without plots
# Generate predictions for all sequences without plotting (legacy mode)
uv run predict --no-plot
Cherry-pick mode helps you quickly evaluate and compare model performance:
- Best mode (default): Selects the sequence with the lowest average MAE across all models
  - Useful for seeing how models perform on "easy" sequences
  - Good for validating that models learned meaningful patterns
- Random mode: Randomly selects a sequence
  - Useful for unbiased evaluation
  - Can reveal how models handle different types of patterns
  - Use --seed for reproducibility
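For illustration, best-mode selection could be implemented like this (a hypothetical sketch, not the repo's actual cherry_pick_sequence; it assumes metrics.csv carries per-sequence rows with unique_id and mae columns):
# Pick the sequence with the lowest MAE averaged across all models
import pandas as pd

metrics = pd.read_csv("data/output/runs/run_20241021_130601/metrics.csv")
best_id = metrics.groupby("unique_id")["mae"].mean().idxmin()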
The comparison plots generated by inference show:
- Historical data: Previous glucose values (up to 500 points)
- Predictions: Forecast from each selected model
- Multiple models: All models overlaid for easy comparison
- Time series continuity: Seamless transition from history to forecast
Plots use the utilsforecast.plotting.plot_series function for professional visualization.
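A minimal sketch of such a call (plot_series and the arguments shown are part of the real utilsforecast API, but the variable names are placeholders for data prepared as in the example below):
from utilsforecast.plotting import plot_series

fig = plot_series(
    df,                        # history in unique_id/ds/y format
    forecasts_df=predictions,  # one column per model's forecast
    ids=[selected_id],         # plot only the chosen sequence
    max_insample_length=500,   # cap on historical points shown
)
fig.savefig("comparison.png")  # matplotlib figure by default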
See examples/inference_example.py for a complete example of using the inference module programmatically:
from glucose_neuralforecast.inference import (
get_available_trained_models,
cherry_pick_sequence,
run_inference,
plot_model_comparison,
)
# Get available models
available_models = get_available_trained_models(models_dir)
# Cherry-pick best sequence
selected_id = cherry_pick_sequence(metrics_path, best=True)
# Run inference
predictions = run_inference(df_filtered, models_to_use, models_dir)
# Generate comparison plot
plot_model_comparison(df_pandas, predictions, selected_id, plots_dir)
Run the example:
cd examples
python inference_example.py
After making changes to dependencies:
uv sync
Install development dependencies:
uv sync --group dev