juaristi22 (Collaborator) commented Feb 4, 2026

Fix #477
Fix #470

Changes (data_build.py)

  1. Parallel Execution. Data builds now run in dependency-based phases:
  • Phase 1 (parallel): uprating.py, acs.py, irs_puf.py
  • Phase 2 (parallel): cps.py, puf.py
  • Phase 3: extended_cps.py
  • Phase 4 (parallel): enhanced_cps.py, create_stratified_cps.py
  • Phase 5: small_enhanced_cps.py
  2. Checkpointing System (more robust to preemptions)
  • Creates data-build-checkpoints Modal Volume for persistent storage
  • Per-branch checkpoint paths (/checkpoints/{branch}/) prevent cross-PR contamination
  • After a preemption and restart, completed scripts are skipped (restored from checkpoint)
  • Tests run module by module, with progress tracked so an interrupted test suite can resume partway through
  • Checkpoints are cleaned up after successful completion
  • New --clear-checkpoints flag to force fresh builds
  3. Duplicate Test Removal
  • Previously, local area calibration tests ran twice (once explicitly, once as part of full suite)
  • Now each test module runs exactly once
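
The phased execution and checkpoint skipping described under Parallel Execution and Checkpointing System can be sketched roughly as follows. This is an illustrative sketch, not the code in data_build.py: `run_phases`, the `run_script` callback, and the `.done` marker files are hypothetical names, and a local directory stands in for the Modal Volume. Only the phase contents are taken from the description above.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Phase layout copied from the PR description; everything else is assumed.
PHASES = [
    ["uprating.py", "acs.py", "irs_puf.py"],
    ["cps.py", "puf.py"],
    ["extended_cps.py"],
    ["enhanced_cps.py", "create_stratified_cps.py"],
    ["small_enhanced_cps.py"],
]


def run_phases(phases, run_script, checkpoint_dir):
    """Run phases in order; scripts inside one phase run concurrently.

    A script whose marker file already exists in checkpoint_dir is skipped.
    Returns the list of scripts that were actually executed.
    """
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for phase in phases:
        # Skip anything a previous (possibly preempted) run already finished.
        pending = [
            s for s in phase if not (checkpoint_dir / f"{s}.done").exists()
        ]
        if not pending:
            continue
        errors = []
        with ThreadPoolExecutor(max_workers=len(pending)) as pool:
            futures = {s: pool.submit(run_script, s) for s in pending}
            for script, future in futures.items():
                try:
                    future.result()
                except Exception as exc:
                    errors.append((script, exc))
                    continue
                # Record success so a restart does not repeat this script.
                (checkpoint_dir / f"{script}.done").touch()
                executed.append(script)
        if errors:
            # Later phases depend on this one, so stop here.
            failed = ", ".join(script for script, _ in errors)
            raise RuntimeError(f"phase failed: {failed}")
    return executed
```

Under these assumptions, a per-branch directory such as /checkpoints/{branch}/ would be passed as `checkpoint_dir`, so that markers written by one PR are never seen by another.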

Additions (create_representative_fixture.py)

  1. Smaller, Comprehensive District-Level Dataset
  • Creates a ~4k household dataset with all congressional districts (CDs) and most counties represented
  • Can be swapped into conftest.py for faster build and calibration testing
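
One way to build a small sample that still covers every district is to take one household per group first and then fill the rest at random. The sketch below illustrates that idea only; it is not taken from create_representative_fixture.py, and `sample_with_coverage`, the `cd` field, and the dictionary records are hypothetical.

```python
import random
from collections import defaultdict


def sample_with_coverage(households, key, target_size, seed=0):
    """Sample households so that every group given by key(hh) appears.

    If there are more groups than target_size, the result is larger than
    target_size, because coverage takes priority in this sketch.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for index, hh in enumerate(households):
        groups[key(hh)].append(index)

    # Guarantee coverage: one randomly chosen household per group.
    chosen = {rng.choice(members) for members in groups.values()}

    # Fill the remaining slots uniformly from households not yet chosen.
    remaining = [i for i in range(len(households)) if i not in chosen]
    extra = max(0, min(target_size - len(chosen), len(remaining)))
    chosen.update(rng.sample(remaining, extra))

    return [households[i] for i in sorted(chosen)]


# Toy data: 200 households spread over 10 hypothetical districts.
households = [{"id": i, "cd": i % 10} for i in range(200)]
fixture = sample_with_coverage(households, lambda hh: hh["cd"], 50)
```

Covering "most counties" as well would need a second pass or a combined key, which this sketch leaves out.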

juaristi22 requested review from baogorek and Copilot and removed request for Copilot February 4, 2026 16:14

Development

Successfully merging this pull request may close these issues.

  • CI workflow takes 2.5+ hours due to sequential data builds
  • Smaller districts level dataset for testing purposes