feat(smoking): Add smoking variable harmonization (CEP-002) #163
base: v3-den-132
Foundation for modular derived variable calculations:
- clean_variables(): Step 1 & 3 preprocessing/validation
- missing-data-functions.R: any_missing(), get_priority_missing()
- missing-pattern-cache.R: Pattern detection for PUMF/Master codes
- parse-range-notation.R: Range parsing for validation bounds
- worksheet-getters.R: get_variable_details() metadata access
- worksheet-loaders.R: load_worksheet_metadata()
- file-sourcing.R: source_r_robust() for dependency loading
- variable-discovery.R: Variable lookup utilities

This infrastructure supports the new 3-step pattern:
1. clean_variables() - preprocess inputs
2. Domain logic - status-based calculations
3. clean_variables() - validate output bounds
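The 3-step pattern described in this commit can be sketched roughly as follows. This is an illustrative Python sketch, not the package's R implementation: the `clean_variables` signature, the `derive_years_smoked` function, and all bounds below are assumptions made for the example, and a plain NaN stands in for R's NA handling.

```python
import math

NA = float("nan")  # stand-in for R's NA / tagged_na in this sketch


def clean_variables(values, lower, upper):
    """Replace missing values and values outside [lower, upper] with NA."""
    cleaned = []
    for v in values:
        if v is None or math.isnan(v) or not (lower <= v <= upper):
            cleaned.append(NA)
        else:
            cleaned.append(float(v))
    return cleaned


def derive_years_smoked(current_age, age_started):
    """Hypothetical derived variable following the clean / logic / clean pattern."""
    # Step 1: preprocess inputs against assumed plausibility bounds
    age = clean_variables(current_age, 12, 120)
    start = clean_variables(age_started, 5, 120)
    # Step 2: domain logic (NaN inputs propagate through the subtraction)
    raw = [a - s for a, s in zip(age, start)]
    # Step 3: validate output bounds (negative durations become NA)
    return clean_variables(raw, 0, 100)


result = derive_years_smoked([40, 30, 200], [18, 35, 20])
```

Here the first respondent yields 22 years, the second is rejected in step 3 because the start age exceeds the current age, and the third is rejected in step 1 because the reported age is out of range. Running the same cleaning function before and after the domain logic keeps the domain code free of bounds checks.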
Primary recommended variables using 3-step architecture:
- smoking-status.R: calculate_SMKDSTY_A(), calculate_SMKDSTY_cat6()
- smoke-start.R: calculate_age_start_smoking() - unified initiation age
- smoking-cessation.R: calculate_time_quit_smoking() - years since quit
- smoke-intensity.R: calculate_cigs_per_day() - routes SMK_204/SMK_208
- smoke-pack-years.R: calculate_pack_years() - cumulative exposure
- smoke-stop.R: Supporting cessation logic
- smoking-validation-constants.R: PACK_YEARS_CONSTANTS

Key design decisions:
- Single calculate_pack_years() works for both PUMF and Master
- Unified feeders (age_start_smoking, cigs_per_day, time_quit_smoking) handle PUMF vs Master routing internally
- PUMF has ~15-20% relative error due to midpoint estimation
- Era-agnostic: handles 2001-2023 variable naming variations

See ceps/cep-002-smoking/ for full specification.
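To illustrate why midpoint estimation introduces error, here is a minimal Python sketch using the conventional pack-years definition (packs per day times years smoked, with 20 cigarettes per pack). The function names, the 15-19 age band, and the respondent values are hypothetical and are not taken from the package or from CCHS; the actual calculate_pack_years() logic may differ, and this toy case is not meant to reproduce the ~15-20% figure quoted above.

```python
CIGS_PER_PACK = 20  # conventional pack size in the standard pack-years definition


def band_midpoint(lower, upper):
    """Midpoint of a grouped (banded) value, used when the exact value is unavailable."""
    return (lower + upper) / 2


def pack_years(cigs_per_day, years_smoked):
    """Conventional pack-years: packs smoked per day multiplied by years smoked."""
    return (cigs_per_day / CIGS_PER_PACK) * years_smoked


# Hypothetical respondent: age 45, 10 cigarettes per day, started smoking at 19.
current_age = 45
exact = pack_years(10, current_age - 19)

# If only an assumed 15-19 band is available, the start age is estimated as 17.
approx = pack_years(10, current_age - band_midpoint(15, 19))

relative_error = abs(approx - exact) / exact
```

In this toy case the exact value is 13 pack-years and the midpoint estimate is 14, a relative error of about 8%. Wider bands or shorter smoking durations would push the error higher.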
Worksheets for smoking variable harmonization:
- smoking_variables.csv: Variable definitions for smoking domain
- smoking_variable_details.csv: Recoding rules and mappings

Covers 5 subgroups:
- 01-status: SMKDSTY_cat6 (6-category smoking status)
- 02-initiation: age_start_smoking (age started daily)
- 03-cessation: time_quit_smoking (years since quit)
- 04-intensity: cigs_per_day (cigarettes per day)
- 05-pack-years: pack_years_der (cumulative exposure)

Supports all PUMF cycles 2001-2022.
Tests for primary recommended smoking variables:
- test-age_start_smoking.R: Initiation age routing and bounds
- test-cigs_per_day.R: Intensity routing by smoking status
- test-pack_years.R: Cumulative exposure calculation
- test-time_quit_smoking.R: Cessation timing validation

Tests verify:
- Correct routing based on SMKDSTY_A status
- Valid output ranges per variable_details.csv bounds
- Missing value handling (tagged_na vs numeric)
- Universe validation (correct NA for out-of-scope respondents)
Quarto documentation for smoking variable harmonization:

Main documents:
- cep-002-smoking.qmd: Methodology and rationale
- 00-variable-summary.qmd: Variable overview and recommendations
- derived-functions.qmd: DV function specifications

Subgroup specifications (QMD + worksheet CSVs):
- 01-status: SMKDSTY_cat6 (6-category smoking status)
- 02-initiation: age_start_smoking
- 03-cessation: time_quit_smoking
- 04-intensity: cigs_per_day
- 05-pack-years: pack_years_der

Rendered site: https://dmanuel.quarto.pub/cep-002-smoking-variables
Updates variables.csv and variable_details.csv with smoking harmonization:
- 34 existing smoking variables updated with v3 definitions
- 19 new smoking variables added
- Extends coverage to 2022-2023 Master files
- Adds unified feeder variables (age_start_smoking, cigs_per_day, etc.)

Removes separate smoking_*.csv files (now merged into main worksheets).

Variables: 360 → 379 (+19 new)
Variable details: 3468 → 3678 (+210 net, replacing 500 with 710 improved rows)
Pull request overview
This PR adds comprehensive smoking variable harmonization for CCHS cycles 2001-2023, implementing CEP-002. It introduces 19 new variables and updates 34 existing ones, extending coverage to Master files and establishing a new 3-step derived variable architecture.
Changes:
- New smoking variables across 5 subgroups (status, initiation, cessation, intensity, pack-years)
- Extended cycle coverage to PUMF 2001-2023 and Master 2001-2023
- New unified derived variable functions with standardized architecture
- Comprehensive test suite for derived variables
- Full CEP documentation in Quarto format
Reviewed changes
Copilot reviewed 38 out of 42 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/testthat/test-time_quit_smoking.R | Unit tests for cessation timing calculation (337 lines) |
| tests/testthat/test-pack_years.R | Unit tests for pack-years derivation (386 lines) |
| tests/testthat/test-cigs_per_day.R | Unit tests for cigarettes per day routing (632 lines) |
| tests/testthat/test-age_start_smoking.R | Unit tests for initiation age calculation (353 lines) |
| ceps/cep-002-smoking/derived-functions.qmd | Function reference documentation (448 lines) |
| ceps/cep-002-smoking/cep-002-smoking.qmd | Main methodology and rationale (598 lines) |
| ceps/cep-002-smoking/05-pack-years.qmd | Pack-years variable documentation (276 lines) |
| ceps/cep-002-smoking/04-intensity.qmd | Intensity variable documentation (497 lines) |
| ceps/cep-002-smoking/03-cessation.qmd | Cessation variable documentation (555 lines) |
| ceps/cep-002-smoking/02-initiation.qmd | Initiation variable documentation (808 lines) |
| ceps/cep-002-smoking/00-variable-summary.qmd | Variable summary table (266 lines) |
| CSV worksheets (multiple) | Variable definitions and recoding rules |
| R/smoking-validation-constants.R | Validation constants for smoking functions |
"REP_5A","Repetitive strain injury - walking","Repetitive strain injury - type of activity - walking","Categorical","cchs2009_2010_p, cchs2010_p, cchs2011_2012_p, cchs2012_p, cchs2013_2014_p, cchs2014_p, cchs2015_2016_p, cchs2017_2018_p, cchs2009_s, cchs2010_s, cchs2012_s","cchs2015_2016_p::INJ_020A, cchs2017_2018_p::INJ_020A, [REP_5A]","Repetitive strain injury","Health status","N/A",NA,"","2.2.0","2025-06-30","Variable metadata completed","",NA,"active",NA
"REP_5B","Repetitive strain injury - sport/physical activity","Repetitive strain injury - type of activity - sports or physical exercise","Categorical","cchs2001_p, cchs2003_p, cchs2005_p, cchs2007_2008_p, cchs2009_2010_p, cchs2010_p, cchs2011_2012_p, cchs2012_p, cchs2013_2014_p, cchs2014_p, cchs2015_2016_p, cchs2017_2018_p, cchs2009_s, cchs2010_s, cchs2012_s","cchs2001_p::REPA_4A, cchs2003_p::REPC_4A, cchs2005_p::REPE_4A, cchs2007_2008_p::REP_4A, cchs2015_2016_p::INJ_020B, cchs2017_2018_p::INJ_020B, [REP_5B]","Repetitive strain injury","Health status","N/A",NA,"","2.2.0","2025-06-30","Variable metadata completed","",NA,"active",NA
"REP_5C","Repetitive strain injury - leisure/hobby","Repetitive strain injury - type of activity - leisure or hobby","Categorical","cchs2001_p, cchs2003_p, cchs2005_p, cchs2007_2008_p, cchs2009_2010_p, cchs2010_p, cchs2011_2012_p, cchs2012_p, cchs2013_2014_p, cchs2014_p, cchs2015_2016_p, cchs2017_2018_p, cchs2009_s, cchs2010_s, cchs2012_s","cchs2001_p::REPA_4B, cchs2003_p::REPC_4B, cchs2005_p::REPE_4B, cchs2007_2008_p::REP_4B, cchs2015_2016_p::INJ_020C, cchs2017_2018_p::INJ_020C, [REP_5C]","Repetitive strain injury","Health status","N/A",NA,"","2.2.0","2025-06-30","Variable metadata completed","",NA,"active",NA
"REP_5D","Repetitive strain injury -
SMK_09C (Years since stopped smoking daily - former daily) is available as SMK_090 for 2015-2016 and 2017-2018.
Summary
Adds smoking variable harmonization for CCHS cycles 2001-2023, extending coverage to Master files and introducing unified derived variable functions using a new 3-step architecture.
Variables added (19 new)
- cigs_per_day, SMK_202, SMK_203, SMK_207
- SMK_09A, SMK_10_gate, SMK_10A_A, SMK_10A_B, SMK_10A_cont, quit_pathway
- SMKDSTY, SMKDGSTP, SMKDGSTP_cont, SMKDVSTP
- SMK_01C, SMK_040, SMK_06C, SMK_09C, SMKG09C_cont

Variables updated (34 existing)
- variableStart mappings for era-specific source variable names
- pack_years_cat, pack_years_der, time_quit_smoking, SMKDSTY_A, SMKDSTY_B, SMKDSTY_cat3, SMKDSTY_cat5, SMKG040, SMKG040_cont, SMKG203_A/B/cont, SMKG207_A/B/cont, SMK_005, SMK_01A, SMK_030, SMK_05B, SMK_05C, SMK_05D, SMK_06A_A/B/cont, SMK_09A_A/B/cont, SMK_204, SMK_208, SMKG01C_A/B/cont, SMKG06C, SMKG09C

Cycle coverage
3-step derived variable architecture
New modular pattern for derived variables:
- clean_variables() preprocesses and validates inputs

Supporting infrastructure added:
- R/clean-variables.R, R/missing-data-functions.R, R/missing-pattern-cache.R
- R/worksheet-getters.R, R/worksheet-loaders.R, R/variable-discovery.R

Tests added
- test-age_start_smoking.R
- test-cigs_per_day.R
- test-pack_years.R
- test-time_quit_smoking.R

Documentation
Full specification: CEP-002 Smoking Variables
Source QMD files included in ceps/cep-002-smoking/

Test plan
- R CMD check
- testthat::test_file("tests/testthat/test-pack_years.R")