v3.0.0 infrastructure with CCHS master file harmonization - Yulric Version #143
Draft: yulric wants to merge 66 commits into favicon-fixes from feature-v3.0.0-validation-infrastructure-yulric
5f24362 to 852bcc9
852bcc9 to 50b6800
- Add schema validation system with cross-platform compatibility
- Add CSV standardization tools for git collaboration
- Add metadata schemas for variables and variable_details
- Foundation for v2.2.0 enhancements

- Add 28 new variables with full metadata
- Enhance 91 existing variables with _i cycle database support
- Add systematic version tracking for all variables
- Maintain backward compatibility

- Add 3 new functions: DemPoRT_ICES_code.R, adl_score_6.R, missing-data-helpers.R
- Major enhancement to smoking.R (1547 changes): improved _i cycle support
- Substantial updates to bmi.R (509 changes): enhanced database compatibility
- Significant improvements to adl.R (264 changes): expanded functionality
- Enhanced alcohol.R (193 changes): better cycle support
- Updated utility functions for v2.2.0 compatibility

- Add test-csv-helpers.R for CSV standardization validation
- Add test-yaml-validation.R for schema testing
- Add test-dependency-helpers.R for dependency analysis
- Add test-missing-data-helpers.R for missing data handling
- Enhance helper-utils.R with v2.2.0 testing infrastructure
- Add CHANGELOG_v2.2.0.md documenting all enhancements

- Update DESCRIPTION to version 2.2.0 with current date
- Add yaml and readr dependencies for validation infrastructure
- Remove DemPoRT_ICES_code.R (not needed for this release)
- Package ready for comprehensive testing and validation

- Change title from "Recodeflow Schema Validation System" to "Schema Validation"
- Update @name from "recodeflow_schema_validation" to "schema_validation"
- Generalize description for broader applicability
- Bug fixes for required field extraction as noted in session status

- Update variable_details.csv with 3,577 comprehensive entries
- Update variables.csv with enhanced metadata tracking
- Add version tracking, harmonization status, and review notes
- Implement structured metadata framework for v2.2.0

- Update BMI functions (bmi_fun, adjusted_bmi_fun) with v2.2.0 @note metadata
- Update ADL functions (adl_fun, adl_score_5_fun, adl_score_6_fun) with versioning
- Update alcohol functions (ALCDTTM, binge_drinker_fun, low_drink_score_fun, ALCDTYP_A) with metadata
- Update smoking functions (SMKDSTY_fun, time_quit_smoking_fun, smoke_simple_fun, pack_years_fun, pack_years_fun_cat) with versioning
- All 14 functions include machine-readable @note format: v2.2.0, last updated: 2025-06-30, status: active

- Update schema validation with improved required field extraction
- Enhance templates.yaml with comprehensive versioning framework
- Add metadata validation utilities for function versioning
- Improve error handling and validation consistency

- Update @note metadata in all 14 versioned functions to v3.0.0
- Rename CHANGELOG_v2.2.0.md to CHANGELOG_v3.0.0.md
- Update schema files with v3.0.0 versioning
- Reflect major version due to breaking changes (_s deprecation, function modernization)

- Create modern tidyverse development vignette with v3.0.0 patterns
- Document copy-paste functionality across scalar, vector, and rec_with_table() contexts
- Include complex case_when patterns with missing data handling examples
- Add comprehensive input validation and data checking framework
- Provide complete documentation standards with transformation warnings
- Establish function versioning system with structured @note metadata
- Based on smoking function modernization as reference implementation
…ntions
- Update BMI functions (bmi_fun, adjusted_bmi_fun, bmi_fun_cat) to standard roxygen2 patterns
- Update alcohol binge_drinker_fun documentation following community standards
- Remove custom formatting (bold headings, non-standard sections)
- Add mandatory rec_with_table() examples as primary usage pattern
- Standardize @return documentation with itemized missing data handling
- Convert transformation warnings to plain text @details sections
- Preserve legacy functions in backup files for validation
- Document identified issues for team discussion

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

…ace issue
- Update low_drink_short_fun and low_drink_long_fun to R/Tidyverse standards
- Remove custom formatting and add mandatory rec_with_table() examples
- Fix critical namespace issue: tagged_na() → haven::tagged_na() in physical activity functions
- Standardize @return documentation with itemized missing data handling
- Add comprehensive @examples, @seealso, and @references sections

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Mark physical activity namespace issue as completed
- Mark function organization strategy as completed
- Add documentation standardization completion status
- Update priority tracking for remaining items

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Rename BMI functions: bmi_fun → calculate_bmi, adjusted_bmi_fun → adjust_bmi, bmi_fun_cat → categorize_bmi
- Rename ADL functions: adl_fun → assess_adl, adl_score_5_fun → score_adl, adl_score_6_fun → score_adl_6
- Rename alcohol functions: binge_drinker_fun → assess_binge_drinking, low_drink_short_fun → assess_drinking_risk_short, low_drink_long_fun → assess_drinking_risk_long
- Rename physical activity: energy_exp_fun → calculate_energy_expenditure
- Update all internal function references and @seealso links
- Update development guide with naming standards and migration mapping
- Follow verb-first naming pattern: calculate_*, assess_*, categorize_*, score_*

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Update variable_details.csv with all new function names:
  • Func::adl_fun → Func::assess_adl
  • Func::adl_score_5_fun → Func::score_adl
  • Func::adl_score_6_fun → Func::score_adl_6
  • Func::binge_drinker_fun → Func::assess_binge_drinking
  • Func::low_drink_short_fun → Func::assess_drinking_risk_short
  • Func::low_drink_long_fun → Func::assess_drinking_risk_long
  • Func::energy_exp_fun → Func::calculate_energy_expenditure
  • Func::bmi_fun → Func::calculate_bmi
  • Func::adjusted_bmi_fun → Func::adjust_bmi
  • Func::bmi_fun_cat → Func::categorize_bmi
- Rename test files: test-bmi-enhanced.R → test-calculate-bmi.R, test-adl-enhanced.R → test-assess-adl.R
- Update all function calls in test files to use new naming conventions
- Maintain consistency across metadata, functions, and tests

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
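
The `Func::` renames listed above amount to a lookup applied to each cell that references a function. The following is a minimal Python sketch of that idea, not the package's R code: the mapping is copied from the commit message, while `migrate_func_ref` and the regular expression are hypothetical names introduced here for illustration.

```python
import re

# Old-to-new function names, copied from the commit message above.
RENAMES = {
    "adl_fun": "assess_adl",
    "adl_score_5_fun": "score_adl",
    "adl_score_6_fun": "score_adl_6",
    "binge_drinker_fun": "assess_binge_drinking",
    "low_drink_short_fun": "assess_drinking_risk_short",
    "low_drink_long_fun": "assess_drinking_risk_long",
    "energy_exp_fun": "calculate_energy_expenditure",
    "bmi_fun": "calculate_bmi",
    "adjusted_bmi_fun": "adjust_bmi",
    "bmi_fun_cat": "categorize_bmi",
}

# Capture the whole identifier after "Func::" so that bmi_fun never
# rewrites the prefix of bmi_fun_cat.
FUNC_REF = re.compile(r"Func::(\w+)")


def migrate_func_ref(cell: str) -> str:
    """Rewrite Func::old_name references; unknown names are left untouched."""
    return FUNC_REF.sub(
        lambda m: "Func::" + RENAMES.get(m.group(1), m.group(1)), cell
    )
```

Matching the full identifier, rather than doing plain substring replacement, matters because several old names share a prefix.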
- Create deprecated aliases for all 11 renamed functions
- Include comprehensive deprecation warnings with migration guidance
- Functions will be removed in v4.0.0
- Maintains full backward compatibility during v3.x series
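
The alias pattern described in this commit can be sketched as a thin wrapper that warns and then delegates. This is an illustrative Python sketch only: the package itself is written in R, `deprecated_alias` is a hypothetical helper, and the body and argument order of `calculate_bmi` below are stand-ins rather than the package's actual signature.

```python
import functools
import warnings


def deprecated_alias(new_func, old_name, removal_version="v4.0.0"):
    """Return a wrapper that warns, then forwards every call to new_func."""

    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name}() is deprecated and will be removed in "
            f"{removal_version}; use {new_func.__name__}() instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not the wrapper
        )
        return new_func(*args, **kwargs)

    wrapper.__name__ = old_name
    return wrapper


def calculate_bmi(height_m, weight_kg):
    """Stand-in body (assumption): weight in kg over height in m, squared."""
    return weight_kg / height_m ** 2


# The old name keeps working but emits migration guidance on every call.
bmi_fun = deprecated_alias(calculate_bmi, "bmi_fun")
```

Because the alias only forwards its arguments, results stay identical to the new function for as long as the alias exists.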
- Remove incorrect source() calls from enhanced test files
- Enhanced functions loaded via devtools::load_all()
- Allows tests to run in proper package environment

- Regenerate NAMESPACE with new function exports
- Update all function documentation with new names
- Add documentation for new modernized functions
- Remove documentation for old energy_exp_fun
- Include generated vignette HTML

- Add @note version metadata to calculate_energy_expenditure function
- Add version validation tests to enhanced test files following development guide
- Update variable_details.csv with v3.0.0 versioning for renamed functions
- Include semantic versioning validation in tests
- Test @note metadata format compliance (vX.Y.Z, YYYY-MM-DD, status)
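
A format check of the kind described above (vX.Y.Z, YYYY-MM-DD, status) could look like the following Python sketch. It assumes the layout quoted in an earlier commit message, `v2.2.0, last updated: 2025-06-30, status: active`; the function name `parse_note` and the exact pattern are illustrative assumptions, not the tests that ship with the package.

```python
import re
from datetime import date

# Assumed layout: "vX.Y.Z, last updated: YYYY-MM-DD, status: <word>"
NOTE_PATTERN = re.compile(
    r"v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+), "
    r"last updated: (?P<updated>\d{4}-\d{2}-\d{2}), "
    r"status: (?P<status>\w+)"
)


def parse_note(note):
    """Return the parsed fields, or None when the note is malformed."""
    match = NOTE_PATTERN.fullmatch(note.strip())
    if match is None:
        return None
    try:
        # Reject strings that look like dates but are not real calendar days.
        updated = date.fromisoformat(match["updated"])
    except ValueError:
        return None
    version = (int(match["major"]), int(match["minor"]), int(match["patch"]))
    return {"version": version, "updated": updated, "status": match["status"]}
```

Returning the version as a tuple of integers makes semantic-version comparisons such as `(3, 0, 0) > (2, 2, 0)` straightforward.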
- Add smoking constants analysis to ISSUES_TO_DISCUSS.md with Holford/Manuel research findings
- Create comprehensive integration test suite for all CCHS cycles (2001-2018)
- Enhance function documentation with scalar/vector/rec_with_table() examples
- Fix schema validation documentation issue preventing R CMD check
- Test all modernized functions: calculate_bmi, assess_adl, assess_binge_drinking, calculate_energy_expenditure
- Integration tests verify rec_with_table() compatibility across all available cycles
- Document baseline R CMD check status (4 NOTEs, 0 ERRORs for new functions)

- Update _pkgdown.yml with modernized function names (calculate_bmi, assess_adl, etc.)
- Add dedicated sections for deprecated aliases and internal helper functions
- Mark 41 internal schema validation and smoking helper functions with @keywords internal
- Successfully build complete pkgdown reference documentation
- Organize functions into logical groups: derived variables, deprecated aliases, internal helpers

All v3.0.0 modernized functions now properly documented in website with enhanced examples for scalar, vector, and rec_with_table() usage patterns.

- Add missing space in 'expenditure. A MET' for proper formatting
- Completes medium priority documentation cleanup
This commit adds the files in /Extra, originally created by Maikol Diasparra (https://github.com/mdiaspar) and merged into the cchsflow-temp repo via PR#3: Big-Life-Lab/cchsflow-temp#3. The files contain a detailed review and analysis of smoking-derived variables for the planned Canadian Smoking History Generator model. This includes:
• Analyses of CCHS Master data from 2001 to 2004.
• Updates to smoking-derived variables for CCHS 2024.
• Methodological notes to support model development.

Co-authored-by: Maikol Diasparra <mdiaspar@users.noreply.github.com>
- Restored corrupted BMI entries from backup files
- Fixed deprecated function references: DerivedVar::HWTDBMI_der_cat4 → DerivedVar::[HWTDBMI_der]
- Corrected unit inconsistencies: kg/m3-9 → kg/m2
- Set HWTDBMI_der_cat4 status to discontinued (replaced by range-based approach)
- Added new HWTGBMI_cat4 variable with proper BMI range notation:
* Category 1: [0,18.5) Underweight
* Category 2: [18.5,25) Normal weight
* Category 3: [25,30) Overweight
* Category 4: [30,Inf) Obese
- Updated both variable_details.csv and .RData for consistency
- All BMI categorical tests passing (25/25)
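
The half-open ranges listed above for HWTGBMI_cat4 map naturally onto a sorted-boundary lookup. The Python sketch below illustrates that interval logic only; the boundaries and labels come from the commit message, while the function name and the use of `None` for invalid input are assumptions (the package itself is R and handles missing data with tagged NA values).

```python
import math
from bisect import bisect_right

# Lower bounds of categories 2, 3 and 4, from the ranges in the commit message.
BOUNDARIES = [18.5, 25.0, 30.0]
LABELS = {1: "Underweight", 2: "Normal weight", 3: "Overweight", 4: "Obese"}


def categorize_bmi_value(bmi):
    """Map a BMI in kg/m2 to category 1-4; None stands in for missing input."""
    if bmi is None or math.isnan(bmi) or bmi < 0:
        return None
    # bisect_right puts a value equal to a boundary into the upper interval,
    # which matches the [lower, upper) notation: 18.5 belongs to category 2.
    return bisect_right(BOUNDARIES, bmi) + 1
```

Using `bisect_right` rather than `bisect_left` is what makes each interval closed on the left and open on the right.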
Deletes unused binary files to clean up the repository and reduce unnecessary clutter. These files are no longer needed as part of the application's data management strategy.
- Replace .GlobalEnv caching with local constants in age_started_*_core()
- Remove fragile source() calls and metadata loading
- Use hardcoded midpoints as temporary solution
- TODO: Move constants to variable_details.csv in next iteration

Functions tested and working correctly.

- Remove dangerous .GlobalEnv caching from calculate_time_quit_core()
- Remove dangerous .GlobalEnv caching from pack_years_cat_core()
- Replace fragile source() calls with local constants
- Use literature-based constants (CVD Risk Tool, Manuel et al. 2018)
- All functions tested and working correctly

Result: Zero GlobalEnv pollution, no external file dependencies
TODO: Move constants to variable_details.csv in future iteration

… style)
- Update calculate_age_started_daily_current() to use clean_variables()
- Update calculate_age_started_daily_former() to use clean_variables()
- Standardize on single_digit_missing pattern for simplicity
- Follow BMI architecture pattern for consistency
- Functions tested and working correctly

Note: These functions may become redundant after CSV-driven modernization, as happened when the BMI cat functions were removed, but they are needed as scaffolding for now.

This commit addresses several key improvements:
- Corrected the categorical-to-continuous mappings for SMKG09C and SMK_09A_B in variable_details.csv and variable_details.RData. This fixes a critical bug where legacy business logic was lost during previous modernization efforts.
- Updated catLabel and catLabelLong fields for these smoking variables to be more informative.
- Refactored R/bmi.R to move hard-coded correction coefficients into the adjust_bmi function signature, improving modularity and testability.
- Added new sections on 'Recoding Spectrum' and 'The Role of the Helper' to vignettes/derived_variables_development.qmd to document our architectural decisions.

Total Additions: 46 rows

Variables Created:
1. SMK_09A_B_cont - Time since stopped smoking daily (former daily smokers)
2. SMKG09C_cont - Years since stopped smoking daily (former daily smokers)
3. SMKG203_A_cont - Age started smoking daily (current daily smokers)
4. SMKG207_A_cont - Age started smoking daily (former daily smokers)

Mapping Types Implemented:

Categorical → Continuous Mappings (27 rows):
- SMK_09A_B_cont: 1→0.5, 2→1.5, 3→2.5 years
- SMKG09C_cont: 1→4, 2→8, 3→12 years
- SMKG203_A_cont: 1→8, 2→13, 3→16, 4→18.5, 5→22, 6→27, 7→32, 8→37, 9→42, 10→47 years
- SMKG207_A_cont: Same age mappings as SMKG203_A_cont

Continuous → Continuous Copy Operations (7 rows):
- cchs2022_i: SPU_25I → smoking variables (proper cont-to-cont mapping)
- Other databases: Range validation copies with [0,80] valid range

NA Value Mappings (12 rows):
- NA::a (not applicable): recStart values like 6, 996
- NA::b (missing): recStart values like [7,9], [997,999], else
- Unique handling with _7 and _e suffixes for multiple rules per NA category

🗃️ Database Coverage:
- cchs2003_p through cchs2023_i: comprehensive CCHS cycle coverage
- Public (p), Shared (s), ICES (i): all database types supported
- Special handling for cchs2022_i: uses SPU_25I continuous source

✨ Key Features Implemented:
- Smart dummyVariable naming with recStart identifiers (_7, _e)
- Range validation entries for all continuous variables [0,80]
- Consistent variableStartShortLabel system (stpd_cat, stpdy_cont, etc.)
- Clean notes field with special characters removed
- Zero duplicates: all variable+database+source combinations unique

🎨 DummyVariable Patterns Created:
- Categorical: SMK_09A_B_cont_05, SMKG09C_cont_4
- Copy cchs2022_i: SMK_09A_B_copy_cont_cchs2022_i
- Copy others: SMKG203_A_cont_copy
- NA mappings: SMK_09A_B_cont_NAb_cchs2003_p_7, SMKG09C_cont_NAa_cchs2022_i
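
The categorical-to-continuous rows described above behave like a lookup with an explicit fallback for missing-data codes. The Python sketch below illustrates this for SMK_09A_B_cont using the midpoints from the commit message. Treating code 6 as NA::a and every other unmapped code as NA::b follows the examples given there ("like 6", "[7,9] ... else"), but the exact code lists per variable and database are an assumption, and the function name is hypothetical.

```python
# Category code -> midpoint in years, from the SMK_09A_B_cont row above.
SMK_09A_B_MIDPOINTS = {1: 0.5, 2: 1.5, 3: 2.5}

NOT_APPLICABLE = "NA::a"
MISSING = "NA::b"


def recode_smk_09a_b(code):
    """Recode one categorical response to a continuous value or an NA tag."""
    if code in SMK_09A_B_MIDPOINTS:
        return SMK_09A_B_MIDPOINTS[code]
    if code == 6:  # assumption: 6 is the "not applicable" code for this item
        return NOT_APPLICABLE
    # [7,9] and the catch-all "else" rule both resolve to NA::b.
    return MISSING
```

Keeping the mapping in a plain data structure mirrors the stated goal of moving such constants out of function bodies and into variable_details.csv.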
Smoking Status Function Reorganization:
- Add comprehensive documentation and examples

Test Suite Added:
- Add 6 new test functions covering all SMKDSTY_A categories (1-6)
- Test missing data handling (tagged_na patterns)
- Test vector input processing and edge cases
- Add CCHS codebook validation tests
- Add legacy compatibility tests with detailed descriptions

Bug Fix - Legacy Compatibility:
- Fix condition order for "Never smoked" classification
- SMK_005=3 & SMK_01A=2 → category 6 (regardless of SMK_030)
- All smoking status tests pass (139 total test assertions)
- Maintains 100% legacy compatibility for smoking history generator models

- Smoking status functions (SMKDSTY_A, SMKDSTY_B, SMKDSTY_cat5, SMKDSTY_cat3) are complete
- All 148 tests passing
- Enhanced roxygen examples for all smoking status functions with rec_with_table() workflows
- Added missing data and edge case examples showing CCHS code handling
- Fixed smoke_simple boundary condition for 5-year threshold
- Updated assessment documentation and working guide with comprehensive examples

- Work in progress: adding smoking initiation to smoking.R
- Clean variable_details.csv for these variables. More cleaning needed.

Updated function is working. All tests pass.

Update recFrom and recTo. recFrom usually doesn't have a defined range. recTo is defined from the Smoking History Generator models.

- Add regex constraints for recEnd field validation (prevents issues like "5+" in categorical data)
- Document proper recStart N/A usage guidelines for derived variables
- Add CCHS-specific data consistency requirements
- Update variable_details.csv schema compliance
- Establish standardized formatting rules for categorical values
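
To illustrate what a regex constraint on recEnd might look like, here is a deliberately narrow Python sketch. The commit message only states that values like "5+" should be rejected for categorical data; the pattern below (an integer category code or an NA::a / NA::b tag) is an assumption made for this example and is not the schema's actual rule, which is likely to accept more forms.

```python
import re

# Hypothetical constraint: a bare integer category code or an NA tag.
CATEGORICAL_REC_END = re.compile(r"\d+|NA::[ab]")


def is_valid_categorical_rec_end(value: str) -> bool:
    """True when the whole value is an integer code or NA::a / NA::b."""
    return CATEGORICAL_REC_END.fullmatch(value) is not None
```

Using `fullmatch` rather than `match` is the important detail: `match` would accept "5+" because the leading "5" satisfies the pattern.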
- New validate_csv_comprehensive() function for structured validation checks
- R CMD check style output with clear pass/fail status indicators
- Three-layer validation system (basic, verbose, full investigation)
- Complete usage examples and integration documentation
- Helper functions and dependencies for team workflows
- Ready for development team adoption and testing

Enables teams to validate variable_details.csv and variables.csv files with consistent, reliable feedback for data quality assurance.

I believe this was a typo that Doug had pointed out previously and asked me to look into. The 2015-16 ADL_01 variable was reformatted to a 4-category variable in these survey cycles, so including those cycles in these rows would be incorrect. Additional rows had been added to variable_details to harmonize them back to a two-category variable, as well as rows to code them as their original 4-category variable. There also should not be 2015-16 or 2017-18 master cycles for these variables, as the module wasn't mandatory and therefore was not collected in Ontario for those cycles.
50b6800 to fbac1db
This PR is a copy of #137. The purpose of this PR is to gradually merge the work done in the original PR into dev in a way that is more easily reviewable. This will be done by:
This should hopefully make it easier to review the changes done in the original branch.
Information:
TODO: