Refactor lag significance functions and config handling #19
Conversation
Renamed internal bootstrapped significance functions for clarity, added config-driven wrappers, and improved result DataFrame construction. This refactor centralizes configuration resolution and streamlines selection logic for both statsmodels and numba engines.
Walkthrough

Refactors lag significance to a config-driven flow: adds an internal config resolver and underscored bootstrapped functions, adds public wrappers that build and resolve the config, updates select_lags to dispatch via these wrappers, changes result fields/sorting, and renames a CorrCache.delete parameter and SQL predicate from sectors to sectors_id.

Sequence Diagram(s)

sequenceDiagram
autonumber
actor User
participant SL as select_lags
participant WR as bootstrapped_from_config / fast_bootstrapped_from_config
participant RES as _resolve_lag_cfg
participant SLOW as _bootstrapped_significance
participant FAST as _fast_bootstrapped_significance
User->>SL: call select_lags(y, params)
SL->>WR: delegate based on engine (fast/slow)
WR->>RES: resolve LagConfig into concrete ints (n, block_len, max_lag, B, etc.)
RES-->>WR: resolved params
alt engine == "fast"
WR->>FAST: run fast bootstrap (uses RNG, computes p_base, p_fdr, freq, selected)
FAST-->>WR: results DataFrame (t_base, p_base, p_fdr, selected, freq, ...)
else engine == "slow"
WR->>SLOW: run slow bootstrap (resampling bootstrap, computes p-values and masks)
SLOW-->>WR: results DataFrame (t_base, p_base, p_fdr, selected, freq, ...)
end
WR-->>SL: DataFrame with resolved fields
SL->>SL: finalize ordering/sorting by ["selected","p_base","freq"]
SL-->>User: selected lags + metrics
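To make the dispatch concrete, here is a minimal sketch in the spirit of the diagram; the select_lags signature, the import path usage, and the ascending flags are assumptions for illustration, only the wrapper names and sort keys come from this PR.

```python
import pandas as pd

# Wrappers added in this PR (CandleNet/autoreg/lag_utils.py).
from CandleNet.autoreg.lag_utils import (
    bootstrapped_from_config,
    fast_bootstrapped_from_config,
)


def select_lags_sketch(y: pd.Series, engine: str = "fast") -> pd.DataFrame:
    """Illustrative only: dispatch to the config-driven wrappers, then finalize ordering."""
    if engine == "fast":
        res = fast_bootstrapped_from_config(y)  # numba-backed path
    else:
        res = bootstrapped_from_config(y)  # statsmodels-backed path
    # Final ordering as described in the walkthrough (ascending flags assumed here).
    return res.sort_values(
        by=["selected", "p_base", "freq"],
        ascending=[False, True, False],
    ).reset_index(drop=True)
```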
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
CandleNet/cache/synergy_cache.py (1)
176-186: Rename parameter to match the column it targets.

The method parameter is named sectors but the SQL query now correctly targets the sectors_id column. For clarity and consistency with other methods (insert, fetch), rename the parameter to sectors_id.

Apply this diff:

-    def delete(self, sectors: str) -> None:
+    def delete(self, sectors_id: str) -> None:
         con = self.check_con()
         query = f"""DELETE FROM {self.TABLE_NAME} WHERE sectors_id = ?;"""
-        con.execute(query, (sectors,))
+        con.execute(query, (sectors_id,))
         self._log(
             LogType.EVENT,
             OriginType.USER,
             CallerType.CACHE,
-            f"Deleted cache entry for sectors: {sectors}.",
+            f"Deleted cache entry for sectors: {sectors_id}.",
         )
🧹 Nitpick comments (3)
CandleNet/autoreg/lag_utils.py (3)
341-341: Fix ambiguous dash character in docstring.

The docstring contains an EN DASH (–) instead of a HYPHEN-MINUS (-). While this is a minor issue, it's flagged by static analysis and should be corrected for consistency.

Apply this diff:

-            - "hacBandwidth": Newey–West/HAC bandwidth or "auto".
+            - "hacBandwidth": Newey-West/HAC bandwidth or "auto".

Based on static analysis hints.
695-720: Simplify redundant bandwidth resolution.

Lines 707-711 check whether hacBandwidth is "auto" and conditionally use the resolved value, but _resolve_lag_cfg has already resolved "auto" to an integer at line 361-362. You can simplify by directly passing r["bandwidth"].

Apply this diff:

     r = _resolve_lag_cfg(params, n)
     return _bootstrapped_significance(
         y,
         max_lag=r["max_lag"],
         B=r["B"],
         block_len=r["block_len"],
-        bandwidth=(
-            params["hacBandwidth"]
-            if params["hacBandwidth"] != "auto"
-            else r["bandwidth"]
-        ),
+        bandwidth=r["bandwidth"],
         alpha=params["sigLevel"],
         use_fdr_end=(params["selectionMethod"] == "fdrAdjusted"),
         min_freq=params["stabilityFreq"] if params["requireStability"] else 0.0,
723-748: Simplify redundant bandwidth resolution.

Lines 735-739 have the same redundant check as in bootstrapped_from_config. Since _resolve_lag_cfg already resolves "auto" values, directly pass r["bandwidth"].

Apply this diff:

     r = _resolve_lag_cfg(params, n)
     return _fast_bootstrapped_significance(
         y,
         max_lag=r["max_lag"],
         B=r["B"],
         block_len=r["block_len"],
-        bandwidth=(
-            params["hacBandwidth"]
-            if params["hacBandwidth"] != "auto"
-            else r["bandwidth"]
-        ),
+        bandwidth=r["bandwidth"],
         alpha=params["sigLevel"],
         use_fdr_end=(params["selectionMethod"] == "fdrAdjusted"),
         min_freq=params["stabilityFreq"] if params["requireStability"] else 0.0,
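A quick illustration of why the conditional collapses in both wrappers (a sketch, assuming params follows the LagConfig shape used in this file):

```python
# Both branches reduce to the already-resolved value:
# - "auto": _resolve_lag_cfg substituted _auto_nw_bandwidth(n) and cast to int;
# - otherwise: it passed the user's value through int(...), so the two coincide.
r = _resolve_lag_cfg(params, n)
bw = (
    params["hacBandwidth"]
    if params["hacBandwidth"] != "auto"
    else r["bandwidth"]
)
assert int(bw) == r["bandwidth"]
```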
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
CandleNet/autoreg/lag_utils.py (5 hunks)
CandleNet/cache/synergy_cache.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
CandleNet/autoreg/lag_utils.py (4)
  CandleNet/autoreg/fastlag/select.py (2)
    _auto_nw_bandwidth (83-96)
    _is_int_like (11-15)
  CandleNet/_config.py (1)
    LagConfig (27-48)
  CandleNet/autoreg/fastlag/engine.py (1)
    _ols_hac_beta_t_vectorized (6-77)
  CandleNet/__init__.py (1)
    lag_config (10-17)
🪛 Ruff (0.13.3)
CandleNet/cache/synergy_cache.py
179-179: Possible SQL injection vector through string-based query construction
(S608)
CandleNet/autoreg/lag_utils.py
341-341: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: lint
🔇 Additional comments (3)
CandleNet/autoreg/lag_utils.py (3)
396-561: LGTM! Appropriate conversion to internal function.

The rename to _bootstrapped_significance (underscore prefix) correctly signals this is now an internal implementation detail, with the new bootstrapped_from_config wrapper providing the public interface. The sorting logic remains correct.
564-692: LGTM! Consistent refactoring to internal function.

The rename to _fast_bootstrapped_significance is appropriate. The DataFrame column ordering is consistent between the early-return and main-path results, and the sorting logic correctly uses the by= parameter.
751-845: LGTM! Cleaner separation of concerns.

The refactor to use bootstrapped_from_config and fast_bootstrapped_from_config wrappers simplifies select_lags by delegating configuration resolution to dedicated functions. This improves maintainability and reduces duplication.
Replaces _auto_block_len with _infer_block_len in lag_utils.py and updates parameter naming in CorrCache.delete for clarity. These changes improve code readability and maintain consistency in function usage and logging.
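For context, block lengths for a circular block bootstrap are commonly chosen with a simple rule of thumb; the sketch below is illustrative only and is not necessarily what _infer_block_len implements.

```python
def infer_block_len_rule_of_thumb(n: int) -> int:
    """Illustrative heuristic: block length on the order of n ** (1/3), floored at 2."""
    return max(2, int(round(n ** (1.0 / 3.0))))


print(infer_block_len_rule_of_thumb(500))  # -> 8 for a series of 500 observations
```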
Actionable comments posted: 1
🧹 Nitpick comments (1)
CandleNet/autoreg/lag_utils.py (1)
682-707: Minor code duplication with fast_bootstrapped_from_config.

Both bootstrapped_from_config and fast_bootstrapped_from_config (lines 710-735) follow nearly identical patterns:

- Get config via lag_config()
- Compute series length and RNG
- Resolve config with _resolve_lag_cfg
- Call the underscored function

Consider extracting the common setup into a helper function to reduce duplication.

Example:

def _prepare_config_and_rng(y: pd.Series) -> tuple[dict, dict, np.random.Generator]:
    """Prepare resolved config and RNG from global config and series."""
    params = lag_config()
    n = len(y)
    rand_seed = params.get("randomSeed")
    rng = np.random.default_rng(rand_seed)
    r = _resolve_lag_cfg(params, n)
    return params, r, rng

Then both wrappers could use:

def bootstrapped_from_config(y: pd.Series) -> pd.DataFrame:
    params, r, rng = _prepare_config_and_rng(y)
    return _bootstrapped_significance(
        y,
        max_lag=r["max_lag"],
        # ... rest of parameters
    )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
CandleNet/autoreg/lag_utils.py (5 hunks)
CandleNet/cache/synergy_cache.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
CandleNet/cache/synergy_cache.py (4)
  CandleNet/cache/p2_cache.py (2)
    delete (60-72)
    TABLE_NAME (89-90)
  CandleNet/cache/base_cache.py (4)
    delete (62-63)
    check_con (125-130)
    TABLE_NAME (93-94)
    _log (69-72)
  CandleNet/cache/ticker_cache.py (2)
    delete (153-163)
    TABLE_NAME (190-191)
  CandleNet/logger/logger_types.py (3)
    LogType (4-8)
    OriginType (11-13)
    CallerType (16-23)
CandleNet/autoreg/lag_utils.py (4)
  CandleNet/autoreg/fastlag/select.py (3)
    _auto_nw_bandwidth (83-96)
    _infer_block_len (99-101)
    _is_int_like (11-15)
  CandleNet/_config.py (1)
    LagConfig (27-48)
  CandleNet/autoreg/fastlag/engine.py (1)
    _ols_hac_beta_t_vectorized (6-77)
  CandleNet/__init__.py (1)
    lag_config (10-17)
🪛 Ruff (0.13.3)
CandleNet/cache/synergy_cache.py
179-179: Possible SQL injection vector through string-based query construction
(S608)
CandleNet/autoreg/lag_utils.py
328-328: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF002)
🔇 Additional comments (4)
CandleNet/cache/synergy_cache.py (1)
176-186: LGTM! Consistent parameter naming.

The rename from sectors to sectors_id aligns with the database schema (line 207) and matches the naming used in the insert (line 123) and fetch (line 144) methods. The SQL query correctly uses parameterized binding with the ? placeholder, making it safe from injection attacks.

Note: The static analysis warning (S608) about SQL injection is a false positive; the query is properly parameterized.
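A minimal demonstration of why the parameterized form is safe; sqlite3 is used here purely for illustration, and the project's actual connection object and schema may differ.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE synergy_cache (sectors_id TEXT, payload TEXT)")
con.execute("INSERT INTO synergy_cache VALUES (?, ?)", ("tech-energy", "{}"))

# The driver binds the value; it is never spliced into the SQL text.
malicious = "tech-energy'; DROP TABLE synergy_cache; --"
con.execute("DELETE FROM synergy_cache WHERE sectors_id = ?;", (malicious,))

# The table still exists and the original row is untouched, since no sectors_id
# literally equals the malicious string.
print(con.execute("SELECT COUNT(*) FROM synergy_cache").fetchone())  # (1,)
```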
CandleNet/autoreg/lag_utils.py (3)
547-548: Verify the sorting key change is intentional.

The result sorting changed from by=["selected", "freq", "top_freq"] to by=["selected", "p_base", "freq"]. This prioritizes base-sample p-values over the frequency of being the top-ranked lag, which affects how equally-stable lags are ordered in the output.

Ensure this change aligns with the intended selection logic and user expectations.
551-679: LGTM! Improved consistency between fast and slow paths.

The updates align the fast implementation with the slow version:
- Added FDR handling (lines 600-603)
- Expanded DataFrame output to include all fields (lines 661-673)
- Unified sorting logic (lines 677-679)
This ensures both engines produce compatible results.
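For reference, an FDR column like p_fdr can be produced with a standard Benjamini-Hochberg adjustment; this is a generic sketch, not the project's exact code, and the p-values below are invented.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

p_base = np.array([0.001, 0.020, 0.300, 0.040, 0.800])
reject, p_fdr, _, _ = multipletests(p_base, alpha=0.05, method="fdr_bh")
# p_fdr holds the BH-adjusted p-values; `reject` is the mask a "fdrAdjusted"
# selection method would apply at the chosen alpha.
print(p_fdr.round(3), reject)
```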
766-768: LGTM! Clean integration of config-driven wrappers.

The updated calls to bootstrapped_from_config and fast_bootstrapped_from_config simplify select_lags by delegating config resolution to the wrapper functions. This improves separation of concerns and reduces coupling.
def _resolve_lag_cfg(params: LagConfig, n: int) -> dict:
    # max_lag tested
    """
    Resolve a LagConfig mapping into concrete numeric parameters used for lag testing and bootstrapping.

    This converts potentially symbolic or "auto" entries in `params` into integer values appropriate for
    a series of length `n`, applying sensible bounds and heuristics where needed.

    Parameters:
        params (LagConfig): Configuration mapping containing keys:
            - "maxLag": maximum lag to consider (may be numeric or "auto"-like value).
            - "hacBandwidth": Newey–West/HAC bandwidth or "auto".
            - "blockLen": circular block bootstrap block length or "auto".
            - "bootstrapSamples": number of bootstrap replicates or "auto".
            - "maxLagsSelected": cap on number of selected lags or "auto".
            - "minBootstrapSamples", "minLagsSelected": minimums used when resolving "auto".
        n (int): Length of the time series; used to clamp and derive data-dependent defaults.

    Returns:
        dict: A mapping with integer-valued keys:
            - "max_lag": selected max lag (clamped to at least 1 and at most n-2).
            - "bandwidth": resolved HAC bandwidth as an int.
            - "block_len": resolved block length for CBB as an int.
            - "B": number of bootstrap replicates as an int.
            - "max_selected": maximum number of lags to retain as an int (>= 0).
    """
    max_lag = int(params["maxLag"])
    max_lag = max(1, min(max_lag, n - 2))

    # bandwidth
    bw = params["hacBandwidth"]
    if isinstance(bw, str) and bw == "auto":
        bw = _auto_nw_bandwidth(n)

    # block length
    bl = params["blockLen"]
    if isinstance(bl, str) and bl == "auto":
        bl = _infer_block_len(n)

    # bootstrap samples
    B = params["bootstrapSamples"]
    if isinstance(B, str) and B == "auto":
        # heuristic: proportional to tested lags, capped
        B = max(params["minBootstrapSamples"], min(300, 20 * max_lag))

    # max lags selected
    msel_cfg = params["maxLagsSelected"]
    if isinstance(msel_cfg, str) and msel_cfg == "auto":
        msel = max(params["minLagsSelected"], min(5, max_lag))
    else:
        msel = msel_cfg

    assert _is_int_like(msel) and msel >= 0, (
        f"Unsupported maxLagsSelected: {msel_cfg}. "
        f"Must be a non-negative integer or 'auto'."
    )

    return {
        "max_lag": max_lag,
        "bandwidth": int(bw),
        "block_len": int(bl),
        "B": int(B),
        "max_selected": int(msel),
    }
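An illustrative call, with a plain dict standing in for LagConfig and invented values; the exact bandwidth and block_len outputs depend on the _auto_nw_bandwidth and _infer_block_len heuristics.

```python
params = {
    "maxLag": 24,
    "hacBandwidth": "auto",
    "blockLen": "auto",
    "bootstrapSamples": "auto",
    "maxLagsSelected": "auto",
    "minBootstrapSamples": 200,
    "minLagsSelected": 1,
}
r = _resolve_lag_cfg(params, n=500)
# max_lag stays 24 (already within [1, n - 2]);
# B == max(200, min(300, 20 * 24)) == 300;
# max_selected == max(1, min(5, 24)) == 5;
# bandwidth and block_len come from the "auto" helpers.
```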
Fix EN DASH character in docstring.
The docstring contains an EN DASH (–) character on line 328 in "Newey–West/HAC". Use a regular hyphen-minus (-) instead for better compatibility and to satisfy linting rules.
Apply this diff:
-    params (LagConfig): Configuration mapping containing keys:
-        - "maxLag": maximum lag to consider (may be numeric or "auto"-like value).
-        - "hacBandwidth": Newey–West/HAC bandwidth or "auto".
+    params (LagConfig): Configuration mapping containing keys:
+        - "maxLag": maximum lag to consider (may be numeric or "auto"-like value).
+        - "hacBandwidth": Newey-West/HAC bandwidth or "auto".

Note: The past review comment about duplicate _auto_block_len logic appears to have been addressed; this function now calls _infer_block_len(n) at line 354 instead of duplicating the logic.
Based on past review comments.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    params (LagConfig): Configuration mapping containing keys:
        - "maxLag": maximum lag to consider (may be numeric or "auto"-like value).
        - "hacBandwidth": Newey-West/HAC bandwidth or "auto".
🧰 Tools
🪛 Ruff (0.13.3)
328-328: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF002)
🤖 Prompt for AI Agents
CandleNet/autoreg/lag_utils.py around lines 317-380: the docstring at line ~328
uses an EN DASH in "Newey–West/HAC"; replace the EN DASH (–) with a regular
hyphen-minus ("-") so it reads "Newey-West/HAC", save the file, and run the
linter to confirm the character issue is resolved.
Summary by CodeRabbit

New Features
- Config-driven wrappers (bootstrapped_from_config, fast_bootstrapped_from_config) for lag significance testing.

Bug Fixes
- CorrCache.delete parameter and SQL predicate now target the sectors_id column.

Refactor
- Internal bootstrapped significance functions renamed, configuration resolution centralized, and select_lags results sorted by ["selected", "p_base", "freq"].