Conversation

@GongJr0
Owner

@GongJr0 GongJr0 commented Oct 4, 2025

Renamed internal bootstrapped significance functions for clarity, added config-driven wrappers, and improved result DataFrame construction. This refactor centralizes configuration resolution and streamlines selection logic for both statsmodels and numba engines.

Summary by CodeRabbit

  • New Features

    • Added configuration-driven bootstrap analysis with standard and fast modes and new result fields (e.g., base statistics, FDR-adjusted p-values, selection flags).
    • Public wrappers to run bootstrap flows from config.
  • Bug Fixes

    • Fixed cache deletion to target the correct sector identifier, preventing stale entries.
  • Refactor

    • Unified lag selection to use the config-driven flow and updated result sorting to prioritize base p-values and frequency for clearer interpretation.

@GongJr0 GongJr0 self-assigned this Oct 4, 2025
@GongJr0 GongJr0 added this to CandleNet Oct 4, 2025
@GongJr0 GongJr0 added the enhancement New feature or request label Oct 4, 2025
@GongJr0 GongJr0 moved this to In review in CandleNet Oct 4, 2025
Contributor

coderabbitai bot commented Oct 4, 2025

Walkthrough

Refactors lag significance to a config-driven flow: adds internal config resolver and underscored bootstrapped functions, adds public wrappers that build/resume config, updates select_lags to dispatch via these wrappers, changes result fields/sorting, and renames a CorrCache.delete parameter and SQL predicate from sectors to sectors_id.

Changes

Cohort / File(s) — Summary

Autoreg lag utils refactor — CandleNet/autoreg/lag_utils.py
Added _resolve_lag_cfg(params: LagConfig, n: int) and _infer_block_len(n: int) helpers; renamed bootstrapped_significance → _bootstrapped_significance and fast_bootstrapped_significance → _fast_bootstrapped_significance; added public wrappers bootstrapped_from_config(y: pd.Series) and fast_bootstrapped_from_config(y: pd.Series) that build params, resolve the config, seed the RNG, and call the underscored implementations; updated select_lags to dispatch to these wrappers; changed the result DataFrame columns (e.g., added t_base, p_fdr, selected) and the final sorting keys to ["selected", "p_base", "freq"].

Cache delete clause change — CandleNet/cache/synergy_cache.py
In CorrCache.delete, the parameter was renamed from sectors to sectors_id and the SQL WHERE clause updated from sectors = ? to sectors_id = ?; the logging message was updated accordingly.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant SL as select_lags
  participant WR as bootstrapped_from_config / fast_bootstrapped_from_config
  participant RES as _resolve_lag_cfg
  participant SLOW as _bootstrapped_significance
  participant FAST as _fast_bootstrapped_significance

  User->>SL: call select_lags(y, params)
  SL->>WR: delegate based on engine (fast/slow)
  WR->>RES: resolve LagConfig into concrete ints (n, block_len, max_lag, B, etc.)
  RES-->>WR: resolved params
  alt engine == "fast"
    WR->>FAST: run fast bootstrap (uses RNG, computes p_base, p_fdr, freq, selected)
    FAST-->>WR: results DataFrame (t_base, p_base, p_fdr, selected, freq, ...)
  else engine == "slow"
    WR->>SLOW: run slow bootstrap (resampling bootstrap, computes p-values and masks)
    SLOW-->>WR: results DataFrame (t_base, p_base, p_fdr, selected, freq, ...)
  end
  WR-->>SL: DataFrame with resolved fields
  SL->>SL: finalize ordering/sorting by ["selected","p_base","freq"]
  SL-->>User: selected lags + metrics
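The dispatch in the diagram can be sketched roughly as follows. This is a minimal illustration only: the stub result frames and the `engine` keyword are placeholders, and the ascending directions of the final sort are an assumption, not taken from the PR.

```python
import pandas as pd

# Stubs standing in for the underscored implementations described above.
def _bootstrapped_significance(y: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({"lag": [1], "p_base": [0.01], "selected": [True], "freq": [0.9]})

def _fast_bootstrapped_significance(y: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({"lag": [1], "p_base": [0.02], "selected": [True], "freq": [0.8]})

def select_lags(y: pd.Series, engine: str = "slow") -> pd.DataFrame:
    """Dispatch to an engine, then apply the final ordering from the walkthrough."""
    run = _fast_bootstrapped_significance if engine == "fast" else _bootstrapped_significance
    out = run(y)
    return out.sort_values(
        by=["selected", "p_base", "freq"],
        ascending=[False, True, False],  # assumed directions: best lags first
    ).reset_index(drop=True)
```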

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

I hop through configs, seeds in paw,
Resolve the lags with careful law.
Fast or slow the tests align,
P-values sorted, selected fine.
Corr keys shifted — tidy and spry, carrot-coded by my eye. 🥕🐇

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 50.00%, which is below the required 80.00% threshold. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title clearly and concisely summarizes the primary refactor of lag significance functions and the associated config handling improvements, which aligns directly with the changeset's objectives. It highlights the two main components of the update without extraneous detail and is sufficiently specific for team members to understand the scope at a glance.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch FeatureGen

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
CandleNet/cache/synergy_cache.py (1)

176-186: Rename parameter to match the column it targets.

The method parameter is named sectors but the SQL query now correctly targets the sectors_id column. For clarity and consistency with other methods (insert, fetch), rename the parameter to sectors_id.

Apply this diff:

-    def delete(self, sectors: str) -> None:
+    def delete(self, sectors_id: str) -> None:
         con = self.check_con()
 
         query = f"""DELETE FROM {self.TABLE_NAME} WHERE sectors_id = ?;"""
-        con.execute(query, (sectors,))
+        con.execute(query, (sectors_id,))
         self._log(
             LogType.EVENT,
             OriginType.USER,
             CallerType.CACHE,
-            f"Deleted cache entry for sectors: {sectors}.",
+            f"Deleted cache entry for sectors: {sectors_id}.",
         )
🧹 Nitpick comments (3)
CandleNet/autoreg/lag_utils.py (3)

341-341: Fix ambiguous dash character in docstring.

The docstring contains an EN DASH (–) instead of a HYPHEN-MINUS (-). While this is a minor issue, it's flagged by static analysis and should be corrected for consistency.

Apply this diff:

-            - "hacBandwidth": Newey–West/HAC bandwidth or "auto".
+            - "hacBandwidth": Newey-West/HAC bandwidth or "auto".

Based on static analysis hints.


695-720: Simplify redundant bandwidth resolution.

Lines 707-711 check whether hacBandwidth is "auto" and conditionally use the resolved value, but _resolve_lag_cfg has already resolved "auto" to an integer at lines 361-362. You can simplify by passing r["bandwidth"] directly.

Apply this diff:

     r = _resolve_lag_cfg(params, n)
     return _bootstrapped_significance(
         y,
         max_lag=r["max_lag"],
         B=r["B"],
         block_len=r["block_len"],
-        bandwidth=(
-            params["hacBandwidth"]
-            if params["hacBandwidth"] != "auto"
-            else r["bandwidth"]
-        ),
+        bandwidth=r["bandwidth"],
         alpha=params["sigLevel"],
         use_fdr_end=(params["selectionMethod"] == "fdrAdjusted"),
         min_freq=params["stabilityFreq"] if params["requireStability"] else 0.0,

723-748: Simplify redundant bandwidth resolution.

Lines 735-739 have the same redundant check as in bootstrapped_from_config. Since _resolve_lag_cfg already resolves "auto" values, directly pass r["bandwidth"].

Apply this diff:

     r = _resolve_lag_cfg(params, n)
     return _fast_bootstrapped_significance(
         y,
         max_lag=r["max_lag"],
         B=r["B"],
         block_len=r["block_len"],
-        bandwidth=(
-            params["hacBandwidth"]
-            if params["hacBandwidth"] != "auto"
-            else r["bandwidth"]
-        ),
+        bandwidth=r["bandwidth"],
         alpha=params["sigLevel"],
         use_fdr_end=(params["selectionMethod"] == "fdrAdjusted"),
         min_freq=params["stabilityFreq"] if params["requireStability"] else 0.0,
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4003831 and ca2acbf.

📒 Files selected for processing (2)
  • CandleNet/autoreg/lag_utils.py (5 hunks)
  • CandleNet/cache/synergy_cache.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
CandleNet/autoreg/lag_utils.py (4)
CandleNet/autoreg/fastlag/select.py (2)
  • _auto_nw_bandwidth (83-96)
  • _is_int_like (11-15)
CandleNet/_config.py (1)
  • LagConfig (27-48)
CandleNet/autoreg/fastlag/engine.py (1)
  • _ols_hac_beta_t_vectorized (6-77)
CandleNet/__init__.py (1)
  • lag_config (10-17)
🪛 Ruff (0.13.3)
CandleNet/cache/synergy_cache.py

179-179: Possible SQL injection vector through string-based query construction

(S608)

CandleNet/autoreg/lag_utils.py

341-341: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: lint
🔇 Additional comments (3)
CandleNet/autoreg/lag_utils.py (3)

396-561: LGTM! Appropriate conversion to internal function.

The rename to _bootstrapped_significance (underscore prefix) correctly signals this is now an internal implementation detail, with the new bootstrapped_from_config wrapper providing the public interface. The sorting logic remains correct.


564-692: LGTM! Consistent refactoring to internal function.

The rename to _fast_bootstrapped_significance is appropriate. The DataFrame column ordering is consistent between the early-return and main-path results, and the sorting logic correctly uses the by= parameter.


751-845: LGTM! Cleaner separation of concerns.

The refactor to use bootstrapped_from_config and fast_bootstrapped_from_config wrappers simplifies select_lags by delegating configuration resolution to dedicated functions. This improves maintainability and reduces duplication.

Replaces _auto_block_len with _infer_block_len in lag_utils.py and updates parameter naming in CorrCache.delete for clarity. These changes improve code readability and maintain consistency in function usage and logging.
@GongJr0 GongJr0 merged commit ccd931a into main Oct 4, 2025
6 of 7 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in CandleNet Oct 4, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
CandleNet/autoreg/lag_utils.py (1)

682-707: Minor code duplication with fast_bootstrapped_from_config.

Both bootstrapped_from_config and fast_bootstrapped_from_config (lines 710-735) follow nearly identical patterns:

  1. Get config via lag_config()
  2. Compute series length and RNG
  3. Resolve config with _resolve_lag_cfg
  4. Call the underscored function

Consider extracting the common setup into a helper function to reduce duplication.

Example:

def _prepare_config_and_rng(y: pd.Series) -> tuple[dict, dict, np.random.Generator]:
    """Prepare resolved config and RNG from global config and series."""
    params = lag_config()
    n = len(y)
    rand_seed = params.get("randomSeed")
    rng = np.random.default_rng(rand_seed)
    r = _resolve_lag_cfg(params, n)
    return params, r, rng

Then both wrappers could use:

def bootstrapped_from_config(y: pd.Series) -> pd.DataFrame:
    params, r, rng = _prepare_config_and_rng(y)
    return _bootstrapped_significance(
        y,
        max_lag=r["max_lag"],
        # ... rest of parameters
    )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ca2acbf and 474f172.

📒 Files selected for processing (2)
  • CandleNet/autoreg/lag_utils.py (5 hunks)
  • CandleNet/cache/synergy_cache.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
CandleNet/cache/synergy_cache.py (4)
CandleNet/cache/p2_cache.py (2)
  • delete (60-72)
  • TABLE_NAME (89-90)
CandleNet/cache/base_cache.py (4)
  • delete (62-63)
  • check_con (125-130)
  • TABLE_NAME (93-94)
  • _log (69-72)
CandleNet/cache/ticker_cache.py (2)
  • delete (153-163)
  • TABLE_NAME (190-191)
CandleNet/logger/logger_types.py (3)
  • LogType (4-8)
  • OriginType (11-13)
  • CallerType (16-23)
CandleNet/autoreg/lag_utils.py (4)
CandleNet/autoreg/fastlag/select.py (3)
  • _auto_nw_bandwidth (83-96)
  • _infer_block_len (99-101)
  • _is_int_like (11-15)
CandleNet/_config.py (1)
  • LagConfig (27-48)
CandleNet/autoreg/fastlag/engine.py (1)
  • _ols_hac_beta_t_vectorized (6-77)
CandleNet/__init__.py (1)
  • lag_config (10-17)
🪛 Ruff (0.13.3)
CandleNet/cache/synergy_cache.py

179-179: Possible SQL injection vector through string-based query construction

(S608)

CandleNet/autoreg/lag_utils.py

328-328: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF002)

🔇 Additional comments (4)
CandleNet/cache/synergy_cache.py (1)

176-186: LGTM! Consistent parameter naming.

The rename from sectors to sectors_id aligns with the database schema (line 207) and matches the naming used in insert (line 123) and fetch (line 144) methods. The SQL query correctly uses parameterized binding with the ? placeholder, making it safe from injection attacks.

Note: The static analysis warning (S608) about SQL injection is a false positive—the query is properly parameterized.
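To illustrate why the parameterized query is safe, here is a minimal self-contained sketch using the standard-library sqlite3 module. The table and column names mirror the review, but the schema and row values are invented for the example:

```python
import sqlite3

TABLE_NAME = "synergy_corr"  # trusted class-level constant, not user input

con = sqlite3.connect(":memory:")
con.execute(f"CREATE TABLE {TABLE_NAME} (sectors_id TEXT, corr REAL)")
con.execute(f"INSERT INTO {TABLE_NAME} VALUES (?, ?)", ("tech|energy", 0.42))

# Only the table name is interpolated; the user-supplied value is bound
# through a `?` placeholder, so it never becomes part of the SQL text.
con.execute(f"DELETE FROM {TABLE_NAME} WHERE sectors_id = ?", ("tech|energy",))
remaining = con.execute(f"SELECT COUNT(*) FROM {TABLE_NAME}").fetchone()[0]
```

Because the string interpolation touches only a class constant, S608 flags a pattern rather than an actual injection vector here.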

CandleNet/autoreg/lag_utils.py (3)

547-548: Verify the sorting key change is intentional.

The result sorting changed from by=["selected", "freq", "top_freq"] to by=["selected", "p_base", "freq"]. This prioritizes base-sample p-values over the frequency of being the top-ranked lag, which affects how equally-stable lags are ordered in the output.

Ensure this change aligns with the intended selection logic and user expectations.
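The effect of the key change can be seen on a toy frame. The column values and the ascending directions below are invented for illustration, not taken from the project:

```python
import pandas as pd

df = pd.DataFrame({
    "lag":      [1, 2, 3],
    "selected": [True, True, False],
    "p_base":   [0.03, 0.01, 0.20],
    "freq":     [0.95, 0.80, 0.10],
    "top_freq": [0.70, 0.20, 0.05],
})

# Old ordering: stability first, then how often the lag ranked top.
old = df.sort_values(by=["selected", "freq", "top_freq"], ascending=[False, False, False])
# New ordering: base-sample p-value breaks ties among selected lags.
new = df.sort_values(by=["selected", "p_base", "freq"], ascending=[False, True, False])

print(list(old["lag"]))  # [1, 2, 3]
print(list(new["lag"]))  # [2, 1, 3]
```

Under the new keys, a selected lag with a smaller base p-value outranks a more frequently top-ranked one.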


551-679: LGTM! Improved consistency between fast and slow paths.

The updates align the fast implementation with the slow version:

  • Added FDR handling (lines 600-603)
  • Expanded DataFrame output to include all fields (lines 661-673)
  • Unified sorting logic (lines 677-679)

This ensures both engines produce compatible results.
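For reference, the FDR adjustment that produces p_fdr is typically the Benjamini-Hochberg step-up procedure; a minimal sketch of how it could be derived from p_base follows (the project may instead rely on statsmodels.stats.multitest.multipletests):

```python
import numpy as np

def bh_adjust(p: np.ndarray) -> np.ndarray:
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # each adjusted value is the minimum over all larger raw p-values
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(ranked, 0.0, 1.0)
    return out

p_base = np.array([0.001, 0.02, 0.04, 0.30])
p_fdr = bh_adjust(p_base)
selected = p_fdr < 0.05
```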


766-768: LGTM! Clean integration of config-driven wrappers.

The updated calls to bootstrapped_from_config and fast_bootstrapped_from_config simplify select_lags by delegating config resolution to the wrapper functions. This improves separation of concerns and reduces coupling.

Comment on lines +317 to +380
def _resolve_lag_cfg(params: LagConfig, n: int) -> dict:
    # max_lag tested
    """
    Resolve a LagConfig mapping into concrete numeric parameters used for lag testing and bootstrapping.

    This converts potentially symbolic or "auto" entries in `params` into integer values appropriate for
    a series of length `n`, applying sensible bounds and heuristics where needed.

    Parameters:
        params (LagConfig): Configuration mapping containing keys:
            - "maxLag": maximum lag to consider (may be numeric or "auto"-like value).
            - "hacBandwidth": Newey–West/HAC bandwidth or "auto".
            - "blockLen": circular block bootstrap block length or "auto".
            - "bootstrapSamples": number of bootstrap replicates or "auto".
            - "maxLagsSelected": cap on number of selected lags or "auto".
            - "minBootstrapSamples", "minLagsSelected": minimums used when resolving "auto".
        n (int): Length of the time series; used to clamp and derive data-dependent defaults.

    Returns:
        dict: A mapping with integer-valued keys:
            - "max_lag": selected max lag (clamped to at least 1 and at most n-2).
            - "bandwidth": resolved HAC bandwidth as an int.
            - "block_len": resolved block length for CBB as an int.
            - "B": number of bootstrap replicates as an int.
            - "max_selected": maximum number of lags to retain as an int (>= 0).
    """
    max_lag = int(params["maxLag"])
    max_lag = max(1, min(max_lag, n - 2))

    # bandwidth
    bw = params["hacBandwidth"]
    if isinstance(bw, str) and bw == "auto":
        bw = _auto_nw_bandwidth(n)

    # block length
    bl = params["blockLen"]
    if isinstance(bl, str) and bl == "auto":
        bl = _infer_block_len(n)

    # bootstrap samples
    B = params["bootstrapSamples"]
    if isinstance(B, str) and B == "auto":
        # heuristic: proportional to tested lags, capped
        B = max(params["minBootstrapSamples"], min(300, 20 * max_lag))

    # max lags selected
    msel_cfg = params["maxLagsSelected"]
    if isinstance(msel_cfg, str) and msel_cfg == "auto":
        msel = max(params["minLagsSelected"], min(5, max_lag))
    else:
        msel = msel_cfg

    assert _is_int_like(msel) and msel >= 0, (
        f"Unsupported maxLagsSelected: {msel_cfg}. "
        f"Must be a non-negative integer or 'auto'."
    )

    return {
        "max_lag": max_lag,
        "bandwidth": int(bw),
        "block_len": int(bl),
        "B": int(B),
        "max_selected": int(msel),
    }
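As context for the blockLen / block_len parameter resolved above: a circular block bootstrap (CBB) replicate stitches together randomly started, wrap-around blocks of the series. The following is a minimal sketch of one replicate; the real _infer_block_len heuristic and engine internals are not shown here:

```python
import numpy as np

def cbb_resample(y: np.ndarray, block_len: int, rng: np.random.Generator) -> np.ndarray:
    """One circular-block-bootstrap replicate: concatenate random wrapped blocks."""
    n = len(y)
    n_blocks = int(np.ceil(n / block_len))
    starts = rng.integers(0, n, size=n_blocks)
    # each row is a contiguous block, wrapping past the end of the series
    idx = (starts[:, None] + np.arange(block_len)[None, :]) % n
    return y[idx].ravel()[:n]

rng = np.random.default_rng(0)
y = np.arange(100, dtype=float)
rep = cbb_resample(y, block_len=5, rng=rng)
```

Repeating this B times and recomputing the lag statistics on each replicate yields the bootstrap distribution behind p_base and freq.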

⚠️ Potential issue | 🟡 Minor

Fix EN DASH character in docstring.

The docstring contains an EN DASH (–) character on line 328 in "Newey–West/HAC". Use a regular hyphen-minus (-) instead for better compatibility and to satisfy linting rules.

Apply this diff:

-        params (LagConfig): Configuration mapping containing keys:
-            - "maxLag": maximum lag to consider (may be numeric or "auto"-like value).
-            - "hacBandwidth": Newey–West/HAC bandwidth or "auto".
+        params (LagConfig): Configuration mapping containing keys:
+            - "maxLag": maximum lag to consider (may be numeric or "auto"-like value).
+            - "hacBandwidth": Newey-West/HAC bandwidth or "auto".

Note: The past review comment about duplicate _auto_block_len logic appears to have been addressed—this function now calls _infer_block_len(n) at line 354 instead of duplicating the logic.

Based on past review comments.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-            - "hacBandwidth": Newey–West/HAC bandwidth or "auto".
+            - "hacBandwidth": Newey-West/HAC bandwidth or "auto".
🧰 Tools
🪛 Ruff (0.13.3)

328-328: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF002)

🤖 Prompt for AI Agents
CandleNet/autoreg/lag_utils.py around lines 317-380: the docstring at line ~328
uses an EN DASH in "Newey–West/HAC"; replace the EN DASH (–) with a regular
hyphen-minus ("-") so it reads "Newey-West/HAC", save the file, and run the
linter to confirm the character issue is resolved.


Labels

enhancement New feature or request

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

1 participant