Skip to content

Conversation

@devonjkohler
Copy link
Contributor

@devonjkohler devonjkohler commented Sep 9, 2025

Motivation and Context

This branch implements the anomaly detection model in MSstats+ allowing us to incorporate information about the underlying peak quality into the differential analysis. Currently only implemented directly into the Spectronaut converter, but provides functionalities to be implemented for any tool.

Changes

  • Added anomaly detection model including C++ code for isolation forest
  • Updated Spectronaut converter to provide users with the functionalities to call model.
  • Unit tests and documentation for above

Testing

Please describe any unit tests you added or modified to verify your changes.

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

Summary by CodeRabbit

  • New Features
    • Optional anomaly detection with configurable features, temporal directions, run order, model size, and parallel cores.
    • New public functions: MSstatsAnomalyScores and CheckDataHealth.
    • Spectronaut import: anomaly-scoring options, feature selection, and explicit excluded-from-quantification filter.
    • Anomaly metrics propagated through preprocessing and balanced-design steps; peak-quality columns retained.
  • Documentation
    • Added/updated help for new functions and anomaly-related parameters.
  • Tests
    • Extensive tests for anomaly scoring, Spectronaut converter options, and data-health checks.
  • Chores
    • Compiled performance improvements and OpenMP support.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 9, 2025

Walkthrough

Adds anomaly-detection and data-health capabilities: new Rcpp isolation-forest implementation (with OpenMP and R wrappers), R preprocessing/feature-engineering and validation utilities, propagation of anomaly_metrics through preprocessing and balanced-design, Spectronaut cleaning/converter options to compute AnomalyScores, new public functions MSstatsAnomalyScores and CheckDataHealth, expanded tests and documentation.

Changes

Cohort / File(s) Summary
Package manifest & registration
DESCRIPTION, NAMESPACE
Add Rcpp in Imports/LinkingTo, import parallel, enable dynamic library registration; export MSstatsAnomalyScores, CheckDataHealth; update Collate with new R files and RcppExports.R.
Core API & public funcs
R/MSstatsConvert_core_functions.R, man/MSstatsAnomalyScores.Rd, man/CheckDataHealth.Rd, man/MSstatsPreprocess.Rd, man/MSstatsBalancedDesign.Rd, man/MSstatsClean.Rd, man/SpectronauttoMSstatsFormat.Rd
Add anomaly_metrics param to MSstatsPreprocess/MSstatsBalancedDesign; add new exported functions MSstatsAnomalyScores and CheckDataHealth; update Rd docs for new params and behaviors.
Spectronaut cleaning & converter
R/clean_Spectronaut.R, R/converters_SpectronauttoMSstatsFormat.R, inst/tinytest/test_cleanRaw.R, inst/tinytest/test_converters_SpectronauttoMSstatsFormat.R
Add parameters for excludedFromQuantificationFilter, calculateAnomalyScores, anomalyModelFeatures/temporal, runOrder, model tuning and parallelism; wire validation and conditional anomaly-scoring pipeline; update/extend tests.
Anomaly feature engineering & model runner (R)
R/utils_anomaly_score.R, R/utils_dt_operations.R, R/utils_clean_features.R, R/utils_balanced_design.R, R/utils_classes.R
New .prepareSpectronautAnomalyInput, feature rolling helpers (.add_*), .runAnomalyModel; propagate and preserve anomaly_metrics through selection/summarization/formatting; modify selection to include anomaly columns.
Data-health utilities & tests
R/utils_data_health.R, inst/tinytest/test_utils_data_health.R
New checks: missingness, intensity distribution, per-feature SD/outliers/coverage, anomaly-skew; CheckDataHealth orchestration; unit tests added.
Validation & parameter checks
R/utils_checks.R, man/dot-validateMSstatsConverterParameters.Rd
New .validateMSstatsConverterParameters(config) implementing comprehensive parameter validation for converter and anomaly settings; added docs.
Rcpp / C++ backend & build
src/isolation_forest.cpp, src/RcppExports.cpp, R/RcppExports.R, src/Makevars, src/Makevars.win
Add C++ isolation-forest implementation and Rcpp wrappers/registration; enable OpenMP flags and link BLAS/LAPACK; expose calculate_anomaly_score to R.
Model tests & anomaly test suite
inst/tinytest/test_utils_anomaly_score.R
Add extensive tests for internal anomaly-feature calculators, input preparation, model behavior, reproducibility, and ranking/outlier expectations.
Misc docs & metadata
R/utils_MSstatsConvert.R, R/utils_annotation.R, R/utils_filtering.R, multiple man/* small updates
Roxygen/doc tweaks, rename/align parameter docs, docstring/usage updates across affected internal functions.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant Converter as SpectronauttoMSstatsFormat
  participant Validator as .validateMSstatsConverterParameters
  participant Cleaner as MSstatsClean/.cleanRawSpectronaut
  participant Preproc as MSstatsPreprocess/MSstatsBalancedDesign
  participant AnomPrep as .prepareSpectronautAnomalyInput
  participant ModelR as .runAnomalyModel / MSstatsAnomalyScores
  participant Cpp as calculate_anomaly_score (Rcpp)

  User->>Converter: call(..., calculateAnomalyScores?, anomalyModelFeatures, runOrder, n_trees, max_depth, numberOfCores, excludedFromQuantificationFilter)
  Converter->>Validator: validate config
  Validator-->>Converter: OK
  Converter->>Cleaner: MSstatsClean(intensity, calculateAnomalyScores, anomalyModelFeatures)
  Cleaner-->>Converter: cleaned data
  Converter->>Preproc: MSstatsPreprocess(..., anomaly_metrics) → MSstatsBalancedDesign(..., anomaly_metrics)
  Preproc-->>Converter: balanced data
  alt calculateAnomalyScores == TRUE
    Converter->>AnomPrep: build PSMs, scale, add temporal features
    AnomPrep-->>ModelR: enriched data (split by PSM)
    ModelR->>Cpp: per-PSM call calculate_anomaly_score(n_trees, max_depth)
    Cpp-->>ModelR: avg_depth, anomaly_score
    ModelR-->>Converter: attach AnomalyScores
  else
    Note over Converter: Skip anomaly scoring
  end
  Converter-->>User: MSstats-format data (with optional AnomalyScores)
Loading
sequenceDiagram
  autonumber
  participant User
  participant Health as CheckDataHealth
  participant DH as utils_data_health

  User->>Health: CheckDataHealth(input)
  Health->>DH: .checkMissing, .checkIntensityDistribution
  Health->>DH: .checkFeatureSD -> .checkFeatureOutliers -> .checkFeatureCoverage
  opt AnomalyScores present
    Health->>DH: .checkAnomalySkew
  end
  DH-->>Health: metrics, tables
  Health-->>User: list(feature_data, skew_results)
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers

  • mstaniak

Pre-merge checks (2 warnings, 1 inconclusive)

❌ Failed checks (2 warnings, 1 inconclusive)
Check name Status Explanation Resolution
Description Check ⚠️ Warning While the pull request description includes all the required template headings, the Changes section is overly high‐level and the Testing section remains a placeholder without details on the unit tests that were added or modified to verify the new anomaly detection features. Expand the Changes section with detailed bullet points that map to each component or file modified and complete the Testing section by enumerating the specific unit tests added or updated, including their purpose and scope.
Docstring Coverage ⚠️ Warning Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title Check ❓ Inconclusive The title “Feature anomaly model” touches on the broad theme of the changes but is too generic and does not succinctly convey the main contributions, such as the addition of an isolation‐forest anomaly detector, updates to the Spectronaut converter, and new data health utilities. Revise the title to clearly summarize the primary change in a single concise sentence, for example: “Add isolation forest–based anomaly detection and data health checks to Spectronaut converter.”

Poem

I nibbled the logs and hopped through each tree,
scoped features and runs, then danced with glee.
With metrics in paw and an isolation nose,
I sniff anomalies where the odd value grows.
Data health checked — carrot praise, code shows. 🥕

✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature-anomaly_model

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 36

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
R/utils_clean_features.R (1)

96-122: All-NA groups get dropped in anomaly path; fix selection to be NA-safe and stable

max(Intensity, na.rm=TRUE) returns -Inf for all-NA groups, causing no join match and silent row loss. Also, make tie-breaking deterministic.

-    } else if (length(anomaly_metrics) > 0){
-        
-        input[, row_id := .I]
-        max_per_group = input[, .(max_intensity = max(Intensity, na.rm = TRUE)),
-                              by = feature_columns]
-        
-        input = input[max_per_group, 
-                         on = c(feature_columns, "Intensity" = "max_intensity"),
-                         nomatch = 0L, mult = "first"]
-        input[, row_id := NULL]
-        
+    } else if (length(anomaly_metrics) > 0){
+        input[, row_id := .I]
+        # Replace NA with -Inf for selection only; keep original Intensity
+        input[, Intensity2 := ifelse(is.na(Intensity), -Inf, Intensity)]
+        max_per_group = input[, .(max_intensity = max(Intensity2)), by = feature_columns]
+        data.table::setorder(input, row_id)  # ensure 'first' is stable
+        input = input[max_per_group,
+                      on = c(feature_columns, "Intensity2" = "max_intensity"),
+                      nomatch = 0L, mult = "first"]
+        input[, c("row_id", "Intensity2") := NULL]
man/CheckDataHealth.Rd (1)

1-18: Align Rd documentation with implementation
Update the \value section to describe a list with two data.tables—feature_data (containing an Order column plus moving-window features such as mean_increase, mean_decrease, dispersion_increase generated by .add_features) and skew_results (with columns PSM and skew, the Pearson skewness of AnomalyScores).

🧹 Nitpick comments (48)
R/utils_filtering.R (2)

93-97: Doc param should state it accepts a vector.

Roxygen says “regular expression”; function treats patterns as a vector. Clarify to “character vector of regular expressions”.

Apply in roxygen (this file), then re-run roxygen:

-#' @param patterns chr, regular expression - matching peptides will be 
+#' @param patterns chr, vector of regular expressions — matching peptides will be 

107-114: Vectorize regex filtering for clarity and speed.

Current loop builds an AND over negated grepl. Equivalent, simpler, and faster is a single grepl with an OR’d pattern.

-            pattern_filter = rep(TRUE, nrow(input))
-            for (pattern in patterns) {
-                pattern_filter = pattern_filter & !grepl(pattern, input[[col_name]])
-            }
+            if (length(patterns)) {
+                or_pat <- paste0("(", paste(patterns, collapse = ")|("), ")")
+                pattern_filter = !grepl(or_pat, input[[col_name]])
+            } else {
+                pattern_filter = rep(TRUE, nrow(input))
+            }
src/Makevars.win (1)

1-2: Use SHLIB_OPENMP_CFLAGS in PKG_LIBS and place it last (per R-exts guidance).

Improves cross-platform linking; keep BLAS/LAPACK/FLIBS before the OpenMP flag.

 PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS)
-PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS) $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS)
+PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) $(SHLIB_OPENMP_CFLAGS)
man/dot-filterFewMeasurements.Rd (1)

23-23: Clarify type: use “character vector” and Rd code markup.

“chr” is informal; prefer Rd-consistent wording.

-\item{feature_columns}{chr, vector of names of columns that define features.}
+\item{feature_columns}{\code{character} vector of names of columns that define features.}
man/dot-getFullDesign.Rd (1)

18-18: Use Rd markup instead of backticks; consider argument order alignment.

Switch to \code{}; optional: list arguments in the same order as in usage().

-\item{`feature_col`}{name of the column that labels features}
+\item{\code{feature_col}}{name of the column that labels features}
src/Makevars (1)

1-2: Mirror OpenMP linking change on non-Windows builds.

Place OpenMP flag last and use SHLIB_OPENMP_CFLAGS for linking.

 PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS)
-PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS) $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS)
+PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) $(SHLIB_OPENMP_CFLAGS)
man/MSstatsPreprocess.Rd (1)

61-61: Align default description with signature.

Default is c(), not “missing”.

-\item{anomaly_metrics}{character vector of names of columns with quality metrics. Default is missing and is not required if anomaly model not run.}
+\item{anomaly_metrics}{character vector of names of columns with quality metrics. Default is \code{c()} and is not required unless the anomaly model is run.}
man/MSstatsConvert.Rd (1)

21-34: Confirm intent to mark package doc as internal and verify author emails

  • \keyword{internal} will hide the package page from typical help indices; confirm that’s intended.
  • Please verify author emails (e.g., “wu.anthon@…” looks truncated).
man/dot-summarizeMultipleMeasurements.Rd (1)

7-12: Clarify behavior when anomaly_metrics is provided

When anomaly_metrics ≠ c(), implementation selects the single row per group with max Intensity (not aggregator()). Please note this explicitly in Description/Arguments to avoid confusion.

Also applies to: 21-22

R/utils_balanced_design.R (2)

19-24: Guard against row blow-up when merging anomaly metrics in fill_missing path

If multiple rows exist per intensity_ids (same feature/run/channel/etc.) with differing anomaly metrics, merge may duplicate all_possibilities. Deduplicate per intensity_ids before the merge.

-        intensities = intersect(c(intensity_ids, "Intensity",
-                                  "isZero", anomaly_metrics), 
-                                colnames(input))
-        input = merge(all_possibilities, 
-                      unique(input[, intensities, with = FALSE]),
-                      all.x = TRUE, by = intensity_ids)
+        intensities = intersect(c(intensity_ids, "Intensity", "isZero", anomaly_metrics),
+                                colnames(input))
+        intens_dt = unique(input[, intensities, with = FALSE])
+        # ensure uniqueness per key to avoid cartesian expansion
+        intens_dt = intens_dt[ , .SD[1L], by = intensity_ids]
+        input = merge(all_possibilities, intens_dt, all.x = TRUE, by = intensity_ids)

Also applies to: 44-49


68-71: Roxygen param names shouldn’t be backticked

Use @param feature_col and @param measurement_col (without backticks) so roxygen maps them correctly.

-#' @param `feature_col` name of the column that labels features
-#' @param `measurement_col` name of a column with measurement labels - Runs in
+#' @param feature_col name of the column that labels features
+#' @param measurement_col name of a column with measurement labels - Runs in
man/dot-makeBalancedDesign.Rd (1)

12-16: Minor docs polish

Combine the anomaly_metrics sentence; “and marked as NA” reads detached here.

-\item{anomaly_metrics}{character vector of quality metric column names to be
-used as features in an anomaly detection model.
-and marked as NA}
+\item{anomaly_metrics}{character vector of quality-metric column names to be
+used as features in an anomaly detection model; missing values are marked as \code{NA}.}
DESCRIPTION (2)

27-29: Rcpp and stringi in Imports — OK; consider SystemRequirements for C++/OpenMP

Given the new C++ isolation forest (and OpenMP per PR description), declare SystemRequirements to keep CRAN/Bioc checks clean.

 Encoding: UTF-8
 LazyData: true
 Roxygen: list(markdown = TRUE)
 RoxygenNote: 7.3.2
 biocViews: MassSpectrometry, Proteomics, Software, DataImport, QualityControl
+SystemRequirements: C++11, OpenMP

51-51: Collate includes RcppExports.R — optional ordering tweak

RcppExports.R can be placed early in Collate for clarity, though not required if no top-level calls happen before load.

R/RcppExports.R (1)

1-2: Gitleaks false positive on generator token

The generator token line is auto-generated by Rcpp; it’s not a secret. If your CI flags it, add an allowlist rule for this pattern.

Example gitleaks allowlist:

[rules.allowlist]
regexTarget = "match"
regexes = ['10BE3573-1514-4C36-9D1C-5A225CD40393']
R/utils_classes.R (1)

12-14: Threading anomaly_metrics into column selection — OK; validate names early

Looks correct. Consider validating that all anomaly_metrics exist in input and warning on drops to aid users.

Example (in .selectMSstatsColumns or here):

missing <- setdiff(anomaly_metrics, names(input))
if (length(missing)) warning("Unknown anomaly_metrics dropped: ", paste(missing, collapse = ", "))
man/dot-cleanRawSpectronaut.Rd (1)

7-12: Document defaults for new params

If the implementation uses defaults (e.g., calculateAnomalyScores = FALSE, anomalyModelFeatures = NULL), surface them in usage to avoid ambiguity.

-.cleanRawSpectronaut(
-  msstats_object,
-  intensity,
-  calculateAnomalyScores,
-  anomalyModelFeatures
-)
+.cleanRawSpectronaut(
+  msstats_object,
+  intensity,
+  calculateAnomalyScores = FALSE,
+  anomalyModelFeatures = NULL
+)

Also ensure these defaults are present in R/clean_Spectronaut.R roxygen.

man/CheckDataHealth.Rd (3)

5-5: Clarify scope and how to satisfy the anomaly-model requirement.

Avoid tool-specific phrasing and tell users how to fit/provide the model (e.g., via MSstatsAnomalyScores) so the prerequisite is actionable.

Apply this doc tweak:

-\title{Takes as input the output of the SpectronauttoMSstatsFormat function and calculates various quality metrics to assess the health of the data. Requires Anomaly Detection model to be fit.}
+\title{Compute data-health metrics for MSstats-formatted input. Requires an anomaly-detection model (e.g., computed via \code{MSstatsAnomalyScores}) to be available.}
-\description{
-Takes as input the output of the SpectronauttoMSstatsFormat function and calculates various quality metrics to assess the health of the data. Requires Anomaly Detection model to be fit.
-}
+\description{
+Takes as input an MSstats-formatted table (e.g., from \code{SpectronauttoMSstatsFormat}) and calculates quality metrics to assess data health. Requires that anomaly scores have been computed (e.g., via \code{MSstatsAnomalyScores}) or provided in the input.
+}

Also applies to: 16-16


10-11: Broaden and specify the input contract.

Make clear it accepts generic MSstats-format, not only Spectronaut, and state required columns.

-\item{input}{MSstats input which is the output of Spectronaut converter}
+\item{input}{A \code{data.table} in MSstats format (e.g., from \code{SpectronauttoMSstatsFormat}); must contain standard MSstats columns and, if available, anomaly-score columns.}

12-14: Document the returned list elements explicitly.

Naming the two elements reduces ambiguity for downstream users/tests.

-\value{
-list of two data.tables
-}
+\value{
+A named list of two \code{data.table}s:
+\itemize{
+  \item \code{feature_data}: per-feature metrics (e.g., coverage, dispersion, outliers).
+  \item \code{skew_results}: distributional summaries (e.g., skewness of \code{AnomalyScores}).
+}
+}
man/MSstatsBalancedDesign.Rd (1)

39-40: Explain how anomaly_metrics are used.

State that these columns are carried through/considered in selection and any constraints.

-\item{anomaly_metrics}{character vector of names of columns with quality metrics}
+\item{anomaly_metrics}{Character vector of column names with quality/peak metrics (e.g., anomaly scores) to retain and propagate during balancing and feature selection. Defaults to \code{character(0)} (none).}
man/MSstatsClean.Rd (1)

115-118: Tighten argument docs and interaction

-\item{calculateAnomalyScores}{logical, whether to calculate anomaly scores}
+\item{calculateAnomalyScores}{Logical; if \code{TRUE}, compute anomaly scores during cleaning (default: \code{FALSE}).}

-\item{anomalyModelFeatures}{character vector, specifies which columns will be used for anomaly detection model. Can be NULL if calculateAnomalyScores=FALSE.}
+\item{anomalyModelFeatures}{Character vector of column names used as features by the anomaly-detection model (ignored when \code{calculateAnomalyScores = FALSE}). Typically includes intensity or quality metrics per PSM/fragment.}
inst/tinytest/test_utils_anomaly_score.R (6)

103-113: Correlation threshold is brittle without determinism.

With seeding, this is fine; otherwise lower the threshold or switch to Spearman on ranks.

-correlation = cor(scores1$AnomalyScores, scores2$AnomalyScores)
-expect_true(correlation > 0.95,
+correlation = cor(scores1$AnomalyScores, scores2$AnomalyScores, method = "spearman")
+expect_true(correlation > 0.9,
             info = "Multiple runs should produce correlated results")

296-303: Use explicit tolerances for floating-point equality.

Be robust to minor numeric differences.

-expect_equal(
+expect_equal(
     mean_increase,
     c(0.00, 0.00, 0.26, 0.00, 0.00, 
       0.00, 0.50, 0.00, 0.00, 0.00, 
       0.00, 0.00, 0.00, 0.00, 0.55, 
       0.87, 1.11, 1.42, 1.71, 2.94),
-    info = "Mean increase calculation should match expected values"
+    tolerance = 1e-8,
+    info = "Mean increase calculation should match expected values"
 )
-expect_equal(
+expect_equal(
     mean_decrease,
     c(0.00, 0.00, 0.00, 0.29, 0.50, 
       0.56, 0.00, 1.16, 0.99, 0.15, 
       0.00, 0.00, 1.27, 1.84, 0.29, 
       0.00, 0.00, 0.00, 0.00, 0.00),
-    info = "Mean decrease calculation should match expected values"
+    tolerance = 1e-8,
+    info = "Mean decrease calculation should match expected values"
 )

Also applies to: 305-313


243-245: Remove noisy console output in tests.

Printing can clutter CI logs; prefer assertions only.

-cat("Top 5 anomalous rows:", order(regression_scores$AnomalyScores, decreasing = TRUE)[1:5], "\n")
-cat("Score range:", range(regression_scores$AnomalyScores), "\n")
+invisible(NULL)
-cat("All anomaly detection tests completed successfully!\n")
+invisible(NULL)

Also applies to: 324-324


275-282: Fix mismatched comment about expected ranking.

The data make Row 6 highest, Row 5 second, Row 4 third.

-# Row 5 should have highest score, Row 4 second highest, etc.
+# Expected: Row 6 > Row 5 > Row 4 > Rows 1–3

8-16: Prefer public APIs in tests where possible.

Relying on non-exported functions via \code{:::} is brittle. Add at least one integration test using \code{MSstatsAnomalyScores} to validate end-to-end behavior; keep internals for unit tests.

I can draft a tinytest block invoking MSstatsAnomalyScores on a minimal dataset and asserting expected ordering—want me to add it?

Also applies to: 292-295


9-11: Optional: speed up CI by reducing trees.

For tiny synthetic data, fewer trees should preserve ordering and cut runtime.

-    anomaly_output = MSstatsConvert:::.runAnomalyModel(df, 
-                                                        n_trees=100, 
+    anomaly_output = MSstatsConvert:::.runAnomalyModel(df, 
+                                                        n_trees=50, 
                                                         max_depth="auto", 
man/MSstatsAnomalyScores.Rd (2)

24-36: Specify allowed values and semantics for temporal_direction

List the permitted values (e.g., mean_increase, mean_decrease, dispersion_increase), required length = length(quality_metrics), and what each produces. Add validation notes if mismatched.


38-43: Return value is underspecified

Document the returned columns (e.g., AnomalyScores, avg_depth), their ranges, and how they map back to input rows/runs. This helps downstream users/tests.

man/dot-validateMSstatsConverterParameters.Rd (1)

18-45: Tighten parameter contracts (types and ranges) in docs to match validation

Be explicit:

  • n_trees: integer > 0
  • max_depth: integer > 0 or "auto"
  • removeMissingFeatures: 0 ≤ x ≤ 1 (fraction)
  • anomalyModelFeatureCount: integer ≥ 1
  • numberOfCores: integer ≥ 1
  • anomalyModelFeatures/anomalyModelFeatureTemporal: non-empty; temporal values from a fixed set
    Update details to reflect these constraints so tests and users know expected types.
inst/tinytest/test_cleanRaw.R (2)

198-204: Also assert presence of the new columns, not just column count

Counting columns can pass while names drift. Add checks for EGQvalue/PGQvalue, FPossibleInterference, FExcludedFromQuantification when calculateAnomalyScores = FALSE.

 expect_equal(ncol(sn_cleaned), 12)
 expect_true(nrow(sn_cleaned) > 0)
+expect_true(all(c("EGQvalue","FPossibleInterference","FExcludedFromQuantification") %in% colnames(sn_cleaned)) ||
+            all(c("PGQvalue","FPossibleInterference","FExcludedFromQuantification") %in% colnames(sn_cleaned)))

205-221: Strengthen anomaly-feature assertions

Verify that requested anomalyModelFeatures are actually present post-cleaning (not just column count).

 expect_equal(ncol(sn_cleaned_2), 16)
 expect_true(nrow(sn_cleaned_2) > 0)
+expect_true(all(c("FGShapeQualityScore(MS2)",
+                  "FGShapeQualityScore(MS1)",
+                  "EGApexRT") %in% colnames(sn_cleaned_2)))
src/RcppExports.cpp (1)

13-24: Optional: avoid copying DataFrame at the C++ boundary

If performance becomes a concern, change calculate_anomaly_score signature to const DataFrame& and regenerate Rcpp exports. Not urgent.

inst/tinytest/test_converters_SpectronauttoMSstatsFormat.R (1)

60-88: Minor: make temporal ordering deterministic across environments

Using seq(1:nrow(temporal)) is fine, but rely on an explicit order (e.g., by Run) to avoid potential nondeterminism from unique().

-temporal$Order = seq(1:nrow(temporal))
+data.table::setorder(temporal, Run)
+temporal$Order = seq_len(nrow(temporal))
R/clean_Spectronaut.R (2)

19-26: Consistency: if FExcludedFromQuantification is required, don’t treat it as optional

.validateSpectronautInput requires FExcludedFromQuantification. Using .findAvailable for exclude_col implies optional. Either remove it from required list or include directly here without .findAvailable.

-  exclude_col = .findAvailable(c("FExcludedFromQuantification"), 
-                               colnames(spec_input))
+  exclude_col = "FExcludedFromQuantification"

23-27: Broaden EG Q-value alias handling

Older Spectronaut exports sometimes use EG.Q.Value. Consider .findAvailable(c("EGQvalue","EG.Q.Value")).

-  cols = c("PGProteinGroups", "EGModifiedSequence", "FGCharge", "FFrgIon", 
-           f_charge_col, "RFileName", "RCondition", "RReplicate", 
-           "EGQvalue", pg_qval_col, interference_col, exclude_col,
+  eg_qval_col = .findAvailable(c("EGQvalue","EG.Q.Value"), colnames(spec_input))
+  cols = c("PGProteinGroups", "EGModifiedSequence", "FGCharge", "FFrgIon", 
+           f_charge_col, "RFileName", "RCondition", "RReplicate", 
+           eg_qval_col, pg_qval_col, interference_col, exclude_col,
            paste0("F", intensity))
inst/tinytest/test_utils_data_health.R (2)

126-130: Make the missing-coverage assertion deterministic.

Using result_missing$Feature[1] is order-dependent. Select the known Feature you NAd to avoid flaky tests.

-first_feature = result_missing$Feature[1]
-first_feature_missing = result_missing[Feature == first_feature]$percent_missing
+target_feature = paste("AEFEEQNVR", "2", "y3", "1", sep="_")
+first_feature_missing = result_missing[Feature == target_feature]$percent_missing

161-164: Remove debug note and clarify finite skew expectation.

The “(FAIL)” note is misleading. If there’s any per-PSM uniformity upstream, skew can be NaN. Either remove the note or widen the check to match your uniform-case logic.

-# Check that skewness values are numeric and finite (FAIL)
-expect_true(all(is.numeric(result$skew)), info = ".checkAnomalySkew returns numeric skewness values")
-expect_true(all(is.finite(result$skew)), info = ".checkAnomalySkew returns finite skewness values")
+# Check that skewness values are numeric and well-formed
+expect_true(is.numeric(result$skew), info = ".checkAnomalySkew returns numeric skewness values")
+expect_true(all(is.finite(result$skew) | is.nan(result$skew)),
+            info = ".checkAnomalySkew returns finite/NaN skewness depending on variance")
R/utils_checks.R (1)

229-231: Be permissive for integer-like run order and avoid partial recycling.

is.numeric() excludes integer64/factors; assertNumber on the vector is clearer. Also enforce no NA.

-            if (!is.numeric(config$runOrder$Order)) {
-                stop("runOrder$Order must be numeric")
-            }
+            checkmate::assertNumeric(config$runOrder$Order, any.missing = FALSE, all.missing = FALSE)
src/isolation_forest.cpp (1)

15-15: Fixed RNG seed hurts reproducibility across sessions and users.

Either accept a seed from R or draw from R’s RNG via R::rnorm/Runif or pass a seed parameter.

R/utils_data_health.R (1)

41-41: Magic number 0.5 should be parameterized

The threshold value 0.5 for high ratio detection is hardcoded. Consider making this configurable for flexibility.

Consider adding a parameter to make the threshold configurable:

-.checkFeatureSD = function(input){
+.checkFeatureSD = function(input, ratio_threshold = 0.5){
     sd_data = input[, .(sd_Intensity = sd(log2(Intensity), na.rm = TRUE),
                         mean_Intensity = mean(log2(Intensity), na.rm = TRUE),
                         ratio = sd(log2(Intensity), na.rm = TRUE) / mean(
                             log2(Intensity), na.rm = TRUE)), 
                     by = Feature]
     
-    high_ratio = nrow(sd_data[sd_data$ratio > .5]) / nrow(sd_data)
+    high_ratio = nrow(sd_data[sd_data$ratio > ratio_threshold]) / nrow(sd_data)
R/utils_anomaly_score.R (1)

10-15: Consider using data.table syntax more efficiently

The Fragment and PSM construction could be more efficient using data.table's := operator.

-    input$Fragment = paste(input$PeptideSequence,
-                           input$PrecursorCharge, 
-                           input$FragmentIon,
-                           input$ProductCharge, sep="_")
-    input$PSM = paste(input$PeptideSequence,
-                           input$PrecursorCharge, sep="_")
+    input[, `:=`(
+        Fragment = paste(PeptideSequence, PrecursorCharge, 
+                        FragmentIon, ProductCharge, sep="_"),
+        PSM = paste(PeptideSequence, PrecursorCharge, sep="_")
+    )]
man/SpectronauttoMSstatsFormat.Rd (2)

60-60: Documentation inconsistency for anomalyModelFeatureTemporal

The documentation mentions possible values but doesn't clarify if multiple temporal features can be applied to the same quality metric.

Consider clarifying the documentation:

-\item{anomalyModelFeatureTemporal}{character vector of temporal direction corresponding to columns passed to anomalyModelFeatures. Values must be one of: \code{mean_decrease}, \code{mean_increase}, \code{dispersion_increase}, or NULL (to perform no temporal feature engineering). Default is empty vector. If calculateAnomalyScores=TRUE, vector must have as many values as anomalyModelFeatures (even if all NULL).}
+\item{anomalyModelFeatureTemporal}{character vector of temporal direction corresponding to columns passed to anomalyModelFeatures. Each value must be one of: \code{mean_decrease}, \code{mean_increase}, \code{dispersion_increase}, or NULL (to perform no temporal feature engineering). Default is empty vector. If calculateAnomalyScores=TRUE, vector must have exactly as many values as anomalyModelFeatures, with a one-to-one correspondence (even if all NULL).}

70-70: Documentation could be clearer about max_depth calculation

The automatic depth calculation formula should be more explicit.

-\item{max_depth}{Max tree depth to use in isolation forest when calculateAnomalyScores=TRUE. Default is "auto" which calculates depth as log2(N) where N is the number of runs. Otherwise must be an integer.}
+\item{max_depth}{Max tree depth to use in isolation forest when calculateAnomalyScores=TRUE. Default is "auto" which calculates depth as log2(N) where N is the number of observations per PSM. Otherwise must be a positive integer.}
R/MSstatsConvert_core_functions.R (3)

325-325: Fix documentation formatting

The documentation comment has inconsistent formatting. Missing space after "model" and before "not".

-#' @param anomaly_metrics character vector of names of columns with quality metrics. Default is missing and is not required if anomaly model not run.
+#' @param anomaly_metrics character vector of names of columns with quality metrics. Default is missing and is not required if anomaly model is not run.

564-567: Use column selection by reference for efficiency

The current approach creates a character vector of column names and then uses .. prefix. Consider using .SD and .SDcols for better performance and clarity.

-    subset_cols = subset_cols[subset_cols %in% names(input)]
-    input = input[, ..subset_cols]
+    subset_cols = subset_cols[subset_cols %in% names(input)]
+    input = input[, .SD, .SDcols = subset_cols]

531-531: Documentation: Add @importFrom directive

The function uses Rcpp through .Call but the directive @useDynLib is included. Consider also adding @importFrom Rcpp sourceCpp if the package uses Rcpp modules.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8e38072 and 9ebd83e.

⛔ Files ignored due to path filters (5)
  • inst/tinytest/raw_data/Spectronaut/annotation.csv is excluded by !**/*.csv
  • inst/tinytest/raw_data/Spectronaut/spectronaut_quality_input.csv is excluded by !**/*.csv
  • src/MSstatsConvert.so is excluded by !**/*.so
  • src/RcppExports.o is excluded by !**/*.o
  • src/isolation_forest.o is excluded by !**/*.o
📒 Files selected for processing (42)
  • DESCRIPTION (3 hunks)
  • NAMESPACE (3 hunks)
  • R/MSstatsConvert_core_functions.R (8 hunks)
  • R/RcppExports.R (1 hunks)
  • R/clean_Spectronaut.R (1 hunks)
  • R/converters_SpectronauttoMSstatsFormat.R (4 hunks)
  • R/utils_MSstatsConvert.R (1 hunks)
  • R/utils_annotation.R (1 hunks)
  • R/utils_anomaly_score.R (1 hunks)
  • R/utils_balanced_design.R (4 hunks)
  • R/utils_checks.R (1 hunks)
  • R/utils_classes.R (1 hunks)
  • R/utils_clean_features.R (5 hunks)
  • R/utils_data_health.R (1 hunks)
  • R/utils_dt_operations.R (1 hunks)
  • R/utils_filtering.R (1 hunks)
  • inst/tinytest/test_cleanRaw.R (1 hunks)
  • inst/tinytest/test_converters_SpectronauttoMSstatsFormat.R (1 hunks)
  • inst/tinytest/test_utils_anomaly_score.R (1 hunks)
  • inst/tinytest/test_utils_data_health.R (1 hunks)
  • man/CheckDataHealth.Rd (1 hunks)
  • man/MSstatsAnomalyScores.Rd (1 hunks)
  • man/MSstatsBalancedDesign.Rd (2 hunks)
  • man/MSstatsClean.Rd (2 hunks)
  • man/MSstatsConvert.Rd (2 hunks)
  • man/MSstatsPreprocess.Rd (2 hunks)
  • man/SpectronauttoMSstatsFormat.Rd (3 hunks)
  • man/dot-MSstatsFormat.Rd (1 hunks)
  • man/dot-cleanByFeature.Rd (2 hunks)
  • man/dot-cleanRawSpectronaut.Rd (1 hunks)
  • man/dot-filterByPattern.Rd (1 hunks)
  • man/dot-filterFewMeasurements.Rd (1 hunks)
  • man/dot-getFullDesign.Rd (1 hunks)
  • man/dot-makeBalancedDesign.Rd (1 hunks)
  • man/dot-mergeAnnotation.Rd (1 hunks)
  • man/dot-selectMSstatsColumns.Rd (1 hunks)
  • man/dot-summarizeMultipleMeasurements.Rd (1 hunks)
  • man/dot-validateMSstatsConverterParameters.Rd (1 hunks)
  • src/Makevars (1 hunks)
  • src/Makevars.win (1 hunks)
  • src/RcppExports.cpp (1 hunks)
  • src/isolation_forest.cpp (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/RcppExports.cpp (1)
src/isolation_forest.cpp (2)
  • calculate_anomaly_score (167-202)
  • calculate_anomaly_score (167-171)
src/isolation_forest.cpp (1)
src/RcppExports.cpp (3)
  • df (19-19)
  • max_depth (21-21)
  • n_trees (20-20)
🪛 Gitleaks (8.27.2)
R/RcppExports.R

[high] 2-2: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (27)
R/utils_MSstatsConvert.R (1)

15-15: Confirm intent: marking the package-level help page as internal hides it from users.

If unintended, drop the internal keyword to keep ?MSstatsConvert visible.

-#' @keywords internal
man/MSstatsConvert.Rd (1)

3-6: DocType/alias changes look fine

R/utils_balanced_design.R (1)

4-6: Signature and docs addition look good

Also applies to: 9-9

NAMESPACE (3)

30-30: Rcpp import matches the Rcpp glue — OK

This pairs correctly with LinkingTo/Imports in DESCRIPTION and the RcppExports.* files. No action needed.


44-44: useDynLib with .registration = TRUE — correct

Matches Rcpp routine registration; keeps symbol lookup strict. Approved.


9-9: Approve export of MSstatsAnomalyScores — all wiring and documentation verified.
Function definition exists, roxygen @export, NAMESPACE export, and an Rd page are all present.

DESCRIPTION (2)

66-66: New utils files in Collate — OK

utils_anomaly_score.R and utils_data_health.R are listed; aligns with new exports.

Also applies to: 71-71


33-35: LinkingTo: Rcpp — Makevars/Makevars.win already set PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS), so compiler and OpenMP flags are in place.

R/RcppExports.R (1)

4-6: calculate_anomaly_score wrapper: all call sites supply max_depth — a search of R/**/*.R found only the call in R/utils_anomaly_score.R passing three arguments (single_psm, n_trees, max_depth), so no convenience wrapper is needed.

R/utils_classes.R (1)

8-8: Param doc reads well

Clear description of anomaly_metrics. Approved.

man/dot-MSstatsFormat.Rd (1)

7-13: Docs align with code — good

Usage and argument docs for anomaly_metrics are consistent with R/utils_classes.R.

man/dot-cleanByFeature.Rd (1)

7-12: Anomaly-metrics propagation docs — LGTM

Signature and argument description are clear and backward compatible.

Also applies to: 22-23

man/MSstatsBalancedDesign.Rd (1)

13-15: Defaults and propagation of anomaly_metrics verified. Parameter is correctly forwarded to both .makeBalancedDesign and .selectMSstatsColumns.

src/RcppExports.cpp (1)

13-24: Registration wiring verified
NAMESPACE contains useDynLib(MSstatsConvert, .registration = TRUE), R/RcppExports.R calls .Call("_MSstatsConvert_calculate_anomaly_score",…), and the C++ signature in src/isolation_forest.cpp matches exactly. No changes needed.

R/clean_Spectronaut.R (1)

8-11: No upstream .cleanRawSpectronaut call sites detected
rg -nP '\.cleanRawSpectronaut\s*\(' --type=r returned no matches; confirm any external or downstream callers before merging.

inst/tinytest/test_utils_data_health.R (1)

205-235: Good coverage of skewness edge cases.

The skewed, uniform, and mixed-score scenarios are helpful for robustness.

R/utils_checks.R (3)

80-113: Parameter defaults/readability LGTM.

Centralized defaults and modifyList() merge are clear and maintainable.


81-121: max_depth “auto” is correctly converted: .runAnomalyModel checks for max_depth == "auto" and sets it to round(log2(nrow(...))) before calling calculate_anomaly_score (R/utils_anomaly_score.R 185–187).


249-251: Write-permission check has operator-precedence bug.

!file.access(..) == 0 is always FALSE. Compare directly against 0.

-        if (!file.access(log_dir, mode = 2) == 0) {
-            stop("No write permission for log file directory: ", log_dir)
-        }
+        if (file.access(log_dir, mode = 2) != 0) {
+            stop("No write permission for log file directory: ", log_dir)
+        }

Likely an incorrect or invalid review comment.

src/isolation_forest.cpp (1)

198-201: Column names in return value: confirm downstream expectations.

Downstream code must expect avg_depth and anomaly_score. Verify R wrappers/tests use these exact names.

R/utils_anomaly_score.R (1)

197-197: Anomaly score clipping uses incorrect function

Using pmax with 0.001 as minimum value, but the comment says "clip to stop them from exploding" which suggests you want to limit maximum values, not minimum.

Please verify the intended behavior. If you want to limit maximum values, use pmin. If you want to ensure minimum values (e.g., avoid zeros), the current code is correct but the comment should be updated:

     model_results = unlist(model_results)
-    # Clip anomaly scores to stop them from exploding
+    # Ensure minimum anomaly score of 0.001 to avoid zero values
     input_data$AnomalyScores = pmax(model_results, .001)
man/SpectronauttoMSstatsFormat.Rd (1)

18-26: Validation for anomaly detection parameters is already implemented
The helper .validateMSstatsConverterParameters enforces that when calculateAnomalyScores=TRUE, anomalyModelFeatures is non-empty, anomalyModelFeatureTemporal (if provided) matches length and allowed values, and validates n_trees, max_depth, anomalyModelFeatureCount, removeMissingFeatures, and runOrder.

R/MSstatsConvert_core_functions.R (3)

367-367: LGTM! Clean integration of anomaly metrics

The addition of the anomaly_metrics parameter with a sensible default value integrates well with the existing function signature.


384-387: Parameter propagation looks correct

The anomaly_metrics parameter is correctly passed through to the internal .cleanByFeature function, maintaining the data flow for anomaly detection features.


412-412: LGTM! Consistent propagation of anomaly metrics through the pipeline

The anomaly_metrics parameter is correctly added to MSstatsBalancedDesign and properly propagated through to .makeBalancedDesign and .MSstatsFormat functions, ensuring the anomaly detection data flows through the entire preprocessing pipeline.

Also applies to: 429-429, 442-442, 452-452

R/converters_SpectronauttoMSstatsFormat.R (2)

48-74: LGTM! Comprehensive parameter validation setup

The validation configuration object comprehensively captures all function parameters for validation, which is a good practice for maintaining data integrity.


79-79: Add validation for standardized feature names
After anomalyModelFeatures = .standardizeColnames(anomalyModelFeatures), verify they’re present in the data—for example:

missing_feats <- setdiff(anomalyModelFeatures, colnames(input))
if (length(missing_feats)) {
  stop("Unknown anomalyModelFeatures after standardization: ",
       paste(missing_feats, collapse=", "))
}

Comment on lines +111 to +127
# runOrder not provided
expect_error(SpectronauttoMSstatsFormat(
spectronaut_raw,
annotation = annotation,
calculateAnomalyScores = TRUE,
anomalyModelFeatures = c("FG.ShapeQualityScore (MS2)",
"FG.ShapeQualityScore (MS1)",
"EG.ApexRT"),
anomalyModelFeatureTemporal = c("mean_decrease",
"mean_increase",
"dispersion_increase"),
removeMissingFeatures = 0.5,
anomalyModelFeatureCount = 100,
runOrder = NULL,
n_trees = 100,
max_depth = "auto"
))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

annotation is referenced but not defined in this test file

The error-path tests pass annotation = annotation without creating annotation. Define a minimal annotation data.frame before these blocks to avoid NameError.

 # Spectronaut quality features error handling -------------------------------
 
 # runOrder not provided
+annotation <- data.table::data.table(
+  Run = unique(spectronaut_raw$R.FileName),
+  Condition = spectronaut_raw$R.Condition[match(unique(spectronaut_raw$R.FileName), spectronaut_raw$R.FileName)],
+  BioReplicate = seq_along(unique(spectronaut_raw$R.FileName))
+)
 expect_error(SpectronauttoMSstatsFormat(
🤖 Prompt for AI Agents
In inst/tinytest/test_converters_SpectronauttoMSstatsFormat.R around lines
111–127, the test calls expect_error(..., annotation = annotation) but no
annotation object is defined; create a minimal annotation data.frame immediately
before these test blocks (or at top of the file) that includes the columns the
SpectronauttoMSstatsFormat function expects (e.g., run identifier and any
required metadata such as Run/Run.ID, Condition, BioReplicate) with appropriate
types and at least one row so the tests can reference annotation without raising
a NameError.

Comment on lines +8 to +16
anomaly_output = MSstatsConvert:::.runAnomalyModel(df,
n_trees=100,
max_depth="auto",
cores=1,
split_column="PSM",
quality_metrics=c("QualityMetric.mean_increase",
"QualityMetric.mean_decrease",
"QualityMetric.dispersion_increase"))
return(anomaly_output)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Seed the RNG to prevent flaky tests.

Isolation Forest is stochastic; many assertions depend on ranking/correlation. Set a seed inside the helper.

 run_quality_metrics = function(df, mean_increase, mean_decrease, dispersion_increase) {
+    set.seed(42)
     df$QualityMetric.mean_increase = mean_increase
     df$QualityMetric.mean_decrease = mean_decrease
     df$QualityMetric.dispersion_increase = dispersion_increase
     anomaly_output = MSstatsConvert:::.runAnomalyModel(df, 
-                                                        n_trees=100, 
+                                                        n_trees=100, 
                                                         max_depth="auto", 
                                                         cores=1,

If available, prefer passing a deterministic seed/random_state into the model instead of globally seeding.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
anomaly_output = MSstatsConvert:::.runAnomalyModel(df,
n_trees=100,
max_depth="auto",
cores=1,
split_column="PSM",
quality_metrics=c("QualityMetric.mean_increase",
"QualityMetric.mean_decrease",
"QualityMetric.dispersion_increase"))
return(anomaly_output)
run_quality_metrics = function(df, mean_increase, mean_decrease, dispersion_increase) {
set.seed(42)
df$QualityMetric.mean_increase = mean_increase
df$QualityMetric.mean_decrease = mean_decrease
df$QualityMetric.dispersion_increase = dispersion_increase
anomaly_output = MSstatsConvert:::.runAnomalyModel(
df,
n_trees = 100,
max_depth = "auto",
cores = 1,
split_column = "PSM",
quality_metrics = c(
"QualityMetric.mean_increase",
"QualityMetric.mean_decrease",
"QualityMetric.dispersion_increase"
)
)
return(anomaly_output)
}
🤖 Prompt for AI Agents
In inst/tinytest/test_utils_anomaly_score.R around lines 8 to 16, the call to
MSstatsConvert:::.runAnomalyModel is nondeterministic and makes tests flaky;
either seed the RNG locally before calling the helper (e.g. call
set.seed(<fixed>) immediately prior to the .runAnomalyModel call and optionally
save/restore .Random.seed to avoid global side effects), or better, update
.runAnomalyModel to accept a seed/random_state argument and pass that through to
the underlying isolation-forest routine so the model is deterministic; implement
one of these approaches and update the test to use the deterministic option.

Comment on lines +347 to +352
temporal = spectronaut_raw[, .(Condition = unique(Condition),
Run = unique(Run)),
by = BioReplicate]
temporal$Order = seq(1:nrow(temporal))
temporal = temporal[, c("Run", "Order")]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Bug: run_order has duplicated Runs and mismatched lengths.

Using \code{unique(Condition)} per group can return length > 1; assigning both Condition and Run together then building Order yields duplicate Runs with conflicting orders.

Apply this fix to map each Run to a single order:

-temporal = spectronaut_raw[, .(Condition = unique(Condition), 
-                                    Run = unique(Run)), 
-                                by = BioReplicate]
-temporal$Order = seq(1:nrow(temporal))
-temporal = temporal[, c("Run", "Order")]
+temporal = unique(spectronaut_raw[, .(Run)])
+temporal[, Order := seq_len(.N)]

Optionally sort by Run or acquisition time if available before assigning Order.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
temporal = spectronaut_raw[, .(Condition = unique(Condition),
Run = unique(Run)),
by = BioReplicate]
temporal$Order = seq(1:nrow(temporal))
temporal = temporal[, c("Run", "Order")]
# Replace the block starting at line 347 with:
temporal = unique(spectronaut_raw[, .(Run)])
temporal[, Order := seq_len(.N)]

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (4)
R/converters_SpectronauttoMSstatsFormat.R (4)

11-11: Fix grammar and finalize semantics for temporal vector

Change “vector must have as many values” → “must have the same length,” and decide between NULL or FALSE for “no temporal feature engineering” per feature. Prior review requested FALSE; code/docs still use NULL. Align docs, validation, and implementation.

-#' @param anomalyModelFeatureTemporal character vector of temporal direction corresponding to columns passed to anomalyModelFeatures. Values must be one of: `mean_decrease`, `mean_increase`, `dispersion_increase`, or NULL (to perform no temporal feature engineering). Default is empty vector. If calculateAnomalyScores=TRUE, vector must have as many values as anomalyModelFeatures (even if all NULL).
+#' @param anomalyModelFeatureTemporal character vector parallel to anomalyModelFeatures. Each entry must be one of: `mean_decrease`, `mean_increase`, `dispersion_increase`, or `NULL` (no temporal feature engineering for that feature). Default is empty vector. If calculateAnomalyScores=TRUE, the vector must have the same length as anomalyModelFeatures.

If the code actually supports FALSE instead of NULL for “no temporal feature engineering,” update the above accordingly and ensure the validator enforces equal lengths.


14-14: Clarify that run order must be sequential integers

Make the constraint explicit to help users.

-#' @param runOrder Temporal order of MS runs. Should be a two column data.table with columns `Run` and `Order`, where `Run` matches the run name output by Spectronaut and `Order` is an integer. Used to engineer the temporal features defined in anomalyModelFeatureTemporal.
+#' @param runOrder Temporal order of MS runs. Two-column data.table with `Run` and `Order`, where `Run` matches Spectronaut run names and `Order` is a sequential integer (1, 2, 3, ...). Used to engineer the temporal features in anomalyModelFeatureTemporal.

136-141: Wrap anomaly scoring in tryCatch for actionable errors and logs

Surface clear messages and keep logs consistent.

-    if (calculateAnomalyScores){
-        input = MSstatsConvert::MSstatsAnomalyScores(
-            input, anomalyModelFeatures, anomalyModelFeatureTemporal,
-            removeMissingFeatures, anomalyModelFeatureCount, runOrder, n_trees, 
-            max_depth, numberOfCores)
-    }
+    if (calculateAnomalyScores){
+        tryCatch({
+            input <- MSstatsConvert::MSstatsAnomalyScores(
+                input, anomalyModelFeatures, anomalyModelFeatureTemporal,
+                removeMissingFeatures, anomalyModelFeatureCount, runOrder, n_trees,
+                max_depth, numberOfCores)
+        }, error = function(e) {
+            msg <- paste("Failed to calculate anomaly scores:", conditionMessage(e))
+            getOption("MSstatsLog")("ERROR", msg)
+            getOption("MSstatsMsg")("ERROR", msg)
+            stop(msg, call. = FALSE)
+        })
+    }

126-129: Avoid overwriting existing IsotopeLabelType values

columns_to_fill unconditionally sets IsotopeLabelType = "L", risking data loss. Fill only when the column is missing or NA.

-        exact_filtering = list(excluded_quant = excluded_quant_filter),
-        columns_to_fill = list("IsotopeLabelType" = "L"),
+        exact_filtering = list(excluded_quant = excluded_quant_filter),
+        columns_to_fill = NULL,
         anomaly_metrics = anomalyModelFeatures)

Add before MSstatsPreprocess (outside this hunk):

# Preserve existing labels; add default only if missing/NA
if (!"IsotopeLabelType" %in% colnames(input)) {
  input[, IsotopeLabelType := "L"]
} else {
  input[is.na(IsotopeLabelType) | IsotopeLabelType == "", IsotopeLabelType := "L"]
}
🧹 Nitpick comments (3)
R/converters_SpectronauttoMSstatsFormat.R (3)

3-3: Docs vs behavior mismatch for F.ExcludedFromQuantification

Docs say rows “will be removed,” but code config uses behavior="fill" + drop_column, which sets values to NA and drops the column (not guaranteed row removal). Clarify this and mention the standardized column name without dot.

-#' @param input name of Spectronaut output, which is long-format. ProteinName, PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge, IsotopeLabelType, Condition, BioReplicate, Run, Intensity, F.ExcludedFromQuantification are required. Rows with F.ExcludedFromQuantification=True will be removed.
+#' @param input name of Spectronaut output (long-format). Required columns: ProteinName, PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge, IsotopeLabelType, Condition, BioReplicate, Run, Intensity, F.ExcludedFromQuantification (Spectronaut), which is standardized to FExcludedFromQuantification by import. When excludedFromQuantificationFilter=TRUE, intensities with FExcludedFromQuantification=TRUE are set to NA and the column is dropped (rows may be removed downstream).

7-8: Column naming in docs (EG.Qvalue vs EGQvalue)

Standardization removes punctuation; adjust docs to avoid confusion.

-#' @param filter_with_Qvalue FALSE(default) will not perform any filtering. TRUE will filter out the intensities that have greater than qvalue_cutoff in EG.Qvalue column. Those intensities will be replaced with zero and will be considered as censored missing values for imputation purpose.
-#' @param qvalue_cutoff Cutoff for EG.Qvalue. default is 0.01.
+#' @param filter_with_Qvalue FALSE (default) performs no filtering. TRUE filters intensities with EG.Qvalue/EGQvalue > qvalue_cutoff (column name after standardization is EGQvalue). Filtered intensities are replaced with zero and treated as censored missing values for imputation.
+#' @param qvalue_cutoff Cutoff for EG.Qvalue/EGQvalue. Default is 0.01.

48-72: Enforce key invariants in validator (lengths, empties, cores on Windows)

Ensure .validateMSstatsConverterParameters checks:

  • calculateAnomalyScores implies length(anomalyModelFeatures) > 0.
  • If anomalyModelFeatureTemporal provided, its length equals anomalyModelFeatures.
  • If numberOfCores > 1 on Windows, warn and coerce to 1.
# Example checks (either inside .validateMSstatsConverterParameters or here before use)
if (calculateAnomalyScores && length(anomalyModelFeatures) == 0L)
  stop("anomalyModelFeatures must be non-empty when calculateAnomalyScores=TRUE")

if (calculateAnomalyScores && length(anomalyModelFeatureTemporal) > 0L &&
    length(anomalyModelFeatureTemporal) != length(anomalyModelFeatures))
  stop("anomalyModelFeatureTemporal must have the same length as anomalyModelFeatures")

if (.Platform$OS.type == "windows" && numberOfCores > 1L) {
  getOption("MSstatsMsg")("WARN", "Parallel anomaly scoring is not supported on Windows; using 1 core.")
  numberOfCores <- 1L
}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ebd83e and aa1e754.

📒 Files selected for processing (1)
  • R/converters_SpectronauttoMSstatsFormat.R (4 hunks)
🔇 Additional comments (6)
R/converters_SpectronauttoMSstatsFormat.R (6)

35-47: API additions read cleanly; defaults look reasonable

No functional concerns with the new parameters and defaults.


79-80: Good: standardizing feature names early

This avoids downstream mismatches.


84-86: Plumbing anomaly flags into MSstatsClean looks correct

No issues spotted.


105-112: Fix confirmed: excludedFromQuantificationFilter is now honored

The filter flag now respects the function parameter. Nice.


131-135: BalancedDesign wiring for anomaly_metrics is consistent

Looks good.


16-16: Verified: “auto” uses base-2 log2(N) on the number of runs; documentation is accurate.

@devonjkohler devonjkohler merged commit ffd55fb into devel Sep 9, 2025
1 check failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants