-
Notifications
You must be signed in to change notification settings - Fork 5
Feature anomaly model #111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…MSstatsConvert into feature-anomaly_model
…MSstatsConvert into feature-anomaly_model
tests(spectronaut): add unit tests for anomaly model feature and spectronaut
WalkthroughAdds anomaly-detection and data-health capabilities: new Rcpp isolation-forest implementation (with OpenMP and R wrappers), R preprocessing/feature-engineering and validation utilities, propagation of anomaly_metrics through preprocessing and balanced-design, Spectronaut cleaning/converter options to compute AnomalyScores, new public functions MSstatsAnomalyScores and CheckDataHealth, expanded tests and documentation. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant User
participant Converter as SpectronauttoMSstatsFormat
participant Validator as .validateMSstatsConverterParameters
participant Cleaner as MSstatsClean/.cleanRawSpectronaut
participant Preproc as MSstatsPreprocess/MSstatsBalancedDesign
participant AnomPrep as .prepareSpectronautAnomalyInput
participant ModelR as .runAnomalyModel / MSstatsAnomalyScores
participant Cpp as calculate_anomaly_score (Rcpp)
User->>Converter: call(..., calculateAnomalyScores?, anomalyModelFeatures, runOrder, n_trees, max_depth, numberOfCores, excludedFromQuantificationFilter)
Converter->>Validator: validate config
Validator-->>Converter: OK
Converter->>Cleaner: MSstatsClean(intensity, calculateAnomalyScores, anomalyModelFeatures)
Cleaner-->>Converter: cleaned data
Converter->>Preproc: MSstatsPreprocess(..., anomaly_metrics) → MSstatsBalancedDesign(..., anomaly_metrics)
Preproc-->>Converter: balanced data
alt calculateAnomalyScores == TRUE
Converter->>AnomPrep: build PSMs, scale, add temporal features
AnomPrep-->>ModelR: enriched data (split by PSM)
ModelR->>Cpp: per-PSM call calculate_anomaly_score(n_trees, max_depth)
Cpp-->>ModelR: avg_depth, anomaly_score
ModelR-->>Converter: attach AnomalyScores
else
Note over Converter: Skip anomaly scoring
end
Converter-->>User: MSstats-format data (with optional AnomalyScores)
sequenceDiagram
autonumber
participant User
participant Health as CheckDataHealth
participant DH as utils_data_health
User->>Health: CheckDataHealth(input)
Health->>DH: .checkMissing, .checkIntensityDistribution
Health->>DH: .checkFeatureSD -> .checkFeatureOutliers -> .checkFeatureCoverage
opt AnomalyScores present
Health->>DH: .checkAnomalySkew
end
DH-->>Health: metrics, tables
Health-->>User: list(feature_data, skew_results)
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Suggested reviewers
Pre-merge checks (2 warnings, 1 inconclusive)❌ Failed checks (2 warnings, 1 inconclusive)
Poem
✨ Finishing Touches🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 36
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
R/utils_clean_features.R (1)
96-122: All-NA groups get dropped in anomaly path; fix selection to be NA-safe and stablemax(Intensity, na.rm=TRUE) returns -Inf for all-NA groups, causing no join match and silent row loss. Also, make tie-breaking deterministic.
- } else if (length(anomaly_metrics) > 0){ - - input[, row_id := .I] - max_per_group = input[, .(max_intensity = max(Intensity, na.rm = TRUE)), - by = feature_columns] - - input = input[max_per_group, - on = c(feature_columns, "Intensity" = "max_intensity"), - nomatch = 0L, mult = "first"] - input[, row_id := NULL] - + } else if (length(anomaly_metrics) > 0){ + input[, row_id := .I] + # Replace NA with -Inf for selection only; keep original Intensity + input[, Intensity2 := ifelse(is.na(Intensity), -Inf, Intensity)] + max_per_group = input[, .(max_intensity = max(Intensity2)), by = feature_columns] + data.table::setorder(input, row_id) # ensure 'first' is stable + input = input[max_per_group, + on = c(feature_columns, "Intensity2" = "max_intensity"), + nomatch = 0L, mult = "first"] + input[, c("row_id", "Intensity2") := NULL]man/CheckDataHealth.Rd (1)
1-18: Align Rd documentation with implementation
Update the \value section to describe a list with two data.tables—feature_data(containing anOrdercolumn plus moving-window features such asmean_increase,mean_decrease,dispersion_increasegenerated by.add_features) andskew_results(with columnsPSMandskew, the Pearson skewness ofAnomalyScores).
🧹 Nitpick comments (48)
R/utils_filtering.R (2)
93-97: Doc param should state it accepts a vector.Roxygen says “regular expression”; function treats
patternsas a vector. Clarify to “character vector of regular expressions”.Apply in roxygen (this file), then re-run roxygen:
-#' @param patterns chr, regular expression - matching peptides will be +#' @param patterns chr, vector of regular expressions — matching peptides will be
107-114: Vectorize regex filtering for clarity and speed.Current loop builds an AND over negated
grepl. Equivalent, simpler, and faster is a singlegreplwith an OR’d pattern.- pattern_filter = rep(TRUE, nrow(input)) - for (pattern in patterns) { - pattern_filter = pattern_filter & !grepl(pattern, input[[col_name]]) - } + if (length(patterns)) { + or_pat <- paste0("(", paste(patterns, collapse = ")|("), ")") + pattern_filter = !grepl(or_pat, input[[col_name]]) + } else { + pattern_filter = rep(TRUE, nrow(input)) + }src/Makevars.win (1)
1-2: Use SHLIB_OPENMP_CFLAGS in PKG_LIBS and place it last (per R-exts guidance).Improves cross-platform linking; keep BLAS/LAPACK/FLIBS before the OpenMP flag.
PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) -PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS) $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) +PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) $(SHLIB_OPENMP_CFLAGS)man/dot-filterFewMeasurements.Rd (1)
23-23: Clarify type: use “character vector” and Rd code markup.“chr” is informal; prefer Rd-consistent wording.
-\item{feature_columns}{chr, vector of names of columns that define features.} +\item{feature_columns}{\code{character} vector of names of columns that define features.}man/dot-getFullDesign.Rd (1)
18-18: Use Rd markup instead of backticks; consider argument order alignment.Switch to \code{}; optional: list arguments in the same order as in usage().
-\item{`feature_col`}{name of the column that labels features} +\item{\code{feature_col}}{name of the column that labels features}src/Makevars (1)
1-2: Mirror OpenMP linking change on non-Windows builds.Place OpenMP flag last and use SHLIB_OPENMP_CFLAGS for linking.
PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) -PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS) $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) +PKG_LIBS = $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS) $(SHLIB_OPENMP_CFLAGS)man/MSstatsPreprocess.Rd (1)
61-61: Align default description with signature.Default is c(), not “missing”.
-\item{anomaly_metrics}{character vector of names of columns with quality metrics. Default is missing and is not required if anomaly model not run.} +\item{anomaly_metrics}{character vector of names of columns with quality metrics. Default is \code{c()} and is not required unless the anomaly model is run.}man/MSstatsConvert.Rd (1)
21-34: Confirm intent to mark package doc as internal and verify author emails
- \keyword{internal} will hide the package page from typical help indices; confirm that’s intended.
- Please verify author emails (e.g., “wu.anthon@…” looks truncated).
man/dot-summarizeMultipleMeasurements.Rd (1)
7-12: Clarify behavior when anomaly_metrics is providedWhen anomaly_metrics ≠ c(), implementation selects the single row per group with max Intensity (not aggregator()). Please note this explicitly in Description/Arguments to avoid confusion.
Also applies to: 21-22
R/utils_balanced_design.R (2)
19-24: Guard against row blow-up when merging anomaly metrics in fill_missing pathIf multiple rows exist per intensity_ids (same feature/run/channel/etc.) with differing anomaly metrics, merge may duplicate all_possibilities. Deduplicate per intensity_ids before the merge.
- intensities = intersect(c(intensity_ids, "Intensity", - "isZero", anomaly_metrics), - colnames(input)) - input = merge(all_possibilities, - unique(input[, intensities, with = FALSE]), - all.x = TRUE, by = intensity_ids) + intensities = intersect(c(intensity_ids, "Intensity", "isZero", anomaly_metrics), + colnames(input)) + intens_dt = unique(input[, intensities, with = FALSE]) + # ensure uniqueness per key to avoid cartesian expansion + intens_dt = intens_dt[ , .SD[1L], by = intensity_ids] + input = merge(all_possibilities, intens_dt, all.x = TRUE, by = intensity_ids)Also applies to: 44-49
68-71: Roxygen param names shouldn’t be backtickedUse @param feature_col and @param measurement_col (without backticks) so roxygen maps them correctly.
-#' @param `feature_col` name of the column that labels features -#' @param `measurement_col` name of a column with measurement labels - Runs in +#' @param feature_col name of the column that labels features +#' @param measurement_col name of a column with measurement labels - Runs inman/dot-makeBalancedDesign.Rd (1)
12-16: Minor docs polishCombine the anomaly_metrics sentence; “and marked as NA” reads detached here.
-\item{anomaly_metrics}{character vector of quality metric column names to be -used as features in an anomaly detection model. -and marked as NA} +\item{anomaly_metrics}{character vector of quality-metric column names to be +used as features in an anomaly detection model; missing values are marked as \code{NA}.}DESCRIPTION (2)
27-29: Rcpp and stringi in Imports — OK; consider SystemRequirements for C++/OpenMPGiven the new C++ isolation forest (and OpenMP per PR description), declare SystemRequirements to keep CRAN/Bioc checks clean.
Encoding: UTF-8 LazyData: true Roxygen: list(markdown = TRUE) RoxygenNote: 7.3.2 biocViews: MassSpectrometry, Proteomics, Software, DataImport, QualityControl +SystemRequirements: C++11, OpenMP
51-51: Collate includes RcppExports.R — optional ordering tweakRcppExports.R can be placed early in Collate for clarity, though not required if no top-level calls happen before load.
R/RcppExports.R (1)
1-2: Gitleaks false positive on generator tokenThe generator token line is auto-generated by Rcpp; it’s not a secret. If your CI flags it, add an allowlist rule for this pattern.
Example gitleaks allowlist:
[rules.allowlist] regexTarget = "match" regexes = ['10BE3573-1514-4C36-9D1C-5A225CD40393']R/utils_classes.R (1)
12-14: Threading anomaly_metrics into column selection — OK; validate names earlyLooks correct. Consider validating that all anomaly_metrics exist in input and warning on drops to aid users.
Example (in .selectMSstatsColumns or here):
missing <- setdiff(anomaly_metrics, names(input)) if (length(missing)) warning("Unknown anomaly_metrics dropped: ", paste(missing, collapse = ", "))man/dot-cleanRawSpectronaut.Rd (1)
7-12: Document defaults for new paramsIf the implementation uses defaults (e.g., calculateAnomalyScores = FALSE, anomalyModelFeatures = NULL), surface them in usage to avoid ambiguity.
-.cleanRawSpectronaut( - msstats_object, - intensity, - calculateAnomalyScores, - anomalyModelFeatures -) +.cleanRawSpectronaut( + msstats_object, + intensity, + calculateAnomalyScores = FALSE, + anomalyModelFeatures = NULL +)Also ensure these defaults are present in R/clean_Spectronaut.R roxygen.
man/CheckDataHealth.Rd (3)
5-5: Clarify scope and how to satisfy the anomaly-model requirement.Avoid tool-specific phrasing and tell users how to fit/provide the model (e.g., via MSstatsAnomalyScores) so the prerequisite is actionable.
Apply this doc tweak:
-\title{Takes as input the output of the SpectronauttoMSstatsFormat function and calculates various quality metrics to assess the health of the data. Requires Anomaly Detection model to be fit.} +\title{Compute data-health metrics for MSstats-formatted input. Requires an anomaly-detection model (e.g., computed via \code{MSstatsAnomalyScores}) to be available.}-\description{ -Takes as input the output of the SpectronauttoMSstatsFormat function and calculates various quality metrics to assess the health of the data. Requires Anomaly Detection model to be fit. -} +\description{ +Takes as input an MSstats-formatted table (e.g., from \code{SpectronauttoMSstatsFormat}) and calculates quality metrics to assess data health. Requires that anomaly scores have been computed (e.g., via \code{MSstatsAnomalyScores}) or provided in the input. +}Also applies to: 16-16
10-11: Broaden and specify the input contract.Make clear it accepts generic MSstats-format, not only Spectronaut, and state required columns.
-\item{input}{MSstats input which is the output of Spectronaut converter} +\item{input}{A \code{data.table} in MSstats format (e.g., from \code{SpectronauttoMSstatsFormat}); must contain standard MSstats columns and, if available, anomaly-score columns.}
12-14: Document the returned list elements explicitly.Naming the two elements reduces ambiguity for downstream users/tests.
-\value{ -list of two data.tables -} +\value{ +A named list of two \code{data.table}s: +\itemize{ + \item \code{feature_data}: per-feature metrics (e.g., coverage, dispersion, outliers). + \item \code{skew_results}: distributional summaries (e.g., skewness of \code{AnomalyScores}). +} +}man/MSstatsBalancedDesign.Rd (1)
39-40: Explain how anomaly_metrics are used.State that these columns are carried through/considered in selection and any constraints.
-\item{anomaly_metrics}{character vector of names of columns with quality metrics} +\item{anomaly_metrics}{Character vector of column names with quality/peak metrics (e.g., anomaly scores) to retain and propagate during balancing and feature selection. Defaults to \code{character(0)} (none).}man/MSstatsClean.Rd (1)
115-118: Tighten argument docs and interaction-\item{calculateAnomalyScores}{logical, whether to calculate anomaly scores} +\item{calculateAnomalyScores}{Logical; if \code{TRUE}, compute anomaly scores during cleaning (default: \code{FALSE}).} -\item{anomalyModelFeatures}{character vector, specifies which columns will be used for anomaly detection model. Can be NULL if calculateAnomalyScores=FALSE.} +\item{anomalyModelFeatures}{Character vector of column names used as features by the anomaly-detection model (ignored when \code{calculateAnomalyScores = FALSE}). Typically includes intensity or quality metrics per PSM/fragment.}inst/tinytest/test_utils_anomaly_score.R (6)
103-113: Correlation threshold is brittle without determinism.With seeding, this is fine; otherwise lower the threshold or switch to Spearman on ranks.
-correlation = cor(scores1$AnomalyScores, scores2$AnomalyScores) -expect_true(correlation > 0.95, +correlation = cor(scores1$AnomalyScores, scores2$AnomalyScores, method = "spearman") +expect_true(correlation > 0.9, info = "Multiple runs should produce correlated results")
296-303: Use explicit tolerances for floating-point equality.Be robust to minor numeric differences.
-expect_equal( +expect_equal( mean_increase, c(0.00, 0.00, 0.26, 0.00, 0.00, 0.00, 0.50, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.55, 0.87, 1.11, 1.42, 1.71, 2.94), - info = "Mean increase calculation should match expected values" + tolerance = 1e-8, + info = "Mean increase calculation should match expected values" )-expect_equal( +expect_equal( mean_decrease, c(0.00, 0.00, 0.00, 0.29, 0.50, 0.56, 0.00, 1.16, 0.99, 0.15, 0.00, 0.00, 1.27, 1.84, 0.29, 0.00, 0.00, 0.00, 0.00, 0.00), - info = "Mean decrease calculation should match expected values" + tolerance = 1e-8, + info = "Mean decrease calculation should match expected values" )Also applies to: 305-313
243-245: Remove noisy console output in tests.Printing can clutter CI logs; prefer assertions only.
-cat("Top 5 anomalous rows:", order(regression_scores$AnomalyScores, decreasing = TRUE)[1:5], "\n") -cat("Score range:", range(regression_scores$AnomalyScores), "\n") +invisible(NULL)-cat("All anomaly detection tests completed successfully!\n") +invisible(NULL)Also applies to: 324-324
275-282: Fix mismatched comment about expected ranking.The data make Row 6 highest, Row 5 second, Row 4 third.
-# Row 5 should have highest score, Row 4 second highest, etc. +# Expected: Row 6 > Row 5 > Row 4 > Rows 1–3
8-16: Prefer public APIs in tests where possible.Relying on non-exported functions via \code{:::} is brittle. Add at least one integration test using \code{MSstatsAnomalyScores} to validate end-to-end behavior; keep internals for unit tests.
I can draft a tinytest block invoking MSstatsAnomalyScores on a minimal dataset and asserting expected ordering—want me to add it?
Also applies to: 292-295
9-11: Optional: speed up CI by reducing trees.For tiny synthetic data, fewer trees should preserve ordering and cut runtime.
- anomaly_output = MSstatsConvert:::.runAnomalyModel(df, - n_trees=100, + anomaly_output = MSstatsConvert:::.runAnomalyModel(df, + n_trees=50, max_depth="auto",man/MSstatsAnomalyScores.Rd (2)
24-36: Specify allowed values and semantics for temporal_directionList the permitted values (e.g., mean_increase, mean_decrease, dispersion_increase), required length = length(quality_metrics), and what each produces. Add validation notes if mismatched.
38-43: Return value is underspecifiedDocument the returned columns (e.g., AnomalyScores, avg_depth), their ranges, and how they map back to input rows/runs. This helps downstream users/tests.
man/dot-validateMSstatsConverterParameters.Rd (1)
18-45: Tighten parameter contracts (types and ranges) in docs to match validationBe explicit:
- n_trees: integer > 0
- max_depth: integer > 0 or "auto"
- removeMissingFeatures: 0 ≤ x ≤ 1 (fraction)
- anomalyModelFeatureCount: integer ≥ 1
- numberOfCores: integer ≥ 1
- anomalyModelFeatures/anomalyModelFeatureTemporal: non-empty; temporal values from a fixed set
Update details to reflect these constraints so tests and users know expected types.inst/tinytest/test_cleanRaw.R (2)
198-204: Also assert presence of the new columns, not just column countCounting columns can pass while names drift. Add checks for EGQvalue/PGQvalue, FPossibleInterference, FExcludedFromQuantification when calculateAnomalyScores = FALSE.
expect_equal(ncol(sn_cleaned), 12) expect_true(nrow(sn_cleaned) > 0) +expect_true(all(c("EGQvalue","FPossibleInterference","FExcludedFromQuantification") %in% colnames(sn_cleaned)) || + all(c("PGQvalue","FPossibleInterference","FExcludedFromQuantification") %in% colnames(sn_cleaned)))
205-221: Strengthen anomaly-feature assertionsVerify that requested anomalyModelFeatures are actually present post-cleaning (not just column count).
expect_equal(ncol(sn_cleaned_2), 16) expect_true(nrow(sn_cleaned_2) > 0) +expect_true(all(c("FGShapeQualityScore(MS2)", + "FGShapeQualityScore(MS1)", + "EGApexRT") %in% colnames(sn_cleaned_2)))src/RcppExports.cpp (1)
13-24: Optional: avoid copying DataFrame at the C++ boundaryIf performance becomes a concern, change calculate_anomaly_score signature to
const DataFrame&and regenerate Rcpp exports. Not urgent.inst/tinytest/test_converters_SpectronauttoMSstatsFormat.R (1)
60-88: Minor: make temporal ordering deterministic across environmentsUsing seq(1:nrow(temporal)) is fine, but rely on an explicit order (e.g., by Run) to avoid potential nondeterminism from unique().
-temporal$Order = seq(1:nrow(temporal)) +data.table::setorder(temporal, Run) +temporal$Order = seq_len(nrow(temporal))R/clean_Spectronaut.R (2)
19-26: Consistency: if FExcludedFromQuantification is required, don’t treat it as optional.validateSpectronautInput requires FExcludedFromQuantification. Using .findAvailable for exclude_col implies optional. Either remove it from required list or include directly here without .findAvailable.
- exclude_col = .findAvailable(c("FExcludedFromQuantification"), - colnames(spec_input)) + exclude_col = "FExcludedFromQuantification"
23-27: Broaden EG Q-value alias handlingOlder Spectronaut exports sometimes use EG.Q.Value. Consider .findAvailable(c("EGQvalue","EG.Q.Value")).
- cols = c("PGProteinGroups", "EGModifiedSequence", "FGCharge", "FFrgIon", - f_charge_col, "RFileName", "RCondition", "RReplicate", - "EGQvalue", pg_qval_col, interference_col, exclude_col, + eg_qval_col = .findAvailable(c("EGQvalue","EG.Q.Value"), colnames(spec_input)) + cols = c("PGProteinGroups", "EGModifiedSequence", "FGCharge", "FFrgIon", + f_charge_col, "RFileName", "RCondition", "RReplicate", + eg_qval_col, pg_qval_col, interference_col, exclude_col, paste0("F", intensity))inst/tinytest/test_utils_data_health.R (2)
126-130: Make the missing-coverage assertion deterministic.Using result_missing$Feature[1] is order-dependent. Select the known Feature you NAd to avoid flaky tests.
-first_feature = result_missing$Feature[1] -first_feature_missing = result_missing[Feature == first_feature]$percent_missing +target_feature = paste("AEFEEQNVR", "2", "y3", "1", sep="_") +first_feature_missing = result_missing[Feature == target_feature]$percent_missing
161-164: Remove debug note and clarify finite skew expectation.The “(FAIL)” note is misleading. If there’s any per-PSM uniformity upstream, skew can be NaN. Either remove the note or widen the check to match your uniform-case logic.
-# Check that skewness values are numeric and finite (FAIL) -expect_true(all(is.numeric(result$skew)), info = ".checkAnomalySkew returns numeric skewness values") -expect_true(all(is.finite(result$skew)), info = ".checkAnomalySkew returns finite skewness values") +# Check that skewness values are numeric and well-formed +expect_true(is.numeric(result$skew), info = ".checkAnomalySkew returns numeric skewness values") +expect_true(all(is.finite(result$skew) | is.nan(result$skew)), + info = ".checkAnomalySkew returns finite/NaN skewness depending on variance")R/utils_checks.R (1)
229-231: Be permissive for integer-like run order and avoid partial recycling.is.numeric() excludes integer64/factors; assertNumber on the vector is clearer. Also enforce no NA.
- if (!is.numeric(config$runOrder$Order)) { - stop("runOrder$Order must be numeric") - } + checkmate::assertNumeric(config$runOrder$Order, any.missing = FALSE, all.missing = FALSE)src/isolation_forest.cpp (1)
15-15: Fixed RNG seed hurts reproducibility across sessions and users.Either accept a seed from R or draw from R’s RNG via R::rnorm/Runif or pass a seed parameter.
R/utils_data_health.R (1)
41-41: Magic number 0.5 should be parameterizedThe threshold value 0.5 for high ratio detection is hardcoded. Consider making this configurable for flexibility.
Consider adding a parameter to make the threshold configurable:
-.checkFeatureSD = function(input){ +.checkFeatureSD = function(input, ratio_threshold = 0.5){ sd_data = input[, .(sd_Intensity = sd(log2(Intensity), na.rm = TRUE), mean_Intensity = mean(log2(Intensity), na.rm = TRUE), ratio = sd(log2(Intensity), na.rm = TRUE) / mean( log2(Intensity), na.rm = TRUE)), by = Feature] - high_ratio = nrow(sd_data[sd_data$ratio > .5]) / nrow(sd_data) + high_ratio = nrow(sd_data[sd_data$ratio > ratio_threshold]) / nrow(sd_data)R/utils_anomaly_score.R (1)
10-15: Consider using data.table syntax more efficientlyThe Fragment and PSM construction could be more efficient using data.table's
:=operator.- input$Fragment = paste(input$PeptideSequence, - input$PrecursorCharge, - input$FragmentIon, - input$ProductCharge, sep="_") - input$PSM = paste(input$PeptideSequence, - input$PrecursorCharge, sep="_") + input[, `:=`( + Fragment = paste(PeptideSequence, PrecursorCharge, + FragmentIon, ProductCharge, sep="_"), + PSM = paste(PeptideSequence, PrecursorCharge, sep="_") + )]man/SpectronauttoMSstatsFormat.Rd (2)
60-60: Documentation inconsistency for anomalyModelFeatureTemporalThe documentation mentions possible values but doesn't clarify if multiple temporal features can be applied to the same quality metric.
Consider clarifying the documentation:
-\item{anomalyModelFeatureTemporal}{character vector of temporal direction corresponding to columns passed to anomalyModelFeatures. Values must be one of: \code{mean_decrease}, \code{mean_increase}, \code{dispersion_increase}, or NULL (to perform no temporal feature engineering). Default is empty vector. If calculateAnomalyScores=TRUE, vector must have as many values as anomalyModelFeatures (even if all NULL).} +\item{anomalyModelFeatureTemporal}{character vector of temporal direction corresponding to columns passed to anomalyModelFeatures. Each value must be one of: \code{mean_decrease}, \code{mean_increase}, \code{dispersion_increase}, or NULL (to perform no temporal feature engineering). Default is empty vector. If calculateAnomalyScores=TRUE, vector must have exactly as many values as anomalyModelFeatures, with a one-to-one correspondence (even if all NULL).}
70-70: Documentation could be clearer about max_depth calculationThe automatic depth calculation formula should be more explicit.
-\item{max_depth}{Max tree depth to use in isolation forest when calculateAnomalyScores=TRUE. Default is "auto" which calculates depth as log2(N) where N is the number of runs. Otherwise must be an integer.} +\item{max_depth}{Max tree depth to use in isolation forest when calculateAnomalyScores=TRUE. Default is "auto" which calculates depth as log2(N) where N is the number of observations per PSM. Otherwise must be a positive integer.}R/MSstatsConvert_core_functions.R (3)
325-325: Fix documentation formattingThe documentation comment has inconsistent formatting. Missing space after "model" and before "not".
-#' @param anomaly_metrics character vector of names of columns with quality metrics. Default is missing and is not required if anomaly model not run. +#' @param anomaly_metrics character vector of names of columns with quality metrics. Default is missing and is not required if anomaly model is not run.
564-567: Use column selection by reference for efficiencyThe current approach creates a character vector of column names and then uses
..prefix. Consider using.SDand.SDcolsfor better performance and clarity.- subset_cols = subset_cols[subset_cols %in% names(input)] - input = input[, ..subset_cols] + subset_cols = subset_cols[subset_cols %in% names(input)] + input = input[, .SD, .SDcols = subset_cols]
531-531: Documentation: Add @importFrom directiveThe function uses Rcpp through
.Callbut the directive@useDynLibis included. Consider also adding@importFrom Rcpp sourceCppif the package uses Rcpp modules.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (5)
inst/tinytest/raw_data/Spectronaut/annotation.csvis excluded by!**/*.csvinst/tinytest/raw_data/Spectronaut/spectronaut_quality_input.csvis excluded by!**/*.csvsrc/MSstatsConvert.sois excluded by!**/*.sosrc/RcppExports.ois excluded by!**/*.osrc/isolation_forest.ois excluded by!**/*.o
📒 Files selected for processing (42)
DESCRIPTION(3 hunks)NAMESPACE(3 hunks)R/MSstatsConvert_core_functions.R(8 hunks)R/RcppExports.R(1 hunks)R/clean_Spectronaut.R(1 hunks)R/converters_SpectronauttoMSstatsFormat.R(4 hunks)R/utils_MSstatsConvert.R(1 hunks)R/utils_annotation.R(1 hunks)R/utils_anomaly_score.R(1 hunks)R/utils_balanced_design.R(4 hunks)R/utils_checks.R(1 hunks)R/utils_classes.R(1 hunks)R/utils_clean_features.R(5 hunks)R/utils_data_health.R(1 hunks)R/utils_dt_operations.R(1 hunks)R/utils_filtering.R(1 hunks)inst/tinytest/test_cleanRaw.R(1 hunks)inst/tinytest/test_converters_SpectronauttoMSstatsFormat.R(1 hunks)inst/tinytest/test_utils_anomaly_score.R(1 hunks)inst/tinytest/test_utils_data_health.R(1 hunks)man/CheckDataHealth.Rd(1 hunks)man/MSstatsAnomalyScores.Rd(1 hunks)man/MSstatsBalancedDesign.Rd(2 hunks)man/MSstatsClean.Rd(2 hunks)man/MSstatsConvert.Rd(2 hunks)man/MSstatsPreprocess.Rd(2 hunks)man/SpectronauttoMSstatsFormat.Rd(3 hunks)man/dot-MSstatsFormat.Rd(1 hunks)man/dot-cleanByFeature.Rd(2 hunks)man/dot-cleanRawSpectronaut.Rd(1 hunks)man/dot-filterByPattern.Rd(1 hunks)man/dot-filterFewMeasurements.Rd(1 hunks)man/dot-getFullDesign.Rd(1 hunks)man/dot-makeBalancedDesign.Rd(1 hunks)man/dot-mergeAnnotation.Rd(1 hunks)man/dot-selectMSstatsColumns.Rd(1 hunks)man/dot-summarizeMultipleMeasurements.Rd(1 hunks)man/dot-validateMSstatsConverterParameters.Rd(1 hunks)src/Makevars(1 hunks)src/Makevars.win(1 hunks)src/RcppExports.cpp(1 hunks)src/isolation_forest.cpp(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/RcppExports.cpp (1)
src/isolation_forest.cpp (2)
calculate_anomaly_score(167-202)calculate_anomaly_score(167-171)
src/isolation_forest.cpp (1)
src/RcppExports.cpp (3)
df(19-19)max_depth(21-21)n_trees(20-20)
🪛 Gitleaks (8.27.2)
R/RcppExports.R
[high] 2-2: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (27)
R/utils_MSstatsConvert.R (1)
15-15: Confirm intent: marking the package-level help page as internal hides it from users.If unintended, drop the internal keyword to keep ?MSstatsConvert visible.
-#' @keywords internalman/MSstatsConvert.Rd (1)
3-6: DocType/alias changes look fineR/utils_balanced_design.R (1)
4-6: Signature and docs addition look goodAlso applies to: 9-9
NAMESPACE (3)
30-30: Rcpp import matches the Rcpp glue — OKThis pairs correctly with LinkingTo/Imports in DESCRIPTION and the RcppExports.* files. No action needed.
44-44: useDynLib with .registration = TRUE — correctMatches Rcpp routine registration; keeps symbol lookup strict. Approved.
9-9: Approve export of MSstatsAnomalyScores — all wiring and documentation verified.
Function definition exists, roxygen@export, NAMESPACE export, and an Rd page are all present.DESCRIPTION (2)
66-66: New utils files in Collate — OKutils_anomaly_score.R and utils_data_health.R are listed; aligns with new exports.
Also applies to: 71-71
33-35: LinkingTo: Rcpp — Makevars/Makevars.win already set PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS), so compiler and OpenMP flags are in place.R/RcppExports.R (1)
4-6: calculate_anomaly_score wrapper: all call sites supplymax_depth— a search ofR/**/*.Rfound only the call inR/utils_anomaly_score.Rpassing three arguments (single_psm, n_trees, max_depth), so no convenience wrapper is needed.R/utils_classes.R (1)
8-8: Param doc reads wellClear description of anomaly_metrics. Approved.
man/dot-MSstatsFormat.Rd (1)
7-13: Docs align with code — goodUsage and argument docs for anomaly_metrics are consistent with R/utils_classes.R.
man/dot-cleanByFeature.Rd (1)
7-12: Anomaly-metrics propagation docs — LGTMSignature and argument description are clear and backward compatible.
Also applies to: 22-23
man/MSstatsBalancedDesign.Rd (1)
13-15: Defaults and propagation ofanomaly_metricsverified. Parameter is correctly forwarded to both .makeBalancedDesign and .selectMSstatsColumns.src/RcppExports.cpp (1)
13-24: Registration wiring verified
NAMESPACE containsuseDynLib(MSstatsConvert, .registration = TRUE), R/RcppExports.R calls.Call("_MSstatsConvert_calculate_anomaly_score",…), and the C++ signature in src/isolation_forest.cpp matches exactly. No changes needed.R/clean_Spectronaut.R (1)
8-11: No upstream .cleanRawSpectronaut call sites detected
rg -nP '\.cleanRawSpectronaut\s*\(' --type=rreturned no matches; confirm any external or downstream callers before merging.inst/tinytest/test_utils_data_health.R (1)
205-235: Good coverage of skewness edge cases.The skewed, uniform, and mixed-score scenarios are helpful for robustness.
R/utils_checks.R (3)
80-113: Parameter defaults/readability LGTM.Centralized defaults and modifyList() merge are clear and maintainable.
81-121: max_depth “auto” is correctly converted: .runAnomalyModel checks formax_depth == "auto"and sets it toround(log2(nrow(...)))before callingcalculate_anomaly_score(R/utils_anomaly_score.R 185–187).
249-251: Write-permission check has operator-precedence bug.!file.access(..) == 0 is always FALSE. Compare directly against 0.
- if (!file.access(log_dir, mode = 2) == 0) { - stop("No write permission for log file directory: ", log_dir) - } + if (file.access(log_dir, mode = 2) != 0) { + stop("No write permission for log file directory: ", log_dir) + }Likely an incorrect or invalid review comment.
src/isolation_forest.cpp (1)
198-201: Column names in return value: confirm downstream expectations.Downstream code must expect avg_depth and anomaly_score. Verify R wrappers/tests use these exact names.
R/utils_anomaly_score.R (1)
197-197: Anomaly score clipping uses incorrect functionUsing
pmaxwith 0.001 as minimum value, but the comment says "clip to stop them from exploding" which suggests you want to limit maximum values, not minimum.Please verify the intended behavior. If you want to limit maximum values, use
pmin. If you want to ensure minimum values (e.g., avoid zeros), the current code is correct but the comment should be updated:model_results = unlist(model_results) - # Clip anomaly scores to stop them from exploding + # Ensure minimum anomaly score of 0.001 to avoid zero values input_data$AnomalyScores = pmax(model_results, .001)man/SpectronauttoMSstatsFormat.Rd (1)
18-26: Validation for anomaly detection parameters is already implemented
The helper.validateMSstatsConverterParametersenforces that whencalculateAnomalyScores=TRUE,anomalyModelFeaturesis non-empty,anomalyModelFeatureTemporal(if provided) matches length and allowed values, and validatesn_trees,max_depth,anomalyModelFeatureCount,removeMissingFeatures, andrunOrder.R/MSstatsConvert_core_functions.R (3)
367-367: LGTM! Clean integration of anomaly metricsThe addition of the
anomaly_metricsparameter with a sensible default value integrates well with the existing function signature.
384-387: Parameter propagation looks correctThe
anomaly_metricsparameter is correctly passed through to the internal.cleanByFeaturefunction, maintaining the data flow for anomaly detection features.
412-412: LGTM! Consistent propagation of anomaly metrics through the pipelineThe
anomaly_metricsparameter is correctly added toMSstatsBalancedDesignand properly propagated through to.makeBalancedDesignand.MSstatsFormatfunctions, ensuring the anomaly detection data flows through the entire preprocessing pipeline.Also applies to: 429-429, 442-442, 452-452
R/converters_SpectronauttoMSstatsFormat.R (2)
48-74: LGTM! Comprehensive parameter validation setupThe validation configuration object comprehensively captures all function parameters for validation, which is a good practice for maintaining data integrity.
79-79: Add validation for standardized feature names
AfteranomalyModelFeatures = .standardizeColnames(anomalyModelFeatures), verify they’re present in the data—for example:missing_feats <- setdiff(anomalyModelFeatures, colnames(input)) if (length(missing_feats)) { stop("Unknown anomalyModelFeatures after standardization: ", paste(missing_feats, collapse=", ")) }
| # runOrder not provided | ||
| expect_error(SpectronauttoMSstatsFormat( | ||
| spectronaut_raw, | ||
| annotation = annotation, | ||
| calculateAnomalyScores = TRUE, | ||
| anomalyModelFeatures = c("FG.ShapeQualityScore (MS2)", | ||
| "FG.ShapeQualityScore (MS1)", | ||
| "EG.ApexRT"), | ||
| anomalyModelFeatureTemporal = c("mean_decrease", | ||
| "mean_increase", | ||
| "dispersion_increase"), | ||
| removeMissingFeatures = 0.5, | ||
| anomalyModelFeatureCount = 100, | ||
| runOrder = NULL, | ||
| n_trees = 100, | ||
| max_depth = "auto" | ||
| )) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
annotation is referenced but not defined in this test file
The error-path tests pass annotation = annotation without creating annotation. Define a minimal annotation data.frame before these blocks to avoid NameError.
# Spectronaut quality features error handling -------------------------------
# runOrder not provided
+annotation <- data.table::data.table(
+ Run = unique(spectronaut_raw$R.FileName),
+ Condition = spectronaut_raw$R.Condition[match(unique(spectronaut_raw$R.FileName), spectronaut_raw$R.FileName)],
+ BioReplicate = seq_along(unique(spectronaut_raw$R.FileName))
+)
expect_error(SpectronauttoMSstatsFormat(🤖 Prompt for AI Agents
In inst/tinytest/test_converters_SpectronauttoMSstatsFormat.R around lines
111–127, the test calls expect_error(..., annotation = annotation) but no
annotation object is defined; create a minimal annotation data.frame immediately
before these test blocks (or at top of the file) that includes the columns the
SpectronauttoMSstatsFormat function expects (e.g., run identifier and any
required metadata such as Run/Run.ID, Condition, BioReplicate) with appropriate
types and at least one row so the tests can reference annotation without raising
a NameError.
| anomaly_output = MSstatsConvert:::.runAnomalyModel(df, | ||
| n_trees=100, | ||
| max_depth="auto", | ||
| cores=1, | ||
| split_column="PSM", | ||
| quality_metrics=c("QualityMetric.mean_increase", | ||
| "QualityMetric.mean_decrease", | ||
| "QualityMetric.dispersion_increase")) | ||
| return(anomaly_output) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seed the RNG to prevent flaky tests.
Isolation Forest is stochastic; many assertions depend on ranking/correlation. Set a seed inside the helper.
run_quality_metrics = function(df, mean_increase, mean_decrease, dispersion_increase) {
+ set.seed(42)
df$QualityMetric.mean_increase = mean_increase
df$QualityMetric.mean_decrease = mean_decrease
df$QualityMetric.dispersion_increase = dispersion_increase
anomaly_output = MSstatsConvert:::.runAnomalyModel(df,
- n_trees=100,
+ n_trees=100,
max_depth="auto",
cores=1,If available, prefer passing a deterministic seed/random_state into the model instead of globally seeding.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| anomaly_output = MSstatsConvert:::.runAnomalyModel(df, | |
| n_trees=100, | |
| max_depth="auto", | |
| cores=1, | |
| split_column="PSM", | |
| quality_metrics=c("QualityMetric.mean_increase", | |
| "QualityMetric.mean_decrease", | |
| "QualityMetric.dispersion_increase")) | |
| return(anomaly_output) | |
| run_quality_metrics = function(df, mean_increase, mean_decrease, dispersion_increase) { | |
| set.seed(42) | |
| df$QualityMetric.mean_increase = mean_increase | |
| df$QualityMetric.mean_decrease = mean_decrease | |
| df$QualityMetric.dispersion_increase = dispersion_increase | |
| anomaly_output = MSstatsConvert:::.runAnomalyModel( | |
| df, | |
| n_trees = 100, | |
| max_depth = "auto", | |
| cores = 1, | |
| split_column = "PSM", | |
| quality_metrics = c( | |
| "QualityMetric.mean_increase", | |
| "QualityMetric.mean_decrease", | |
| "QualityMetric.dispersion_increase" | |
| ) | |
| ) | |
| return(anomaly_output) | |
| } |
🤖 Prompt for AI Agents
In inst/tinytest/test_utils_anomaly_score.R around lines 8 to 16, the call to
MSstatsConvert:::.runAnomalyModel is nondeterministic and makes tests flaky;
either seed the RNG locally before calling the helper (e.g. call
set.seed(<fixed>) immediately prior to the .runAnomalyModel call and optionally
save/restore .Random.seed to avoid global side effects), or better, update
.runAnomalyModel to accept a seed/random_state argument and pass that through to
the underlying isolation-forest routine so the model is deterministic; implement
one of these approaches and update the test to use the deterministic option.
| temporal = spectronaut_raw[, .(Condition = unique(Condition), | ||
| Run = unique(Run)), | ||
| by = BioReplicate] | ||
| temporal$Order = seq(1:nrow(temporal)) | ||
| temporal = temporal[, c("Run", "Order")] | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: run_order has duplicated Runs and mismatched lengths.
Using \code{unique(Condition)} per group can return length > 1; assigning both Condition and Run together then building Order yields duplicate Runs with conflicting orders.
Apply this fix to map each Run to a single order:
-temporal = spectronaut_raw[, .(Condition = unique(Condition),
- Run = unique(Run)),
- by = BioReplicate]
-temporal$Order = seq(1:nrow(temporal))
-temporal = temporal[, c("Run", "Order")]
+temporal = unique(spectronaut_raw[, .(Run)])
+temporal[, Order := seq_len(.N)]Optionally sort by Run or acquisition time if available before assigning Order.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| temporal = spectronaut_raw[, .(Condition = unique(Condition), | |
| Run = unique(Run)), | |
| by = BioReplicate] | |
| temporal$Order = seq(1:nrow(temporal)) | |
| temporal = temporal[, c("Run", "Order")] | |
| # Replace the block starting at line 347 with: | |
| temporal = unique(spectronaut_raw[, .(Run)]) | |
| temporal[, Order := seq_len(.N)] |
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (4)
R/converters_SpectronauttoMSstatsFormat.R (4)
11-11: Fix grammar and finalize semantics for temporal vectorChange “vector must have as many values” → “must have the same length,” and decide between NULL or FALSE for “no temporal feature engineering” per feature. Prior review requested FALSE; code/docs still use NULL. Align docs, validation, and implementation.
-#' @param anomalyModelFeatureTemporal character vector of temporal direction corresponding to columns passed to anomalyModelFeatures. Values must be one of: `mean_decrease`, `mean_increase`, `dispersion_increase`, or NULL (to perform no temporal feature engineering). Default is empty vector. If calculateAnomalyScores=TRUE, vector must have as many values as anomalyModelFeatures (even if all NULL). +#' @param anomalyModelFeatureTemporal character vector parallel to anomalyModelFeatures. Each entry must be one of: `mean_decrease`, `mean_increase`, `dispersion_increase`, or `NULL` (no temporal feature engineering for that feature). Default is empty vector. If calculateAnomalyScores=TRUE, the vector must have the same length as anomalyModelFeatures.If the code actually supports FALSE instead of NULL for “no temporal feature engineering,” update the above accordingly and ensure the validator enforces equal lengths.
14-14: Clarify that run order must be sequential integersMake the constraint explicit to help users.
-#' @param runOrder Temporal order of MS runs. Should be a two column data.table with columns `Run` and `Order`, where `Run` matches the run name output by Spectronaut and `Order` is an integer. Used to engineer the temporal features defined in anomalyModelFeatureTemporal. +#' @param runOrder Temporal order of MS runs. Two-column data.table with `Run` and `Order`, where `Run` matches Spectronaut run names and `Order` is a sequential integer (1, 2, 3, ...). Used to engineer the temporal features in anomalyModelFeatureTemporal.
136-141: Wrap anomaly scoring in tryCatch for actionable errors and logsSurface clear messages and keep logs consistent.
- if (calculateAnomalyScores){ - input = MSstatsConvert::MSstatsAnomalyScores( - input, anomalyModelFeatures, anomalyModelFeatureTemporal, - removeMissingFeatures, anomalyModelFeatureCount, runOrder, n_trees, - max_depth, numberOfCores) - } + if (calculateAnomalyScores){ + tryCatch({ + input <- MSstatsConvert::MSstatsAnomalyScores( + input, anomalyModelFeatures, anomalyModelFeatureTemporal, + removeMissingFeatures, anomalyModelFeatureCount, runOrder, n_trees, + max_depth, numberOfCores) + }, error = function(e) { + msg <- paste("Failed to calculate anomaly scores:", conditionMessage(e)) + getOption("MSstatsLog")("ERROR", msg) + getOption("MSstatsMsg")("ERROR", msg) + stop(msg, call. = FALSE) + }) + }
126-129: Avoid overwriting existing IsotopeLabelType valuescolumns_to_fill unconditionally sets IsotopeLabelType = "L", risking data loss. Fill only when the column is missing or NA.
- exact_filtering = list(excluded_quant = excluded_quant_filter), - columns_to_fill = list("IsotopeLabelType" = "L"), + exact_filtering = list(excluded_quant = excluded_quant_filter), + columns_to_fill = NULL, anomaly_metrics = anomalyModelFeatures)Add before MSstatsPreprocess (outside this hunk):
# Preserve existing labels; add default only if missing/NA if (!"IsotopeLabelType" %in% colnames(input)) { input[, IsotopeLabelType := "L"] } else { input[is.na(IsotopeLabelType) | IsotopeLabelType == "", IsotopeLabelType := "L"] }
🧹 Nitpick comments (3)
R/converters_SpectronauttoMSstatsFormat.R (3)
3-3: Docs vs behavior mismatch for F.ExcludedFromQuantificationDocs say rows “will be removed,” but code config uses behavior="fill" + drop_column, which sets values to NA and drops the column (not guaranteed row removal). Clarify this and mention the standardized column name without dot.
-#' @param input name of Spectronaut output, which is long-format. ProteinName, PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge, IsotopeLabelType, Condition, BioReplicate, Run, Intensity, F.ExcludedFromQuantification are required. Rows with F.ExcludedFromQuantification=True will be removed. +#' @param input name of Spectronaut output (long-format). Required columns: ProteinName, PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge, IsotopeLabelType, Condition, BioReplicate, Run, Intensity, F.ExcludedFromQuantification (Spectronaut), which is standardized to FExcludedFromQuantification by import. When excludedFromQuantificationFilter=TRUE, intensities with FExcludedFromQuantification=TRUE are set to NA and the column is dropped (rows may be removed downstream).
7-8: Column naming in docs (EG.Qvalue vs EGQvalue)Standardization removes punctuation; adjust docs to avoid confusion.
-#' @param filter_with_Qvalue FALSE(default) will not perform any filtering. TRUE will filter out the intensities that have greater than qvalue_cutoff in EG.Qvalue column. Those intensities will be replaced with zero and will be considered as censored missing values for imputation purpose. -#' @param qvalue_cutoff Cutoff for EG.Qvalue. default is 0.01. +#' @param filter_with_Qvalue FALSE (default) performs no filtering. TRUE filters intensities with EG.Qvalue/EGQvalue > qvalue_cutoff (column name after standardization is EGQvalue). Filtered intensities are replaced with zero and treated as censored missing values for imputation. +#' @param qvalue_cutoff Cutoff for EG.Qvalue/EGQvalue. Default is 0.01.
48-72: Enforce key invariants in validator (lengths, empties, cores on Windows)Ensure .validateMSstatsConverterParameters checks:
- calculateAnomalyScores implies length(anomalyModelFeatures) > 0.
- If anomalyModelFeatureTemporal provided, its length equals anomalyModelFeatures.
- If numberOfCores > 1 on Windows, warn and coerce to 1.
# Example checks (either inside .validateMSstatsConverterParameters or here before use) if (calculateAnomalyScores && length(anomalyModelFeatures) == 0L) stop("anomalyModelFeatures must be non-empty when calculateAnomalyScores=TRUE") if (calculateAnomalyScores && length(anomalyModelFeatureTemporal) > 0L && length(anomalyModelFeatureTemporal) != length(anomalyModelFeatures)) stop("anomalyModelFeatureTemporal must have the same length as anomalyModelFeatures") if (.Platform$OS.type == "windows" && numberOfCores > 1L) { getOption("MSstatsMsg")("WARN", "Parallel anomaly scoring is not supported on Windows; using 1 core.") numberOfCores <- 1L }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
R/converters_SpectronauttoMSstatsFormat.R(4 hunks)
🔇 Additional comments (6)
R/converters_SpectronauttoMSstatsFormat.R (6)
35-47: API additions read cleanly; defaults look reasonableNo functional concerns with the new parameters and defaults.
79-80: Good: standardizing feature names earlyThis avoids downstream mismatches.
84-86: Plumbing anomaly flags into MSstatsClean looks correctNo issues spotted.
105-112: Fix confirmed: excludedFromQuantificationFilter is now honoredThe filter flag now respects the function parameter. Nice.
131-135: BalancedDesign wiring for anomaly_metrics is consistentLooks good.
16-16: Verified: “auto” uses base-2 log2(N) on the number of runs; documentation is accurate.
Motivation and Context
This branch implements the anomaly detection model in MSstats+ allowing us to incorporate information about the underlying peak quality into the differential analysis. Currently only implemented directly into the Spectronaut converter, but provides functionalities to be implemented for any tool.
Changes
Testing
Please describe any unit tests you added or modified to verify your changes.
Checklist Before Requesting a Review
Summary by CodeRabbit