Skip to content

Conversation

@tonywu1999
Copy link
Contributor

@tonywu1999 tonywu1999 commented Dec 9, 2025

Motivation and Context

One user got this bug when using the DIANN converter, notably at the subfunction .cleanDIANNCleanAndFilterData()

vector memory limit of 24.0Gb reached, see mem.maxVsize()

Changes

  • na.omit is horribly inefficient. Replace it with a more efficient alternative.

Testing

Ran the following benchmark:

library(data.table)
library(microbenchmark)
library(pryr)

# Create a large test dataset similar to your use case
set.seed(123)
n_rows <- 5e6  # 5 million rows
test_data <- data.table(
  FragmentIon = sample(c("y1", "y2", "b1", "b2", "y1-NH3", "b1-H2O", "y3"), n_rows, replace = TRUE),
  Intensity = rnorm(n_rows),
  QuantColumn = c(rnorm(n_rows * 0.85), rep(NA, n_rows * 0.15))  # 15% NAs
)

print(paste("Dataset size:", format(object.size(test_data), units = "MB")))
print(paste("Number of NAs:", sum(is.na(test_data$QuantColumn))))

# Method 1: Original with na.omit
method1 <- function(dt) {
  dt <- dt[!grepl("NH3", FragmentIon)]
  dt <- dt[!grepl("H2O", FragmentIon)]
  dt <- na.omit(dt, cols = "QuantColumn")
  return(dt)
}

# Method 2: Direct subsetting
method2 <- function(dt) {
  dt <- dt[!grepl("NH3", FragmentIon) & !grepl("H2O", FragmentIon)]
  dt <- dt[!is.na(QuantColumn)]
  return(dt)
}

# Method 3: Single combined operation
method3 <- function(dt) {
  dt <- dt[!grepl("NH3", FragmentIon) & 
           !grepl("H2O", FragmentIon) & 
           !is.na(QuantColumn)]
  return(dt)
}

# Method 4: Compiled regex (even faster)
method4 <- function(dt) {
  dt <- dt[!grepl("NH3|H2O", FragmentIon) & !is.na(QuantColumn)]
  return(dt)
}

# Memory profiling using pryr (install if needed)
cat("\n=== Memory Usage (pryr) ===\n")
for(i in 1:4) {
  method <- get(paste0("method", i))
  mem_used <- mem_change(result <- method(copy(test_data)))
  cat(sprintf("Method %d: %s\n", i, format(mem_used, units = "MB")))
}

# Speed benchmark
cat("\n=== Speed Benchmark ===\n")
benchmark_result <- microbenchmark(
  na.omit = method1(copy(test_data)),
  direct_subset = method2(copy(test_data)),
  single_operation = method3(copy(test_data)),
  compiled_regex = method4(copy(test_data)),
  times = 10
)

print(benchmark_result)

Results:

"Dataset size: 114.4 Mb"
"Number of NAs: 750000"

=== Memory Usage (pryr) ===
Method 1: 74706640
Method 2: 51632
Method 3: 34592
Method 4: 34592

=== Speed Benchmark ===
Unit: milliseconds
             expr      min       lq     mean   median       uq       max neval
          na.omit 868.6980 913.4014 931.3736 922.4969 958.0429 1015.0258    10
    direct_subset 955.2743 957.7577 974.1970 965.6933 990.9305 1012.8817    10
 single_operation 895.6982 910.1253 939.8575 941.2111 962.7289  996.8566    10
   compiled_regex 581.6078 621.7588 644.6760 636.4692 663.2657  729.7551    10

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

Summary by CodeRabbit

  • Refactor
    • Optimized internal data filtering operations by consolidating multiple processing steps into a single, more efficient operation.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 9, 2025

Walkthrough

The change consolidates two separate filtering operations in the DIANN cleaning function into a single conditional expression. NH3/H2O fragment removal and NA-value filtering for the quantification column are now performed in one data.table subset operation without altering the function's overall behavior.

Changes

Cohort / File(s) Summary
DIANN filtering consolidation
R/clean_DIANN.R
Merges NH3/H2O loss fragment removal and NA-filtering for quantification column into a single conditional expression; replaces separate grepl calls and na.omit with unified data.table subset operation.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Single file refactoring that consolidates existing logic without changing behavior
  • Straightforward consolidation of two filtering steps into one condition
  • Minimal risk; verify that the combined conditional produces identical results to the original separate operations

Possibly related PRs

Suggested reviewers

  • mstaniak

Poem

🐰 Two filters became one today,
NH3 and NA gone away,
One subset does the work so neat,
The logic flow is now complete! ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: optimizing DIANN cleaning for memory and speed efficiency, which directly addresses the bug fix.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check ✅ Passed The pull request description includes all required sections with appropriate detail: Motivation and Context clearly describes the user-encountered bug, Changes explains the optimization approach, Testing provides comprehensive benchmark results, and the Checklist is completed with boxes checked.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch bug-diann-speed

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@tonywu1999 tonywu1999 marked this pull request as ready for review December 9, 2025 17:26
@tonywu1999 tonywu1999 merged commit 3dd8a40 into devel Dec 9, 2025
2 checks passed
@tonywu1999 tonywu1999 deleted the bug-diann-speed branch December 9, 2025 18:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants