Rigorous biomarker discovery from high-dimensional omics data with zero data leakage.
All 11 core modules validated on TCGA pan-cancer miRNA data (10,366 samples, 2,566 features):
| Module | Status | Key Metrics |
|---|---|---|
| OmicPipeline | PASS | Quick validation AUC: 0.936 |
| BenchmarkService | PASS | Nested CV AUC: 0.876 (honest estimate) |
| GOF Filters | PASS | Discovered miR-139/183/145 family |
| Bayesian Tuning | PASS | Training AUC: 0.984 |
| AutoXAI | PASS | 9 correlation warnings flagged |
| Stability Ensemble | PASS | 13 features at 100% stability |
| Sequential Selector (HSFS) | PASS | AUC: 0.965 |
| Synthetic Data (SMOTE) | PASS | Quality score validated |
| Calibration | PASS | Platt scaling: 0-0.56 → 0.003-0.96 |
| FrozenComBat | PASS | Batch correction validated |
| Multi-Omics | PASS | Combined AUC: 0.827 |
| Deep Learning | SKIPPED | torch not installed |
Top Biomarkers: miR-183-5p, miR-145-5p, and miR-182-5p are established cancer biomarkers with known oncogenic or tumor-suppressor roles.
OmicSelector is an R package for biomarker discovery that enforces methodologically sound machine learning practices. Built on the mlr3 ecosystem, it guarantees:
- Zero Data Leakage: Feature selection occurs strictly inside cross-validation folds (see the sketch after this list)
- Proper Nested CV: Separation of outer (evaluation) and inner (selection) loops
- Feature Stability: Nogueira Stability Index for reproducible biomarker sets
- Multi-Objective Selection: Balance performance, stability, and parsimony
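The zero-leakage guarantee is easiest to see in mlr3 terms. Below is an illustrative sketch (not package code) using mlr3's built-in sonar task as a stand-in for omics data; it contrasts the leaky anti-pattern with the embedded-filter pattern that OmicSelector constructs for you.

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3filters)

task <- tsk("sonar")  # built-in binary task standing in for an omics dataset

# Leaky anti-pattern: scoring features on the FULL dataset lets test folds
# influence selection and inflates CV estimates:
#   scores <- flt("anova")$calculate(task)$scores
#   task$select(names(sort(scores, decreasing = TRUE))[1:20])

# Leakage-free pattern: embed the filter in the learner, so selection is
# refit on the training portion of every fold inside resample().
graph <- po("filter", filter = flt("anova"), filter.nfeat = 20) %>>%
  po("learner", lrn("classif.rpart", predict_type = "prob"))
rr <- resample(task, as_learner(graph), rsmp("cv", folds = 5))
rr$aggregate(msr("classif.auc"))
```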
# Install from GitHub
remotes::install_github("kstawiski/OmicSelector")
# Load the package
library(OmicSelector)

OmicSelector expects a data.frame with:
- Feature columns: Numeric values (gene expression, miRNA counts, etc.)
- Target column: Factor or character for classification, numeric for regression
- Optional metadata: patient_id, batch, etc.
# Your data should look like this:
head(my_data)
#> gene_A gene_B gene_C gene_D gene_E outcome
#> 1 2.34 1.56 3.21 0.89 1.45 Case
#> 2 1.12 2.89 0.45 2.11 0.78 Control
#> 3 3.45 0.23 2.67 1.34 2.90 Case
#> ...
# Required structure:
# - All feature columns must be numeric
# - Target column can be factor, character (classification) or numeric (regression)
# - No row names required (use a patient_id column if needed)

# Simulate expression data for 100 samples, 50 genes
set.seed(42)
n_samples <- 100
n_genes <- 50
# Create feature matrix
features <- matrix(rnorm(n_samples * n_genes), nrow = n_samples)
colnames(features) <- paste0("gene_", 1:n_genes)
# Add some signal to first 5 genes
outcome <- factor(rep(c("Case", "Control"), each = n_samples/2))
features[outcome == "Case", 1:5] <- features[outcome == "Case", 1:5] + 1.5
# Combine into data.frame
my_data <- data.frame(features, outcome = outcome)

For multi-omics, pass a named list of data.frames (same samples, same row order):
# RNA-seq data (100 samples x 1000 genes)
rna_data <- data.frame(matrix(rnorm(100 * 1000), nrow = 100))
names(rna_data) <- paste0("gene_", 1:1000)
# miRNA data (100 samples x 200 miRNAs)
mirna_data <- data.frame(matrix(rnorm(100 * 200), nrow = 100))
names(mirna_data) <- paste0("miR_", 1:200)
# Target vector
outcome <- factor(rep(c("Case", "Control"), each = 50))
# Create multi-omics pipeline
pipeline <- OmicPipeline$new(
data = list(rna = rna_data, mirna = mirna_data),
target = outcome,
positive = "Case"
)
# Features are automatically namespaced: rna::gene_1, mirna::miR_1, etc.

library(OmicSelector)
# Create pipeline from data
pipeline <- OmicPipeline$new(
data = my_data, # Your data.frame
target = "outcome", # Name of the target column
positive = "Case" # Positive class for AUC calculation
)
#> OmicPipeline created: 100 samples, 50 features, target='outcome'

# Create a learner with embedded feature selection
learner <- pipeline$create_graph_learner(
filter = "anova", # Feature selection method
model = "ranger", # Classification model
n_features = 10 # Select top 10 features
)

# Create benchmark service
benchmark <- BenchmarkService$new(
task = pipeline, # Your pipeline
outer_folds = 5, # Outer CV folds (evaluation)
inner_folds = 3, # Inner CV folds (selection)
seed = 42 # For reproducibility
)
# Add learner and run
benchmark$add_learner(learner)
result <- benchmark$run()
#> Running nested cross-validation...
#> Completed in 12.3 seconds
# View results
print(result)

Run the interactive WebUI locally:
shiny::runApp("shiny/omicselector", host = "0.0.0.0", port = 3838)

Run via Docker (Shiny WebUI + RStudio Server):
docker compose up

Then open:
- WebUI: http://localhost:3838
- RStudio Server: http://localhost:8787 (user: rstudio, password: omicselector)
Uploads and results are saved under /OmicSelector/analyses/<analysis_id> inside
the container and persisted to ./analyses on the host. Set
OMICSELECTOR_ANALYSES_DIR to override the storage location.
pipeline <- OmicPipeline$new(
data = my_data,
target = "outcome",
positive = "Case"
)
learner <- pipeline$create_graph_learner(
filter = "anova",
model = "rpart",
n_features = 20
)
fit <- pipeline$fit(learner, seed = 1)
fit$selected_features

The same workflow runs unchanged on a multi-omics list:

pipeline <- OmicPipeline$new(
data = list(rna = rna_data, mirna = mirna_data),
target = "outcome",
positive = "Case"
)
learner <- pipeline$create_graph_learner(
filter = "mrmr",
model = "rpart",
n_features = 30
)

learner <- pipeline$create_graph_learner(
filter = "anova",
model = "rpart",
n_features = 20,
screening = TRUE,
screening_frac = 0.2
)
result <- pipeline$benchmark(
learners = learner,
outer_folds = 5,
inner_folds = 3,
seed = 42,
cache_dir = "cache",
parallel = TRUE,
threads = 1
)

# Requires: torch + mlr3torch
learner <- pipeline$create_graph_learner(
filter = "anova",
model = "mlp",
n_features = 50,
autoencoder = list(
latent_dim = 32,
hidden_layers = c(128, 64),
epochs = 50,
batch_size = 64,
early_stopping = TRUE,
patience = 10
)
)
result <- pipeline$benchmark(
learners = learner,
outer_folds = 3,
inner_folds = 2,
seed = 1
)

Transfer learning via autoencoder pretraining:
ae <- autoencoder_fit(x_train, latent_dim = 32, epochs = 50)
autoencoder_save(ae, "pretrained_ae.pt")
learner <- pipeline$create_graph_learner(
filter = "anova",
model = "mlp",
n_features = 50,
autoencoder = list(
pretrained = "pretrained_ae.pt",
freeze_encoder = TRUE
)
)

Checkpoint export/import for fine-tuning:
fit <- pipeline$fit(learner, seed = 1)
export_omicfit_checkpoint(fit, "mlp_checkpoint.pt")
finetuned <- finetune_mlr3torch_checkpoint(
learner = learner,
task = pipeline$get_task(),
path = "mlp_checkpoint.pt",
epochs = 30
)
# Or stay at the OmicFit level
finetuned_fit <- finetune_omicfit_checkpoint(
fit = fit,
task = pipeline$get_task(),
path = "mlp_checkpoint.pt",
epochs = 30
)

Use the filter parameter in create_graph_learner():
Univariate Statistical Tests
| Method | Code | Description | Best For |
|---|---|---|---|
| ANOVA F-test | `"anova"` | Tests mean differences between classes | Default choice, continuous features |
| Kruskal-Wallis | `"kruskal"` | Non-parametric rank-based test | Non-normal distributions |
| Chi-Squared | `"chi_squared"` | Chi-squared test | Discrete/categorical features |

Variance-Based
| Method | Code | Description | Best For |
|---|---|---|---|
| Variance | `"variance"` | Removes low-variance features | Pre-filtering, removing constants |

Correlation-Based
| Method | Code | Description | Best For |
|---|---|---|---|
| Correlation | `"correlation"` | Correlation with target | Quick univariate filter |

Information-Theoretic
| Method | Code | Description | Best For |
|---|---|---|---|
| Information Gain | `"information_gain"` | Entropy-based importance | Mixed feature types |
| Gain Ratio | `"gain_ratio"` | Normalized information gain | Avoiding bias toward high-cardinality |
| mRMR | `"mrmr"` | Min Redundancy Max Relevance | Reducing feature redundancy |
| CMIM | `"cmim"` | Conditional Mutual Info Max | Complex feature dependencies |
| JMIM | `"jmim"` | Joint Mutual Info Max | Capturing feature interactions |
| JMI | `"jmi"` | Joint Mutual Information | Similar to JMIM |

Model-Based Importance
| Method | Code | Description | Best For |
|---|---|---|---|
| AUC | `"auc"` | AUC of univariate models | Classification performance |
| Relief | `"relief"` | Instance-based algorithm | Detecting interactions |
| RF Importance | `"importance"` | Random Forest importance | Non-linear relationships |
| Permutation | `"permutation"` | Permutation importance | Model-agnostic importance |
# Example: Try different filters
learner_anova <- pipeline$create_graph_learner(filter = "anova", model = "ranger", n_features = 20)
learner_mrmr <- pipeline$create_graph_learner(filter = "mrmr", model = "ranger", n_features = 20)
learner_relief <- pipeline$create_graph_learner(filter = "relief", model = "xgboost", n_features = 15)

Use the model parameter in create_graph_learner():
Tree-Based Ensembles
| Model | Code | Description | Strengths |
|---|---|---|---|
| Random Forest | `"ranger"` | Fast RF implementation | Handles interactions, robust, default choice |
| XGBoost | `"xgboost"` | Gradient boosting | High performance, handles missing values |
| LightGBM | `"lightgbm"` | Fast gradient boosting | Very fast, memory efficient |
| Decision Tree | `"rpart"` | Single CART tree | Interpretable, fast |

Linear Models
| Model | Code | Description | Strengths |
|---|---|---|---|
| Elastic Net | `"glmnet"` | L1/L2 regularized regression | Interpretable coefficients, feature selection |
| Logistic Regression | `"log_reg"` | Simple logistic regression | Baseline, highly interpretable |

Distance-Based Methods
| Model | Code | Description | Strengths |
|---|---|---|---|
| SVM | `"svm"` | Support Vector Machine (RBF) | High-dimensional data |
| k-NN | `"kknn"` | K-Nearest Neighbors | Non-parametric, simple |

Probabilistic Classifiers
| Model | Code | Description | Strengths |
|---|---|---|---|
| Naive Bayes | `"naive_bayes"` | Probabilistic classifier | Fast, works with small data |
| LDA | `"lda"` | Linear Discriminant Analysis | Dimensionality reduction |
| QDA | `"qda"` | Quadratic Discriminant Analysis | Non-linear decision boundaries |

Neural Networks
| Model | Code | Description | Strengths |
|---|---|---|---|
| Neural Net | `"nnet"` | Single-layer neural network | Non-linear relationships |
# Example: Compare models
learner_rf <- pipeline$create_graph_learner(filter = "anova", model = "ranger", n_features = 15)
learner_xgb <- pipeline$create_graph_learner(filter = "anova", model = "xgboost", n_features = 15)
learner_glmnet <- pipeline$create_graph_learner(filter = "anova", model = "glmnet", n_features = 15)

OmicSelector is built on the mlr3 ecosystem. You can use any mlr3 learner directly with BenchmarkService:
library(mlr3learners)
# Get the task from the pipeline
task <- pipeline$get_task()
# Create any mlr3 learner with probability predictions
xgboost_learner <- lrn("classif.xgboost", predict_type = "prob")
nnet_learner <- lrn("classif.nnet", predict_type = "prob")
# Wrap in a graph with feature selection (optional)
library(mlr3pipelines)
library(mlr3filters)
xgb_graph <- po("filter", filter = flt("anova"), filter.nfeat = 20) %>>%
po("learner", xgboost_learner)
# Convert to GraphLearner
xgb_graph_learner <- as_learner(xgb_graph)
xgb_graph_learner$id <- "xgboost_anova_20"
# Use in benchmark
benchmark <- BenchmarkService$new(task = pipeline, outer_folds = 5, seed = 42)
benchmark$add_learner(xgb_graph_learner)
result <- benchmark$run()

See the mlr3learners documentation for all available learners.
learner <- pipeline$create_graph_learner(
# Feature Selection
filter = "anova", # Filter method (see table above)
n_features = 20, # Number of features to select
# Use decimal < 1 for proportion (e.g., 0.1 = top 10%)
# Preprocessing
impute_method = "median", # Handle missing values: "median", "mean", "sample", "pmm"
scale = TRUE, # Scale features to zero mean, unit variance
# Class Imbalance (applied INSIDE each CV fold)
oversample = NULL, # NULL (none), "smote", or "rose"
# Batch Effect Correction
batch_correct = FALSE # TRUE to apply FrozenComBat (requires batch column)
)

For complex missing data patterns, use PMM imputation via the mice package:
# PMM uses predictive models to impute missing values
# Best for: MAR (Missing At Random) data, preserving distributions
learner <- pipeline$create_graph_learner(
filter = "anova",
model = "ranger",
n_features = 20,
impute_method = "pmm" # Predictive Mean Matching via mice
)
# PMM advantages:
# - Preserves data distribution (unlike mean imputation)
# - Handles complex correlation structures
# - Produces plausible values from observed data
# - Applied properly inside each CV fold to prevent leakage
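For intuition, here is PMM on its own, outside the pipeline (toy data; OmicSelector applies the equivalent step inside each CV fold, so you normally never call mice directly):

```r
# Standalone PMM illustration with the mice package (toy data).
library(mice)
set.seed(7)
df <- data.frame(x = rnorm(50), y = rnorm(50))
df$x[sample(50, 5)] <- NA                    # knock out a few values
imp <- mice(df, method = "pmm", m = 1, printFlag = FALSE)
complete(imp)$x                              # imputed values are borrowed from
                                             # observed "donor" rows
```

# Create benchmark service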
benchmark <- BenchmarkService$new(pipeline, outer_folds = 5, inner_folds = 3, seed = 42)
# Add multiple learners with different configurations
benchmark$add_learner(
pipeline$create_graph_learner(filter = "anova", model = "ranger", n_features = 10),
id = "anova_rf_10"
)
benchmark$add_learner(
pipeline$create_graph_learner(filter = "anova", model = "ranger", n_features = 20),
id = "anova_rf_20"
)
benchmark$add_learner(
pipeline$create_graph_learner(filter = "mrmr", model = "glmnet", n_features = 15),
id = "mrmr_glmnet_15"
)
benchmark$add_learner(
pipeline$create_graph_learner(filter = "variance", model = "svm", n_features = 25),
id = "var_svm_25"
)
# Run all configurations
result <- benchmark$run()
# Compare performance
print(result$performance)
#> learner_id classif.auc classif.acc classif.bbrier
#> 1 anova_rf_10 0.892 0.840 0.112
#> 2 anova_rf_20 0.901 0.850 0.098
#> 3 mrmr_glmnet_15 0.885 0.830 0.121
#> 4 var_svm_25 0.878 0.820 0.134

OmicSelector 2.0 includes state-of-the-art methods validated on TCGA data:
Traditional filters (ANOVA, variance) miss biologically important features that are "off" in one condition but expressed in another. GOF filters detect these patterns:
library(OmicSelector)
# Use KS filter - captures distributional differences including zero-inflation
learner <- pipeline$create_graph_learner(
filter = "gof_ks", # Kolmogorov-Smirnov GOF filter
model = "ranger",
n_features = 30
)
# Other GOF options:
# filter = "hurdle" # Two-part model (zero-frequency + magnitude)
# filter = "zero_prop" # Simple zero-proportion difference
# GOF filters discovered miR-200c/miR-141 family in TCGA kidney data
# that ANOVA missed - these are established EMT regulators in cancer
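To see why a distributional test catches what a mean test misses, consider a toy zero-inflated feature (simulated, base R only; illustrative numbers): the group means nearly coincide, so a t-test may look unremarkable, while the KS test flags the shape difference.

```r
# Toy zero-inflation example: similar means, very different shapes.
set.seed(1)
case    <- ifelse(runif(200) < 0.6, 0, rlnorm(200, meanlog = 2))  # 60% zeros
control <- rlnorm(200, meanlog = 1)                               # always "on"
t.test(case, control)$p.value                     # means close: weak signal
suppressWarnings(ks.test(case, control))$p.value  # shapes differ: strong signal
```

Replace grid search with intelligent Bayesian optimization: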
# Bayesian-tuned glmnet with omics-optimized search space
autotuner <- make_autotuner_glmnet(
task,
n_evals = 20, # Budget of evaluations
inner_folds = 3 # Inner CV for tuning
)
autotuner$train(task)
# Get optimal parameters
params <- get_optimal_params(autotuner)
# Returns: alpha, s (lambda) optimized for your data
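make_autotuner_glmnet() wraps a standard mlr3tuning AutoTuner. For learners without a ready-made helper, an equivalent tuner can be assembled by hand; in the sketch below the search-space bounds are illustrative assumptions, not the package's tuned defaults.

```r
# Hand-rolled Bayesian AutoTuner via mlr3tuning + mlr3mbo (bounds illustrative).
library(mlr3tuning)
library(mlr3learners)
library(mlr3mbo)    # provides tnr("mbo")
library(paradox)

at <- auto_tuner(
  tuner        = tnr("mbo"),
  learner      = lrn("classif.glmnet", predict_type = "prob"),
  resampling   = rsmp("cv", folds = 3),
  measure      = msr("classif.auc"),
  search_space = ps(
    alpha = p_dbl(0, 1),                      # L1/L2 mixing
    s     = p_dbl(1e-4, 1, logscale = TRUE)   # lambda used at prediction
  ),
  term_evals   = 20
)
at$train(task)
```

Get SHAP-based interpretability with automatic warnings for correlated features: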
# Train a model
learner <- lrn("classif.ranger", predict_type = "prob", importance = "impurity")
learner$train(task)
# Run XAI pipeline
xai <- xai_pipeline(
learner = learner,
task = task,
top_k = 20,
cor_threshold = 0.7 # Warn if features correlated > 0.7
)
# View results
print(xai$top_features) # Top features by permutation importance
print(xai$correlations$clusters) # Correlated feature clusters
plot_xai_importance(xai) # Visualization
# Warnings prevent SHAP misinterpretation when features are collinear

Find reproducible biomarkers across resamples:
# Create stability ensemble with multiple filters
ensemble <- create_stability_ensemble(
preset = "default", # Uses mRMR, AUC, Information Gain
n_bootstrap = 100, # 100 bootstrap iterations
n_features = 30
)
# Fit and get stable features
ensemble$fit(task, seed = 42)
stable <- ensemble$get_feature_importance(30)
# Features with >90% selection frequency are highly reproducible
# TCGA validation found 6 biomarkers with 100% stability

Proper SMOTE application inside CV folds:
# Balance classes with SMOTE (applied inside each fold)
task_balanced <- smote_augment(task, ratio = 1.0, k = 5)
# Validate synthetic data quality
validation <- validate_synthetic(real_data, synthetic_data)
print(validation$quality_score) # 0-1, higher is better

# Compute Nogueira Stability Index with Bootstrap Confidence Intervals
stability <- compute_stability_from_resample(result$benchmark_result)
print(stability)
#> Nogueira Stability Index: 0.823
#> Interpretation: Good stability
# Bootstrap CI for uncertainty quantification
stability_ci <- compute_nogueira_stability(
feature_matrix,
n_bootstrap = 1000, # Number of bootstrap iterations
conf_level = 0.95 # 95% confidence interval
)
print(stability_ci)
#> Nogueira Stability: 0.823
#> 95% Bootstrap CI: [0.756, 0.889]
#> Interpretation: Good stability (CI does not include 0.7 threshold)
# Get features selected in ALL folds (consensus)
consensus_features <- stability$consensus_features
print(consensus_features)
#> [1] "gene_1" "gene_3" "gene_5"
# Get selection frequency per feature
head(stability$selection_frequency)
#> gene_1 gene_3 gene_5 gene_2 gene_4
#> 1.00 1.00 1.00 0.80 0.60
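For reference, the Nogueira index is only a few lines of arithmetic. The helper below is a standalone sketch (hypothetical function, not the package's implementation) computing it from a binary selection matrix with resamples as rows and features as columns.

```r
# Nogueira (2018) stability from a binary selection matrix Z (folds x features).
nogueira_index <- function(Z) {
  M    <- nrow(Z); p <- ncol(Z)
  phat <- colMeans(Z)                       # selection frequency per feature
  s2   <- M / (M - 1) * phat * (1 - phat)   # unbiased per-feature variance
  kbar <- mean(rowSums(Z))                  # average signature size
  1 - mean(s2) / ((kbar / p) * (1 - kbar / p))
}

Z <- rbind(c(1, 1, 1, 0, 0),
           c(1, 1, 0, 1, 0),
           c(1, 1, 1, 0, 0))                # toy run: 3 folds, 5 features
nogueira_index(Z)                           # ~0.44: only moderate stability
```

# Select best signature balancing multiple objectives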
best <- select_best_signature(
result,
mode = "weighted", # Selection mode
stability_weight = 0.3, # Weight for stability (vs performance)
parsimony_weight = 0.1 # Weight for fewer features
)
# Available modes:
# - "best_auc": Highest AUC only
# - "best_stability": Highest Nogueira index only
# - "weighted": Weighted combination (default)
# - "pareto": Pareto-optimal with distance-to-utopia# After getting predictions from your model
probabilities <- predict(model, newdata = test_data, type = "prob")
true_labels <- test_data$outcome
# Compute calibration metrics
ece <- compute_ece(probabilities, true_labels)
brier <- decompose_brier(probabilities, true_labels)
# Fit calibration (Platt scaling or isotonic regression)
calibrator <- fit_platt_scaling(probabilities, true_labels)
# or: calibrator <- fit_isotonic_calibration(probabilities, true_labels)
# Apply to new predictions (calibrator is a function)
calibrated_probs <- calibrator(new_probabilities)
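For intuition, Platt scaling is simply a logistic regression refit on the model's logit-transformed scores. The sketch below assumes 0/1 labels and is illustrative, not necessarily the package's exact implementation.

```r
# Minimal Platt-scaling sketch: logistic regression on logit scores.
platt_sketch <- function(probs, y01) {
  eps <- 1e-6
  z   <- qlogis(pmin(pmax(probs, eps), 1 - eps))   # clamp, then logit
  fit <- glm(y01 ~ z, family = binomial)
  function(p_new) {
    z_new <- qlogis(pmin(pmax(p_new, eps), 1 - eps))
    as.numeric(predict(fit, data.frame(z = z_new), type = "response"))
  }
}
```

# Include batch information in pipeline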
pipeline <- OmicPipeline$new(
data = my_data,
target = "outcome",
batch = "batch_id" # Column containing batch labels
)
# Enable batch correction (parameters fit on training data only)
learner <- pipeline$create_graph_learner(
filter = "anova",
model = "ranger",
n_features = 20,
batch_correct = TRUE # Applies FrozenComBat inside CV
)

# Create explainer for a trained model
explainer <- create_explainer(trained_learner, task)
# Compute SHAP values
shap <- shap_values(explainer, new_data)
# Check for problematic correlations
# (SHAP can be misleading when features are highly correlated)
correlations <- check_feature_correlations(
data = my_data,
features = selected_features,
threshold = 0.7 # Warn if correlation > 0.7
)

# Create separate pipelines for each modality
rna_pipeline <- OmicPipeline$new(rna_data, target = outcome)
mirna_pipeline <- OmicPipeline$new(mirna_data, target = outcome)
# Create stacker
stacker <- MultiOmicsStacker$new(
modalities = list(
rna = rna_pipeline$get_task(),
mirna = mirna_pipeline$get_task()
),
meta_learner = mlr3::lrn("classif.glmnet")
)
# Train and predict
stacker$train()
predictions <- stacker$predict(new_data)

OmicSelector includes rigorous QC checks for clinical-grade biomarker discovery:
# Run comprehensive QC validation
qc_results <- project$run_qc_validation()
# QC checks include:
# - Sample size adequacy (minimum samples per class)
# - Dimensionality ratio (features vs samples)
# - Events Per Variable (EPV) for model stability
# - Missing data patterns
# - Feature quality (variance, zero-inflation)
print(qc_results)
#> QC Validation Results:
#> sample_size: PASS (n=200, min_class=87)
#> dimensionality: WARN (p/n ratio = 2.5, high dimensionality)
#> epv: PASS (EPV = 8.7, adequate for selected features)
#> missing_data: PASS (0.3% missing, handled with PMM)
#> feature_quality: PASS (no constant features detected)
# Warnings guide corrective action without blocking workflow
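The EPV check is plain arithmetic worth knowing: events (size of the smaller class) divided by the number of candidate predictors, with values below roughly 10 commonly flagged as risky. The numbers below mirror the QC output above.

```r
# EPV arithmetic behind the check (numbers mirror the example QC output).
n_events     <- 87         # smaller class in the QC example
n_predictors <- 10         # features entering the model
n_events / n_predictors    # 8.7; below ~10 warrants caution
```

Generate clinical-ready Model Cards for stakeholder communication: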
# After training a final model
model_card <- project$generate_model_card(
model_name = "Glioblastoma miRNA Classifier",
intended_use = "Diagnostic aid for glioblastoma classification",
target_population = "Adult patients with suspected CNS tumors",
clinical_context = "Pre-surgical tumor characterization"
)
# Model Card includes:
# - Performance metrics with confidence intervals
# - Training data characteristics
# - Intended use and limitations
# - Feature stability assessment
# - TRIPOD+AI compliance notes
# Export as HTML for stakeholders
export_model_card(model_card, "model_card.html")

For reproducible batch processing:
# config.yaml
data:
file: "data/expression.csv"
target: "outcome"
positive: "Case"
pipeline:
filter: "anova"
model: "ranger"
n_features: 20
scale: true
impute: "median"
benchmark:
outer_folds: 5
inner_folds: 3
seed: 42
stratify: true
output:
dir: "./results"
report_format: "html"

Run via CLI:
Rscript inst/bin/omicselector run --config=config.yaml

# Build image
docker build -f Dockerfile.core -t omicselector:2.0 .
# Run interactive R session
docker run -it --rm -v $(pwd):/workspace omicselector:2.0 R
# Run analysis with config
docker run -it --rm -v $(pwd):/workspace omicselector:2.0 \
Rscript inst/bin/omicselector run --config=/workspace/config.yaml

| Module | Description |
|---|---|
| `OmicPipeline` | Build mlr3 graphs with zero-leakage preprocessing |
| `BenchmarkService` | Nested CV with proper inner/outer loop separation |
| `select_best_signature()` | Multi-objective signature selection |
| `compute_nogueira_stability()` | Feature selection stability metrics |

| Module | Description |
|---|---|
| `FrozenComBat` | Batch correction with frozen parameters |
| `fit_platt_scaling()` | Platt calibration |
| `fit_isotonic_calibration()` | Isotonic regression calibration |

| Module | Description |
|---|---|
| `FilterGOF_KS` | Kolmogorov-Smirnov filter for sparse data |
| `FilterHurdle` | Two-part hurdle model filter |
| `FilterZeroProp` | Zero-proportion difference filter |
| `make_autotuner_glmnet()` | Bayesian-tuned elastic net |
| `make_autotuner_xgboost()` | Bayesian-tuned XGBoost |
| `make_autotuner_ranger()` | Bayesian-tuned Random Forest |
| `xai_pipeline()` | DALEX-based interpretability with correlation warnings |
| `plot_xai_importance()` | Feature importance visualization |
| `create_stability_ensemble()` | Bootstrap stability feature selection |
| `smote_augment()` | SMOTE for class imbalance |
| `validate_synthetic()` | Synthetic data quality validation |

| Module | Description |
|---|---|
| `validate_data_quality()` | Sample size, dimensionality, EPV checks |
| `generate_model_card()` | Clinical-ready Model Card for stakeholders |
| `generate_tripod_report()` | TRIPOD+AI compliant reporting |

| Module | Description |
|---|---|
| `MultiOmicsStacker` | Late integration of multi-omics data |
| `export_vetiver()` | Model export for deployment |
| `export_bundle()` | Self-contained model bundle with metadata |
| `export_onnx()` | ONNX export for cross-platform deployment |
@article{stawiski2022omicselector,
title={OmicSelector: automatic feature selection and deep learning
modeling for omic experiments},
author={Stawiski, Konrad and Kaszkowiak, Marcin and Mikulski, Damian and others},
journal={bioRxiv},
year={2022},
doi={10.1101/2022.06.01.494299}
}

- Konrad Stawiski, M.D., Ph.D. (konrad.stawiski@umed.lodz.pl)
- Marcin Kaszkowiak, M.D.
- Damian Mikulski, M.D.
Supervised by: Prof. Wojciech Fendler, M.D., Ph.D.
Department of Biostatistics and Translational Medicine, Medical University of Lodz, Poland