Rigorous biomarker discovery from high-dimensional omics data with zero data leakage.
All 11 core modules validated on TCGA pan-cancer miRNA data (10,366 samples, 2,566 features):
| Module | Status | Key Metrics |
|---|---|---|
| OmicPipeline | PASS | Quick validation AUC: 0.936 |
| BenchmarkService | PASS | Nested CV AUC: 0.876 (honest estimate) |
| GOF Filters | PASS | Discovered miR-139/183/145 family |
| Bayesian Tuning | PASS | Training AUC: 0.984 |
| AutoXAI | PASS | 9 correlation warnings flagged |
| Stability Ensemble | PASS | 13 features at 100% stability |
| Sequential Selector (HSFS) | PASS | AUC: 0.965 |
| Synthetic Data (SMOTE) | PASS | Quality score validated |
| Calibration | PASS | Platt scaling: 0-0.56 → 0.003-0.96 |
| FrozenComBat | PASS | Batch correction validated |
| Multi-Omics | PASS | Combined AUC: 0.827 |
| Deep Learning | SKIPPED | torch not installed |
Top Biomarkers: miR-183-5p, miR-145-5p, and miR-182-5p are established cancer biomarkers with known oncogenic or tumor-suppressor roles.
OmicSelector is an R package for biomarker discovery that enforces methodologically sound machine learning practices. Built on the mlr3 ecosystem, it guarantees:
- Zero Data Leakage: Feature selection occurs strictly inside cross-validation folds (see the sketch after this list)
- Proper Nested CV: Separation of outer (evaluation) and inner (selection) loops
- Feature Stability: Nogueira Stability Index for reproducible biomarker sets
- Multi-Objective Selection: Balance performance, stability, and parsimony
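The zero-leakage guarantee is easiest to see in mlr3 terms. Below is an illustrative sketch (not package code) using mlr3's built-in sonar task as a stand-in for omics data; it contrasts the leaky anti-pattern with the embedded-filter pattern that OmicSelector constructs for you.

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3filters)

task <- tsk("sonar")  # built-in binary task standing in for an omics dataset

# Leaky anti-pattern: scoring features on the FULL dataset lets test folds
# influence selection and inflates CV estimates:
#   scores <- flt("anova")$calculate(task)$scores
#   task$select(names(sort(scores, decreasing = TRUE))[1:20])

# Leakage-free pattern: embed the filter in the learner, so selection is
# refit on the training portion of every fold inside resample().
graph <- po("filter", filter = flt("anova"), filter.nfeat = 20) %>>%
  po("learner", lrn("classif.rpart", predict_type = "prob"))
rr <- resample(task, as_learner(graph), rsmp("cv", folds = 5))
rr$aggregate(msr("classif.auc"))
```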
# Install from GitHub
remotes::install_github("kstawiski/OmicSelector")
# Load the package
library(OmicSelector)

OmicSelector expects a data.frame with:
- Feature columns: Numeric values (gene expression, miRNA counts, etc.)
- Target column: Factor or character for classification, numeric for regression
- Optional metadata: patient_id, batch, etc.
# Your data should look like this:
head(my_data)
#> gene_A gene_B gene_C gene_D gene_E outcome
#> 1 2.34 1.56 3.21 0.89 1.45 Case
#> 2 1.12 2.89 0.45 2.11 0.78 Control
#> 3 3.45 0.23 2.67 1.34 2.90 Case
#> ...
# Required structure:
# - All feature columns must be numeric
# - Target column can be factor, character (classification) or numeric (regression)
# - No row names required (use a patient_id column if needed)

# Simulate expression data for 100 samples, 50 genes
set.seed(42)
n_samples <- 100
n_genes <- 50
# Create feature matrix
features <- matrix(rnorm(n_samples * n_genes), nrow = n_samples)
colnames(features) <- paste0("gene_", 1:n_genes)
# Add some signal to first 5 genes
outcome <- factor(rep(c("Case", "Control"), each = n_samples/2))
features[outcome == "Case", 1:5] <- features[outcome == "Case", 1:5] + 1.5
# Combine into data.frame
my_data <- data.frame(features, outcome = outcome)

For multi-omics, pass a named list of data.frames (same samples, same row order):
# RNA-seq data (100 samples x 1000 genes)
rna_data <- data.frame(matrix(rnorm(100 * 1000), nrow = 100))
names(rna_data) <- paste0("gene_", 1:1000)
# miRNA data (100 samples x 200 miRNAs)
mirna_data <- data.frame(matrix(rnorm(100 * 200), nrow = 100))
names(mirna_data) <- paste0("miR_", 1:200)
# Target vector
outcome <- factor(rep(c("Case", "Control"), each = 50))
# Create multi-omics pipeline
pipeline <- OmicPipeline$new(
data = list(rna = rna_data, mirna = mirna_data),
target = outcome,
positive = "Case"
)
# Features are automatically namespaced: rna::gene_1, mirna::miR_1, etc.

library(OmicSelector)
# Create pipeline from data
pipeline <- OmicPipeline$new(
data = my_data, # Your data.frame
target = "outcome", # Name of the target column
positive = "Case" # Positive class for AUC calculation
)
#> OmicPipeline created: 100 samples, 50 features, target='outcome'

# Create a learner with embedded feature selection
learner <- pipeline$create_graph_learner(
filter = "anova", # Feature selection method
model = "ranger", # Classification model
n_features = 10 # Select top 10 features
)

# Create benchmark service
benchmark <- BenchmarkService$new(
task = pipeline, # Your pipeline
outer_folds = 5, # Outer CV folds (evaluation)
inner_folds = 3, # Inner CV folds (selection)
seed = 42 # For reproducibility
)
# Add learner and run
benchmark$add_learner(learner)
result <- benchmark$run()
#> Running nested cross-validation...
#> Completed in 12.3 seconds
# View results
print(result)

Run the interactive WebUI locally:
shiny::runApp("shiny/omicselector", host = "0.0.0.0", port = 3838)

Run via Docker (Shiny WebUI + RStudio Server):
docker compose up

Then open:
- WebUI: http://localhost:3838
- RStudio Server: http://localhost:8787 (user: rstudio, password: omicselector)
Uploads and results are saved under /OmicSelector/analyses/<analysis_id> inside
the container and persisted to ./analyses on the host. Set
OMICSELECTOR_ANALYSES_DIR to override the storage location.
pipeline <- OmicPipeline$new(
data = my_data,
target = "outcome",
positive = "Case"
)
learner <- pipeline$create_graph_learner(
filter = "anova",
model = "rpart",
n_features = 20
)
fit <- pipeline$fit(learner, seed = 1)
fit$selected_features

The same workflow runs unchanged on a multi-omics list:

pipeline <- OmicPipeline$new(
data = list(rna = rna_data, mirna = mirna_data),
target = "outcome",
positive = "Case"
)
learner <- pipeline$create_graph_learner(
filter = "mrmr",
model = "rpart",
n_features = 30
)

learner <- pipeline$create_graph_learner(
filter = "anova",
model = "rpart",
n_features = 20,
screening = TRUE,
screening_frac = 0.2
)
result <- pipeline$benchmark(
learners = learner,
outer_folds = 5,
inner_folds = 3,
seed = 42,
cache_dir = "cache",
parallel = TRUE,
threads = 1
)

# Requires: torch + mlr3torch
learner <- pipeline$create_graph_learner(
filter = "anova",
model = "mlp",
n_features = 50,
autoencoder = list(
latent_dim = 32,
hidden_layers = c(128, 64),
epochs = 50,
batch_size = 64,
early_stopping = TRUE,
patience = 10
)
)
result <- pipeline$benchmark(
learners = learner,
outer_folds = 3,
inner_folds = 2,
seed = 1
)

Transfer learning via autoencoder pretraining:
ae <- autoencoder_fit(x_train, latent_dim = 32, epochs = 50)
autoencoder_save(ae, "pretrained_ae.pt")
learner <- pipeline$create_graph_learner(
filter = "anova",
model = "mlp",
n_features = 50,
autoencoder = list(
pretrained = "pretrained_ae.pt",
freeze_encoder = TRUE
)
)

Checkpoint export/import for fine-tuning:
fit <- pipeline$fit(learner, seed = 1)
export_omicfit_checkpoint(fit, "mlp_checkpoint.pt")
finetuned <- finetune_mlr3torch_checkpoint(
learner = learner,
task = pipeline$get_task(),
path = "mlp_checkpoint.pt",
epochs = 30
)
# Or stay at the OmicFit level
finetuned_fit <- finetune_omicfit_checkpoint(
fit = fit,
task = pipeline$get_task(),
path = "mlp_checkpoint.pt",
epochs = 30
)

Use the filter parameter in create_graph_learner():
Univariate Statistical Tests
| Method | Code | Description | Best For |
|---|---|---|---|
| ANOVA F-test | `"anova"` | Tests mean differences between classes | Default choice, continuous features |
| Kruskal-Wallis | `"kruskal"` | Non-parametric rank-based test | Non-normal distributions |
| Chi-Squared | `"chi_squared"` | Chi-squared test | Discrete/categorical features |

Variance-Based
| Method | Code | Description | Best For |
|---|---|---|---|
| Variance | `"variance"` | Removes low-variance features | Pre-filtering, removing constants |

Correlation-Based
| Method | Code | Description | Best For |
|---|---|---|---|
| Correlation | `"correlation"` | Correlation with target | Quick univariate filter |

Information-Theoretic
| Method | Code | Description | Best For |
|---|---|---|---|
| Information Gain | `"information_gain"` | Entropy-based importance | Mixed feature types |
| Gain Ratio | `"gain_ratio"` | Normalized information gain | Avoiding bias toward high-cardinality |
| mRMR | `"mrmr"` | Min Redundancy Max Relevance | Reducing feature redundancy |
| CMIM | `"cmim"` | Conditional Mutual Info Max | Complex feature dependencies |
| JMIM | `"jmim"` | Joint Mutual Info Max | Capturing feature interactions |
| JMI | `"jmi"` | Joint Mutual Information | Similar to JMIM |

Model-Based Importance
| Method | Code | Description | Best For |
|---|---|---|---|
| AUC | `"auc"` | AUC of univariate models | Classification performance |
| Relief | `"relief"` | Instance-based algorithm | Detecting interactions |
| RF Importance | `"importance"` | Random Forest importance | Non-linear relationships |
| Permutation | `"permutation"` | Permutation importance | Model-agnostic importance |
# Example: Try different filters
learner_anova <- pipeline$create_graph_learner(filter = "anova", model = "ranger", n_features = 20)
learner_mrmr <- pipeline$create_graph_learner(filter = "mrmr", model = "ranger", n_features = 20)
learner_relief <- pipeline$create_graph_learner(filter = "relief", model = "xgboost", n_features = 15)

Use the model parameter in create_graph_learner():
Tree-Based Ensembles
| Model | Code | Description | Strengths |
|---|---|---|---|
| Random Forest | `"ranger"` | Fast RF implementation | Handles interactions, robust, default choice |
| XGBoost | `"xgboost"` | Gradient boosting | High performance, handles missing values |
| LightGBM | `"lightgbm"` | Fast gradient boosting | Very fast, memory efficient |
| Decision Tree | `"rpart"` | Single CART tree | Interpretable, fast |

Linear Models
| Model | Code | Description | Strengths |
|---|---|---|---|
| Elastic Net | `"glmnet"` | L1/L2 regularized regression | Interpretable coefficients, feature selection |
| Logistic Regression | `"log_reg"` | Simple logistic regression | Baseline, highly interpretable |

Distance-Based Methods
| Model | Code | Description | Strengths |
|---|---|---|---|
| SVM | `"svm"` | Support Vector Machine (RBF) | High-dimensional data |
| k-NN | `"kknn"` | K-Nearest Neighbors | Non-parametric, simple |

Probabilistic Classifiers
| Model | Code | Description | Strengths |
|---|---|---|---|
| Naive Bayes | `"naive_bayes"` | Probabilistic classifier | Fast, works with small data |
| LDA | `"lda"` | Linear Discriminant Analysis | Dimensionality reduction |
| QDA | `"qda"` | Quadratic Discriminant Analysis | Non-linear decision boundaries |

Neural Networks
| Model | Code | Description | Strengths |
|---|---|---|---|
| Neural Net | `"nnet"` | Single-layer neural network | Non-linear relationships |
# Example: Compare models
learner_rf <- pipeline$create_graph_learner(filter = "anova", model = "ranger", n_features = 15)
learner_xgb <- pipeline$create_graph_learner(filter = "anova", model = "xgboost", n_features = 15)
learner_glmnet <- pipeline$create_graph_learner(filter = "anova", model = "glmnet", n_features = 15)

OmicSelector is built on the mlr3 ecosystem. You can use any mlr3 learner directly with BenchmarkService:
library(mlr3learners)
# Get the task from the pipeline
task <- pipeline$get_task()
# Create any mlr3 learner with probability predictions
xgboost_learner <- lrn("classif.xgboost", predict_type = "prob")
nnet_learner <- lrn("classif.nnet", predict_type = "prob")
# Wrap in a graph with feature selection (optional)
library(mlr3pipelines)
library(mlr3filters)
xgb_graph <- po("filter", filter = flt("anova"), filter.nfeat = 20) %>>%
po("learner", xgboost_learner)
# Convert to GraphLearner
xgb_graph_learner <- as_learner(xgb_graph)
xgb_graph_learner$id <- "xgboost_anova_20"
# Use in benchmark
benchmark <- BenchmarkService$new(task = pipeline, outer_folds = 5, seed = 42)
benchmark$add_learner(xgb_graph_learner)
result <- benchmark$run()

See the mlr3learners documentation for all available learners.
learner <- pipeline$create_graph_learner(
# Feature Selection
filter = "anova", # Filter method (see table above)
n_features = 20, # Number of features to select
# Use decimal < 1 for proportion (e.g., 0.1 = top 10%)
# Preprocessing
impute_method = "median", # Handle missing values: "median", "mean", "sample", "pmm"
scale = TRUE, # Scale features to zero mean, unit variance
# Class Imbalance (applied INSIDE each CV fold)
oversample = NULL, # NULL (none), "smote", or "rose"
# Batch Effect Correction
batch_correct = FALSE # TRUE to apply FrozenComBat (requires batch column)
)

For complex missing data patterns, use PMM imputation via the mice package:
# PMM uses predictive models to impute missing values
# Best for: MAR (Missing At Random) data, preserving distributions
learner <- pipeline$create_graph_learner(
filter = "anova",
model = "ranger",
n_features = 20,
impute_method = "pmm" # Predictive Mean Matching via mice
)
# PMM advantages:
# - Preserves data distribution (unlike mean imputation)
# - Handles complex correlation structures
# - Produces plausible values from observed data
# - Applied properly inside each CV fold to prevent leakage
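For intuition, here is PMM on its own, outside the pipeline (toy data; OmicSelector applies the equivalent step inside each CV fold, so you normally never call mice directly):

```r
# Standalone PMM illustration with the mice package (toy data).
library(mice)
set.seed(7)
df <- data.frame(x = rnorm(50), y = rnorm(50))
df$x[sample(50, 5)] <- NA                    # knock out a few values
imp <- mice(df, method = "pmm", m = 1, printFlag = FALSE)
complete(imp)$x                              # imputed values are borrowed from
                                             # observed "donor" rows
```

# Create benchmark service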
benchmark <- BenchmarkService$new(pipeline, outer_folds = 5, inner_folds = 3, seed = 42)
# Add multiple learners with different configurations
benchmark$add_learner(
pipeline$create_graph_learner(filter = "anova", model = "ranger", n_features = 10),
id = "anova_rf_10"
)
benchmark$add_learner(
pipeline$create_graph_learner(filter = "anova", model = "ranger", n_features = 20),
id = "anova_rf_20"
)
benchmark$add_learner(
pipeline$create_graph_learner(filter = "mrmr", model = "glmnet", n_features = 15),
id = "mrmr_glmnet_15"
)
benchmark$add_learner(
pipeline$create_graph_learner(filter = "variance", model = "svm", n_features = 25),
id = "var_svm_25"
)
# Run all configurations
result <- benchmark$run()
# Compare performance
print(result$performance)
#> learner_id classif.auc classif.acc classif.bbrier
#> 1 anova_rf_10 0.892 0.840 0.112
#> 2 anova_rf_20 0.901 0.850 0.098
#> 3 mrmr_glmnet_15 0.885 0.830 0.121
#> 4 var_svm_25 0.878 0.820 0.134

OmicSelector 2.0 includes state-of-the-art methods validated on TCGA data:
Traditional filters (ANOVA, variance) miss biologically important features that are "off" in one condition but expressed in another. GOF filters detect these patterns:
library(OmicSelector)
# Use KS filter - captures distributional differences including zero-inflation
learner <- pipeline$create_graph_learner(
filter = "gof_ks", # Kolmogorov-Smirnov GOF filter
model = "ranger",
n_features = 30
)
# Other GOF options:
# filter = "hurdle" # Two-part model (zero-frequency + magnitude)
# filter = "zero_prop" # Simple zero-proportion difference
# GOF filters discovered miR-200c/miR-141 family in TCGA kidney data
# that ANOVA missed - these are established EMT regulators in cancer
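To see why a distributional test catches what a mean test misses, consider a toy zero-inflated feature (simulated, base R only; illustrative numbers): the group means nearly coincide, so a t-test may look unremarkable, while the KS test flags the shape difference.

```r
# Toy zero-inflation example: similar means, very different shapes.
set.seed(1)
case    <- ifelse(runif(200) < 0.6, 0, rlnorm(200, meanlog = 2))  # 60% zeros
control <- rlnorm(200, meanlog = 1)                               # always "on"
t.test(case, control)$p.value                     # means close: weak signal
suppressWarnings(ks.test(case, control))$p.value  # shapes differ: strong signal
```

Replace grid search with intelligent Bayesian optimization: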
# Bayesian-tuned glmnet with omics-optimized search space
autotuner <- make_autotuner_glmnet(
task,
n_evals = 20, # Budget of evaluations
inner_folds = 3 # Inner CV for tuning
)
autotuner$train(task)
# Get optimal parameters
params <- get_optimal_params(autotuner)
# Returns: alpha, s (lambda) optimized for your data
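make_autotuner_glmnet() wraps a standard mlr3tuning AutoTuner. For learners without a ready-made helper, an equivalent tuner can be assembled by hand; in the sketch below the search-space bounds are illustrative assumptions, not the package's tuned defaults.

```r
# Hand-rolled Bayesian AutoTuner via mlr3tuning + mlr3mbo (bounds illustrative).
library(mlr3tuning)
library(mlr3learners)
library(mlr3mbo)    # provides tnr("mbo")
library(paradox)

at <- auto_tuner(
  tuner        = tnr("mbo"),
  learner      = lrn("classif.glmnet", predict_type = "prob"),
  resampling   = rsmp("cv", folds = 3),
  measure      = msr("classif.auc"),
  search_space = ps(
    alpha = p_dbl(0, 1),                      # L1/L2 mixing
    s     = p_dbl(1e-4, 1, logscale = TRUE)   # lambda used at prediction
  ),
  term_evals   = 20
)
at$train(task)
```

Get SHAP-based interpretability with automatic warnings for correlated features: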
# Train a model
learner <- lrn("classif.ranger", predict_type = "prob", importance = "impurity")
learner$train(task)
# Run XAI pipeline
xai <- xai_pipeline(
learner = learner,
task = task,
top_k = 20,
cor_threshold = 0.7 # Warn if features correlated > 0.7
)
# View results
print(xai$top_features) # Top features by permutation importance
print(xai$correlations$clusters) # Correlated feature clusters
plot_xai_importance(xai) # Visualization
# Warnings prevent SHAP misinterpretation when features are collinear

Find reproducible biomarkers across resamples:
# Create stability ensemble with multiple filters
ensemble <- create_stability_ensemble(
preset = "default", # Uses mRMR, AUC, Information Gain
n_bootstrap = 100, # 100 bootstrap iterations
n_features = 30
)
# Fit and get stable features
ensemble$fit(task, seed = 42)
stable <- ensemble$get_feature_importance(30)
# Features with >90% selection frequency are highly reproducible
# TCGA validation found 6 biomarkers with 100% stability

Proper SMOTE application inside CV folds:
# Balance classes with SMOTE (applied inside each fold)
task_balanced <- smote_augment(task, ratio = 1.0, k = 5)
# Validate synthetic data quality
validation <- validate_synthetic(real_data, synthetic_data)
print(validation$quality_score) # 0-1, higher is better

# Compute Nogueira Stability Index with Bootstrap Confidence Intervals
stability <- compute_stability_from_resample(result$benchmark_result)
print(stability)
#> Nogueira Stability Index: 0.823
#> Interpretation: Good stability
# Bootstrap CI for uncertainty quantification
stability_ci <- compute_nogueira_stability(
feature_matrix,
n_bootstrap = 1000, # Number of bootstrap iterations
conf_level = 0.95 # 95% confidence interval
)
print(stability_ci)
#> Nogueira Stability: 0.823
#> 95% Bootstrap CI: [0.756, 0.889]
#> Interpretation: Good stability (CI does not include 0.7 threshold)
# Get features selected in ALL folds (consensus)
consensus_features <- stability$consensus_features
print(consensus_features)
#> [1] "gene_1" "gene_3" "gene_5"
# Get selection frequency per feature
head(stability$selection_frequency)
#> gene_1 gene_3 gene_5 gene_2 gene_4
#> 1.00 1.00 1.00 0.80 0.60
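For reference, the Nogueira index is only a few lines of arithmetic. The helper below is a standalone sketch (hypothetical function, not the package's implementation) computing it from a binary selection matrix with resamples as rows and features as columns.

```r
# Nogueira (2018) stability from a binary selection matrix Z (folds x features).
nogueira_index <- function(Z) {
  M    <- nrow(Z); p <- ncol(Z)
  phat <- colMeans(Z)                       # selection frequency per feature
  s2   <- M / (M - 1) * phat * (1 - phat)   # unbiased per-feature variance
  kbar <- mean(rowSums(Z))                  # average signature size
  1 - mean(s2) / ((kbar / p) * (1 - kbar / p))
}

Z <- rbind(c(1, 1, 1, 0, 0),
           c(1, 1, 0, 1, 0),
           c(1, 1, 1, 0, 0))                # toy run: 3 folds, 5 features
nogueira_index(Z)                           # ~0.44: only moderate stability
```

# Select best signature balancing multiple objectives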
best <- select_best_signature(
result,
mode = "weighted", # Selection mode
stability_weight = 0.3, # Weight for stability (vs performance)
parsimony_weight = 0.1 # Weight for fewer features
)
# Available modes:
# - "best_auc": Highest AUC only
# - "best_stability": Highest Nogueira index only
# - "weighted": Weighted combination (default)
# - "pareto": Pareto-optimal with distance-to-utopia# After getting predictions from your model
probabilities <- predict(model, newdata = test_data, type = "prob")
true_labels <- test_data$outcome
# Compute calibration metrics
ece <- compute_ece(probabilities, true_labels)
brier <- decompose_brier(probabilities, true_labels)
# Fit calibration (Platt scaling or isotonic regression)
calibrator <- fit_platt_scaling(probabilities, true_labels)
# or: calibrator <- fit_isotonic_calibration(probabilities, true_labels)
# Apply to new predictions (calibrator is a function)
calibrated_probs <- calibrator(new_probabilities)
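For intuition, Platt scaling is simply a logistic regression refit on the model's logit-transformed scores. The sketch below assumes 0/1 labels and is illustrative, not necessarily the package's exact implementation.

```r
# Minimal Platt-scaling sketch: logistic regression on logit scores.
platt_sketch <- function(probs, y01) {
  eps <- 1e-6
  z   <- qlogis(pmin(pmax(probs, eps), 1 - eps))   # clamp, then logit
  fit <- glm(y01 ~ z, family = binomial)
  function(p_new) {
    z_new <- qlogis(pmin(pmax(p_new, eps), 1 - eps))
    as.numeric(predict(fit, data.frame(z = z_new), type = "response"))
  }
}
```

# Include batch information in pipeline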
pipeline <- OmicPipeline$new(
data = my_data,
target = "outcome",
batch = "batch_id" # Column containing batch labels
)
# Enable batch correction (parameters fit on training data only)
learner <- pipeline$create_graph_learner(
filter = "anova",
model = "ranger",
n_features = 20,
batch_correct = TRUE # Applies FrozenComBat inside CV
)

# Create explainer for a trained model
explainer <- create_explainer(trained_learner, task)
# Compute SHAP values
shap <- shap_values(explainer, new_data)
# Check for problematic correlations
# (SHAP can be misleading when features are highly correlated)
correlations <- check_feature_correlations(
data = my_data,
features = selected_features,
threshold = 0.7 # Warn if correlation > 0.7
)

# Create separate pipelines for each modality
rna_pipeline <- OmicPipeline$new(rna_data, target = outcome)
mirna_pipeline <- OmicPipeline$new(mirna_data, target = outcome)
# Create stacker
stacker <- MultiOmicsStacker$new(
modalities = list(
rna = rna_pipeline$get_task(),
mirna = mirna_pipeline$get_task()
),
meta_learner = mlr3::lrn("classif.glmnet")
)
# Train and predict
stacker$train()
predictions <- stacker$predict(new_data)

OmicSelector includes rigorous QC checks for clinical-grade biomarker discovery:
# Run comprehensive QC validation
qc_results <- project$run_qc_validation()
# QC checks include:
# - Sample size adequacy (minimum samples per class)
# - Dimensionality ratio (features vs samples)
# - Events Per Variable (EPV) for model stability
# - Missing data patterns
# - Feature quality (variance, zero-inflation)
print(qc_results)
#> QC Validation Results:
#> sample_size: PASS (n=200, min_class=87)
#> dimensionality: WARN (p/n ratio = 2.5, high dimensionality)
#> epv: PASS (EPV = 8.7, adequate for selected features)
#> missing_data: PASS (0.3% missing, handled with PMM)
#> feature_quality: PASS (no constant features detected)
# Warnings guide corrective action without blocking workflow
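The EPV check is plain arithmetic worth knowing: events (size of the smaller class) divided by the number of candidate predictors, with values below roughly 10 commonly flagged as risky. The numbers below mirror the QC output above.

```r
# EPV arithmetic behind the check (numbers mirror the example QC output).
n_events     <- 87         # smaller class in the QC example
n_predictors <- 10         # features entering the model
n_events / n_predictors    # 8.7; below ~10 warrants caution
```

Generate clinical-ready Model Cards for stakeholder communication: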
# After training a final model
model_card <- project$generate_model_card(
model_name = "Glioblastoma miRNA Classifier",
intended_use = "Diagnostic aid for glioblastoma classification",
target_population = "Adult patients with suspected CNS tumors",
clinical_context = "Pre-surgical tumor characterization"
)
# Model Card includes:
# - Performance metrics with confidence intervals
# - Training data characteristics
# - Intended use and limitations
# - Feature stability assessment
# - TRIPOD+AI compliance notes
# Export as HTML for stakeholders
export_model_card(model_card, "model_card.html")

For reproducible batch processing:
# config.yaml
data:
file: "data/expression.csv"
target: "outcome"
positive: "Case"
pipeline:
filter: "anova"
model: "ranger"
n_features: 20
scale: true
impute: "median"
benchmark:
outer_folds: 5
inner_folds: 3
seed: 42
stratify: true
output:
dir: "./results"
report_format: "html"

Run via CLI:
Rscript inst/bin/omicselector run --config=config.yaml

# Build image
docker build -f Dockerfile.core -t omicselector:2.0 .
# Run interactive R session
docker run -it --rm -v $(pwd):/workspace omicselector:2.0 R
# Run analysis with config
docker run -it --rm -v $(pwd):/workspace omicselector:2.0 \
Rscript inst/bin/omicselector run --config=/workspace/config.yaml

| Module | Description |
|---|---|
| `OmicPipeline` | Build mlr3 graphs with zero-leakage preprocessing |
| `BenchmarkService` | Nested CV with proper inner/outer loop separation |
| `select_best_signature()` | Multi-objective signature selection |
| `compute_nogueira_stability()` | Feature selection stability metrics |

| Module | Description |
|---|---|
| `FrozenComBat` | Batch correction with frozen parameters |
| `fit_platt_scaling()` | Platt calibration |
| `fit_isotonic_calibration()` | Isotonic regression calibration |

| Module | Description |
|---|---|
| `FilterGOF_KS` | Kolmogorov-Smirnov filter for sparse data |
| `FilterHurdle` | Two-part hurdle model filter |
| `FilterZeroProp` | Zero-proportion difference filter |
| `make_autotuner_glmnet()` | Bayesian-tuned elastic net |
| `make_autotuner_xgboost()` | Bayesian-tuned XGBoost |
| `make_autotuner_ranger()` | Bayesian-tuned Random Forest |
| `xai_pipeline()` | DALEX-based interpretability with correlation warnings |
| `plot_xai_importance()` | Feature importance visualization |
| `create_stability_ensemble()` | Bootstrap stability feature selection |
| `smote_augment()` | SMOTE for class imbalance |
| `validate_synthetic()` | Synthetic data quality validation |

| Module | Description |
|---|---|
| `validate_data_quality()` | Sample size, dimensionality, EPV checks |
| `generate_model_card()` | Clinical-ready Model Card for stakeholders |
| `generate_tripod_report()` | TRIPOD+AI compliant reporting |

| Module | Description |
|---|---|
| `MultiOmicsStacker` | Late integration of multi-omics data |
| `export_vetiver()` | Model export for deployment |
| `export_bundle()` | Self-contained model bundle with metadata |
| `export_onnx()` | ONNX export for cross-platform deployment |
@article{stawiski2022omicselector,
title={OmicSelector: automatic feature selection and deep learning
modeling for omic experiments},
author={Stawiski, Konrad and Kaszkowiak, Marcin and Mikulski, Damian and others},
journal={bioRxiv},
year={2022},
doi={10.1101/2022.06.01.494299}
}

- Konrad Stawiski, M.D., Ph.D. (konrad.stawiski@umed.lodz.pl)
- Marcin Kaszkowiak, M.D.
- Damian Mikulski, M.D.
Supervised by: Prof. Wojciech Fendler, M.D., Ph.D.
Department of Biostatistics and Translational Medicine, Medical University of Lodz, Poland