Description
We need a preprocessing pipeline that assesses data quality and cleans the data before the predictor is run. Currently there is no such mechanism in place. The pipeline would run the following operations:
- Identify structure in the missingness of the data.
- Identify and flag outlier samples.
- Run unsupervised analyses on the samples, e.g., PCA and hierarchical clustering.
- For continuous-valued data, compare several similarity metrics to find the one that best separates the classes, e.g., RNAcorr.R, written by SP for PanCancer.
- Run hierarchical clustering and PCA on the classes, following the same idea.
- Run univariate tests to prune the matrix of variables that goes into netDx.
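
To make the scope concrete, here is a rough sketch of three of the steps listed above (missingness summary, outlier flagging, univariate pruning). It is written in plain Python purely for illustration and is not netDx code or its API. Everything in it is an assumption: the toy `DATA` and `LABELS`, the function names, the median/MAD modified z-score with a 3.5 cutoff for outliers, and a Welch t-statistic with an arbitrary cutoff of 2.0 as the univariate test. The real pipeline may well use different tests and thresholds.

```python
import math
from statistics import mean, median, variance

# Hypothetical toy data: sample ID -> variable values; None marks a missing value.
DATA = {
    "s1": [1.0, 5.0, 2.0],
    "s2": [1.2, None, 2.1],
    "s3": [0.9, 5.2, 1.9],
    "s4": [3.0, 5.1, 2.0],
    "s5": [3.2, 4.9, None],
    "s6": [2.9, 5.0, 9.0],
}
LABELS = {"s1": "A", "s2": "A", "s3": "A", "s4": "B", "s5": "B", "s6": "B"}


def missing_by_variable(data):
    """Fraction of samples with a missing value, per variable (column)."""
    rows = list(data.values())
    return [sum(r[j] is None for r in rows) / len(rows) for j in range(len(rows[0]))]


def missing_by_sample(data):
    """Fraction of variables that are missing, per sample (row)."""
    return {sid: sum(v is None for v in row) / len(row) for sid, row in data.items()}


def flag_outlier_samples(data, threshold=3.5):
    """Flag samples whose modified z-score (median/MAD) exceeds threshold in any variable."""
    n_vars = len(next(iter(data.values())))
    flagged = set()
    for j in range(n_vars):
        observed = {sid: row[j] for sid, row in data.items() if row[j] is not None}
        if not observed:
            continue  # variable is entirely missing
        med = median(observed.values())
        mad = median(abs(x - med) for x in observed.values())
        if mad == 0:
            continue  # no spread around the median; score is undefined
        for sid, x in observed.items():
            if abs(0.6745 * (x - med) / mad) > threshold:
                flagged.add(sid)
    return sorted(flagged)


def welch_t(a, b):
    """Welch's t-statistic for two groups; 0.0 if either group has fewer than 2 values."""
    if len(a) < 2 or len(b) < 2:
        return 0.0
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def prune_variables(data, labels, group_a, group_b, min_abs_t=2.0):
    """Return indices of variables whose |t| between the two classes is at least min_abs_t."""
    n_vars = len(next(iter(data.values())))
    kept = []
    for j in range(n_vars):
        a = [r[j] for s, r in data.items() if labels[s] == group_a and r[j] is not None]
        b = [r[j] for s, r in data.items() if labels[s] == group_b and r[j] is not None]
        if abs(welch_t(a, b)) >= min_abs_t:
            kept.append(j)
    return kept


if __name__ == "__main__":
    print("missing per variable:", missing_by_variable(DATA))
    print("missing per sample:", missing_by_sample(DATA))
    print("outlier samples:", flag_outlier_samples(DATA))
    print("variables kept:", prune_variables(DATA, LABELS, "A", "B"))
```

Each step is a separate function so that steps can be run, skipped, or reordered independently, e.g., dropping flagged samples before pruning. A p-value cutoff with multiple-testing correction would probably be a better pruning criterion than a raw t-statistic threshold; the statistic is used here only to keep the sketch dependency-free.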