Investigating methods to undertake feature selection and reduction on RNA-seq data.
Data from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54460 https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE54460&format=file&file=GSE54460%5FFPKM%2Dgenes%2DTopHat2%2D106samples%2D12%2D4%2D13%2Etxt%2Egz
A large number of Gleason score 7 samples have been temporarily removed (80/105 -> 10/35), as they were biasing the results (with 23,281 features).
Draft scripts for finding the best feature selection methods for RNA-seq data (using the sklearn package).
To run:
$ python3 test_script.py
Methods include:
- Removing low variance
- Univariate filter
- Highly correlated features filter
- Low target correlation filter
- Recursive feature elimination
- Feature selection from model
- Tree based selection
- L1-based selection
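
As a rough illustration of how several of the methods listed above map onto sklearn, here is a minimal sketch on synthetic data. It is not taken from test_script.py: the data, thresholds, `k`, and the choice of estimators are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (
    RFE,
    SelectFromModel,
    SelectKBest,
    VarianceThreshold,
    f_classif,
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic stand-in for an expression matrix: 60 samples x 200 "genes".
X, y = make_classification(
    n_samples=60, n_features=200, n_informative=10, n_redundant=0, random_state=0
)
X[:, :5] = 1.0  # five constant columns so the variance filter has work to do

# Removing low variance: drop features whose variance is zero.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Univariate filter: keep the 50 features with the highest ANOVA F-score.
X_uni = SelectKBest(score_func=f_classif, k=50).fit_transform(X_var, y)

# Recursive feature elimination: repeatedly refit and drop the 5 weakest
# features until 20 remain (the estimator choice is an assumption).
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=20, step=5)
X_rfe = rfe.fit_transform(X_uni, y)

# Tree based selection: keep features whose random forest importance is at
# least the median importance.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
X_tree = SelectFromModel(forest, threshold="median").fit_transform(X_uni, y)

# L1-based selection: keep features with a non-zero coefficient in a sparse
# linear model.
l1_model = LinearSVC(C=0.1, penalty="l1", dual=False, max_iter=5000, random_state=0)
X_l1 = SelectFromModel(l1_model).fit_transform(X_uni, y)

print(X.shape, X_var.shape, X_uni.shape, X_rfe.shape, X_tree.shape, X_l1.shape)
```

Each selector follows the same `fit` / `transform` interface, so the steps can be chained or compared one at a time.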
PCA is also performed.
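
A minimal PCA sketch follows, assuming the FPKM values are log-transformed and standardised first. That preprocessing, the synthetic matrix, and the number of components are assumptions for illustration and may differ from what the script does.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic non-negative values standing in for an FPKM matrix
# (35 samples x 500 genes); the real data has far more features.
rng = np.random.default_rng(0)
fpkm = rng.gamma(shape=2.0, scale=5.0, size=(35, 500))

# log2(x + 1) compresses the long right tail typical of expression values.
log_expr = np.log2(fpkm + 1.0)

# Standardise each gene so high-abundance genes do not dominate the components.
scaled = StandardScaler().fit_transform(log_expr)

# Project the samples onto the first 10 principal components.
pca = PCA(n_components=10, random_state=0)
components = pca.fit_transform(scaled)
explained = pca.explained_variance_ratio_

print(components.shape)
print(explained.round(3))
```

The `explained` array gives the fraction of variance captured by each component, which is useful when deciding how many components to keep.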
Multilabel confusion matrix (normalised for true data):

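One way to obtain a confusion matrix normalised over the true labels is sklearn's `confusion_matrix` with `normalize="true"`, so that each row sums to 1. The labels below are made-up Gleason scores for illustration only, and the script may compute the matrix differently.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted Gleason scores.
y_true = [6, 6, 7, 7, 7, 8]
y_pred = [6, 7, 7, 7, 8, 8]

# normalize="true" divides each row by the number of samples with that
# true label, so entry (i, j) is the fraction of class i predicted as j.
cm = confusion_matrix(y_true, y_pred, labels=[6, 7, 8], normalize="true")
print(cm)
```
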
Feature cross validation scores are visualised in order of method used for feature selection at the end:

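A sketch of how cross validation scores could be collected per selection method is shown below. Putting the selector inside a pipeline means it is refit on each training fold, which avoids leaking test-fold information into the selection. The data, the two selectors shown, the classifier, and the fold count are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 60 samples x 200 features.
X, y = make_classification(
    n_samples=60, n_features=200, n_informative=10, random_state=0
)

# Selection methods to compare, in the order they would be plotted.
selectors = {
    "low variance": VarianceThreshold(threshold=0.0),
    "univariate": SelectKBest(score_func=f_classif, k=20),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {}
for name, selector in selectors.items():
    # The selector is fitted only on the training part of each fold.
    model = make_pipeline(selector, StandardScaler(), LogisticRegression(max_iter=1000))
    cv_scores[name] = cross_val_score(model, X, y, cv=cv).mean()

for name, score in cv_scores.items():
    print(f"{name}: mean accuracy {score:.3f}")
```
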
For the high correlation filter, a heatmap is generated:

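The correlation matrix behind such a heatmap, and the filter itself, could look roughly like the sketch below. The function name `drop_highly_correlated` and the 0.9 threshold are hypothetical and not taken from the script.

```python
import numpy as np


def drop_highly_correlated(X, threshold=0.9):
    """Drop each column whose absolute Pearson correlation with any
    earlier column exceeds `threshold`.

    Returns the reduced matrix, the kept column indices, and the full
    absolute correlation matrix (the values a heatmap would show).
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    # Look only above the diagonal so each pair is considered once and
    # the earlier column of a correlated pair is the one retained.
    upper = np.triu(corr, k=1)
    to_drop = np.where((upper > threshold).any(axis=0))[0]
    keep = np.setdiff1d(np.arange(X.shape[1]), to_drop)
    return X[:, keep], keep, corr


# Toy data: column 1 is almost a rescaled copy of column 0.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
X = np.column_stack([a, 2.0 * a + 0.01 * rng.normal(size=100), b])

X_reduced, keep, corr = drop_highly_correlated(X, threshold=0.9)
print(keep, X_reduced.shape)
```

Note that a column is dropped even if the earlier column it correlates with was itself dropped; this keeps the sketch simple but can be slightly more aggressive than necessary.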
Feature importance is also extracted and plotted at each step:

