kearseya/RNAseq_ML

RNAseq_ML

Investigating methods to undertake feature selection and reduction on RNA-seq data.

Data from GEO series GSE54460: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54460
FPKM gene table (direct download): https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE54460&format=file&file=GSE54460%5FFPKM%2Dgenes%2DTopHat2%2D106samples%2D12%2D4%2D13%2Etxt%2Egz

Most Gleason score 7 samples (70 of 80, leaving 10 among the remaining 35 samples) have been temporarily removed, as they were causing bias in a dataset with 23,281 features.

Preliminary data generated from manual methods script

Draft scripts for finding the best feature selection methods for RNA-seq data (using the scikit-learn package). To run:
$ python3 test_script.py

Methods include:

  • Removing low variance
  • Univariate filter
  • Highly correlated features filter
  • Low target correlation filter
  • Recursive feature elimination
  • Feature selection from model
  • Tree based selection
  • L1-based selection
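
Several of the methods above map onto scikit-learn selectors. The following is a minimal sketch, not the code in test_script.py: the data is synthetic (`make_classification` stands in for the expression matrix), and the thresholds, `k`, and the estimators are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import (
    RFE,
    SelectFromModel,
    SelectKBest,
    VarianceThreshold,
    f_classif,
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic stand-in for a samples x genes matrix (assumption, not the GEO data).
X, y = make_classification(
    n_samples=60, n_features=50, n_informative=5, n_redundant=5, random_state=0
)

# One selector per method; all parameter values are illustrative.
selectors = {
    "low_variance": VarianceThreshold(threshold=0.1),
    "univariate": SelectKBest(score_func=f_classif, k=10),
    "rfe": RFE(LogisticRegression(max_iter=1000), n_features_to_select=10),
    "tree_based": SelectFromModel(
        ExtraTreesClassifier(n_estimators=100, random_state=0)
    ),
    "l1_based": SelectFromModel(
        LinearSVC(C=0.1, penalty="l1", dual=False, max_iter=5000)
    ),
}

kept = {}
for name, selector in selectors.items():
    selector.fit(X, y)
    # get_support() returns a boolean mask over the input columns.
    kept[name] = np.flatnonzero(selector.get_support())
    print(f"{name}: kept {kept[name].size} of {X.shape[1]} features")
```

Both the tree-based and the L1-based entries use `SelectFromModel`, which keeps features whose importance or coefficient magnitude exceeds a threshold.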

PCA is also performed.
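
As a rough illustration of the PCA step, the sketch below standardises a synthetic matrix and projects it onto a few components. The data, the scaling step, and the choice of five components are assumptions, not taken from the repository scripts.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a samples x genes matrix (assumption).
X, _ = make_classification(n_samples=60, n_features=50, random_state=0)

# Standardise each feature so high-magnitude features do not dominate.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first five principal components (illustrative choice).
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)
print(pca.explained_variance_ratio_.round(3))
```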

Multilabel confusion matrix (normalised for true data):
MlCM
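
A confusion matrix normalised over the true labels can be computed as sketched below. This assumes that "normalised for true data" means each row (true class) sums to 1; the labels and predictions are made up for illustration and are not results from this project.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical Gleason scores: true labels and model predictions.
labels = [6, 7, 8, 9]
y_true = [6, 6, 7, 7, 7, 7, 8, 8, 9, 9]
y_pred = [6, 7, 7, 7, 7, 6, 8, 7, 9, 8]

# normalize="true" divides each row by the number of true samples in that class.
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")

print(np.round(cm, 2))
```

Each row then reads as the fraction of samples of one true class assigned to each predicted class.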

Validation curve:
validation curve example
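
The scores behind a validation curve can be obtained with scikit-learn's `validation_curve`. The sketch below is an assumption about how such a curve might be produced: the estimator, the varied parameter (`max_depth`), and the synthetic data are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=90, n_features=30, n_informative=5, random_state=0
)

param_range = [1, 2, 4, 8]
train_scores, test_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X,
    y,
    param_name="max_depth",
    param_range=param_range,
    cv=3,
)

# Each row holds the per-fold scores for one parameter value.
for depth, train, test in zip(
    param_range, train_scores.mean(axis=1), test_scores.mean(axis=1)
):
    print(f"max_depth={depth}: train={train:.2f}, validation={test:.2f}")
```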

Cross-validation scores are visualised at the end, in the order the feature selection methods were applied: cross validation scores example
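
One way to collect cross-validation scores per selection method is to wrap each selector and a classifier in a pipeline, so that selection is refit inside every fold. This is a sketch under assumptions: the two selectors, the classifier, and the synthetic data are illustrative and may differ from what the scripts do.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=90, n_features=40, n_informative=5, random_state=0
)

methods = {
    "low_variance": VarianceThreshold(threshold=0.1),
    "univariate": SelectKBest(score_func=f_classif, k=10),
}

cv_scores = {}
for name, selector in methods.items():
    # The selector is fitted on the training part of each fold only.
    model = make_pipeline(selector, LogisticRegression(max_iter=1000))
    cv_scores[name] = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {cv_scores[name].mean():.2f}")
```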

For the high correlation filter, a heatmap is generated: heatplot example
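
The correlation-based filters can be sketched with pandas as follows. The column names, the 0.9 and 0.1 cut-offs, and the synthetic data are assumptions; the absolute correlation matrix `corr` is the kind of table a heatmap would display.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic expression table with hypothetical gene names.
df = pd.DataFrame(
    rng.normal(size=(100, 5)), columns=[f"gene_{i}" for i in range(5)]
)
# Make gene_3 almost a copy of gene_0 so the filter has something to remove.
df["gene_3"] = 2.0 * df["gene_0"] + rng.normal(scale=0.01, size=100)

# High correlation filter: inspect each pair once via the upper triangle.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
filtered = df.drop(columns=to_drop)
print("dropped as highly correlated:", to_drop)

# Low target correlation filter: flag features weakly correlated with a target.
target = df["gene_0"] + rng.normal(scale=0.5, size=100)
target_corr = df.corrwith(target).abs()
weak = target_corr[target_corr < 0.1].index.tolist()
print("weakly correlated with target:", weak)
```

Using only the upper triangle keeps the first feature of each correlated pair and drops the later one.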

Feature importance is also extracted and plotted at each step: relative feature importance example
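
Relative feature importance can be derived from a fitted tree ensemble, as in the sketch below. The random forest, the synthetic data, and scaling to the most important feature (set to 100) are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=90, n_features=20, n_informative=5, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; rescale so the top feature is 100.
importances = forest.feature_importances_
relative = 100.0 * importances / importances.max()

# Indices of the five most important features, highest first.
top = np.argsort(relative)[::-1][:5]
for idx in top:
    print(f"feature {idx}: {relative[idx]:.1f}")
```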
