
draftd01/EC-Survival-Prediction

Endometrial Carcinoma (EC) Survival Prediction

Summary

This final project for the NYU Advanced Integrative Omics course used a multi-omic approach to predict EC patient survival from normalized RNA read counts together with the relative abundance of proteins and phosphosites identified by tandem mass tag (TMT) mass spectrometry, measured as z-statistics and adjusted log2-ratios, respectively. Dimensionality reduction of the omics data was performed using variance thresholding followed by one of two methods: Lasso with cross-validation (LassoCV) or recursive feature elimination with cross-validation (RFECV).

Methods

Individual dataset preprocessing

  • Train-test split: Progression-free survival (PFS) data were split 80-20, and only samples with corresponding PFS data were retained
  • Sparse column removal: Only genes that were measured in at least 75% of samples were retained
  • Data imputation: Remaining missing values were imputed with a K-Nearest Neighbors (KNN) imputer, with the number of neighbors set to 3, fit on the training dataset and applied to both the training and testing datasets
  • Variance Thresholding: The top N most variable columns of each training dataset were selected and used to filter the corresponding testing dataset
    • RNA: n=5000
    • Protein: n=6000
    • Phosphosite: n=5000
  • Scaling: A standard scaler was fit to the training dataset
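The preprocessing steps above can be sketched as follows, assuming scikit-learn and pandas. The `preprocess` function, its argument names, the synthetic data, and the random seed are hypothetical and are not taken from the project code. The sketch also assumes that the 75% coverage filter is computed on the training split and that the fitted scaler is applied to the testing data, neither of which is stated above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess(X, y, top_n, min_measured=0.75, seed=0):
    """Hypothetical helper: split, filter, impute, variance-select, and scale."""
    # Keep only samples that have a PFS value.
    common = X.index.intersection(y.dropna().index)
    X, y = X.loc[common], y.loc[common]

    # 80-20 train-test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )

    # Sparse column removal (assumption: coverage is measured on the training split).
    measured = X_train.notna().mean()
    keep = measured[measured >= min_measured].index
    X_train, X_test = X_train[keep], X_test[keep]

    # KNN imputation: fit on training data only, then apply to both splits.
    imputer = KNNImputer(n_neighbors=3)
    X_train = pd.DataFrame(
        imputer.fit_transform(X_train), index=X_train.index, columns=keep
    )
    X_test = pd.DataFrame(
        imputer.transform(X_test), index=X_test.index, columns=keep
    )

    # Variance thresholding: top N most variable training columns.
    top = X_train.var().nlargest(top_n).index

    # Scaling (assumption: the training-fit scaler is also applied to the test split).
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train[top])
    X_test = scaler.transform(X_test[top])
    return X_train, X_test, y_train.to_numpy(), y_test.to_numpy(), list(top)


# Synthetic stand-in data: 50 samples, 30 features, about 5% missing values.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(50, 30)), columns=[f"gene_{i}" for i in range(30)]
)
X = X.mask(rng.random(X.shape) < 0.05)
X.loc[X.index[:40], "gene_0"] = np.nan  # a sparse column that should be dropped
y = pd.Series(rng.normal(size=50), index=X.index)

X_train, X_test, y_train, y_test, genes = preprocess(X, y, top_n=10)
print(X_train.shape, X_test.shape)
```

Fitting the imputer, the variance ranking, and the scaler on the training split only keeps information from the testing samples out of the feature selection.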

LassoCV


LassoCV was applied to the training dataset with parameters cv=5 and max_iter=5000. The fitted LassoCV object was then applied to the preprocessed testing dataset to generate root mean square error (RMSE), $R^2$, and c-index statistics.
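A minimal sketch of this step with scikit-learn, run on synthetic data rather than the project data. The README does not say how the c-index was computed, so the `concordance_index` helper below is a hypothetical pairwise (Harrell-style) implementation that assumes a higher prediction means longer PFS and treats every sample as an observed event unless an event indicator is passed.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error, r2_score


def concordance_index(time, pred, event=None):
    """Hypothetical helper: fraction of comparable pairs ranked in the same order."""
    time = np.asarray(time, dtype=float)
    pred = np.asarray(pred, dtype=float)
    if event is None:
        event = np.ones(len(time), dtype=bool)  # assumption: no censoring
    else:
        event = np.asarray(event, dtype=bool)

    concordant, comparable = 0.0, 0
    for i in range(len(time)):
        if not event[i]:
            continue  # a censored sample cannot be the earlier member of a pair
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if pred[i] < pred[j]:
                    concordant += 1.0
                elif pred[i] == pred[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable


# Synthetic stand-in for the preprocessed training and testing matrices.
rng = np.random.default_rng(0)
coef = np.zeros(20)
coef[:3] = [3.0, -2.0, 1.5]  # only the first three features carry signal
X_train = rng.normal(size=(80, 20))
X_test = rng.normal(size=(20, 20))
y_train = X_train @ coef + rng.normal(scale=0.1, size=80)
y_test = X_test @ coef + rng.normal(scale=0.1, size=20)

# Parameters as stated above: cv=5, max_iter=5000.
model = LassoCV(cv=5, max_iter=5000).fit(X_train, y_train)
pred = model.predict(X_test)

rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = r2_score(y_test, pred)
c_index = concordance_index(y_test, pred)
selected = np.flatnonzero(model.coef_)  # features with nonzero coefficients

print(f"RMSE={rmse:.3f}  R2={r2:.3f}  c-index={c_index:.3f}")
print("selected feature indices:", selected)
```

With real PFS data that includes censored patients, the event indicator would need to be passed to the c-index calculation; the regression metrics here treat PFS as a plain continuous target.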

RFECV


An adjusted range of variance cutoffs was used, starting at 20 and increasing in increments of 40 until the results converged on a fixed set of genes.

RFECV was applied to the training dataset with the following parameters:

  • cv=5
  • estimator=Lasso()
  • scoring='r2'
  • step=10

The fitted RFECV object was then applied to the preprocessed testing dataset to generate root mean square error (RMSE), $R^2$, and c-index statistics.
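The RFECV configuration listed above can be sketched with scikit-learn as follows. The data are synthetic and the variable names are hypothetical; only the four RFECV parameters come from this README. `Lasso()` is used with its default settings because no other settings are given.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the preprocessed training and testing matrices.
rng = np.random.default_rng(0)
coef = np.zeros(60)
coef[:3] = [5.0, -4.0, 3.0]  # only the first three features carry signal
X_train = rng.normal(size=(100, 60))
X_test = rng.normal(size=(30, 60))
y_train = X_train @ coef + rng.normal(scale=0.5, size=100)
y_test = X_test @ coef + rng.normal(scale=0.5, size=30)

# Parameters as listed above; step=10 drops ten features per elimination round.
selector = RFECV(estimator=Lasso(), cv=5, scoring="r2", step=10)
selector.fit(X_train, y_train)

# predict() reduces X_test to the selected features and uses the refit estimator.
pred = selector.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = r2_score(y_test, pred)

print("features kept:", selector.n_features_)
print(f"RMSE={rmse:.3f}  R2={r2:.3f}")
```

The boolean mask `selector.support_` marks the retained columns and can be used to recover the corresponding gene names from the training table.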

Results

Data

  1. Normalized read counts for 24,595 genes (cols) across 135 tumor and 26 normal adjacent tissue (NAT) samples (rows)
  2. Relative abundance (z-statistic) of 9,600 TMT-identified proteins
  3. Relative abundance (adjusted log2-ratio) of 41,448 phosphosites

The dataset for this project can be found in the PRIDE archive under accession PXD055203.

Reference Paper

Yu J, Gui X, Zou Y, Liu Q, Yang Z, An J, Guo X, Wang K, Guo J, Huang M, Zhou S, Zuo J, Chen Y, Deng L, Yuan G, Li N, Song Y, Jia J, Zeng J, Zhao Y, Liu X, Du X, Liu Y, Wang P, Zhang B, Ding L, Robles AI, Rodriguez H, Zhou H, Shao Z, Wu L, Gao D. A proteogenomic analysis of cervical cancer reveals therapeutic and biological insights. Nat Commun. 2024 Nov 22;15(1):10114. doi: 10.1038/s41467-024-53830-0. PMID: 39578447; PMCID: PMC11584810.
