Endometrial Carcinoma (EC) Survival Prediction

Summary

This final project for the NYU Advanced Integrative Omics course used a multi-omic approach to predict EC patient survival from normalized RNA read counts together with the relative abundance of proteins and phosphosites identified by tandem mass tag (TMT) mass spectrometry, measured as z-statistics and adjusted log2-ratios respectively. Dimensionality reduction of the omics data was performed using variance thresholding followed by one of two methods: Lasso with cross-validation (LassoCV) or recursive feature elimination with cross-validation (RFECV).

Methods

Individual dataset preprocessing

Train-test split: Progression-free survival (PFS) data were split 80-20, and only samples with corresponding PFS data were retained
Sparse column removal: Only genes that were measured in at least 75% of samples were retained
Data imputation: The remaining missing data were imputed with a K-Nearest Neighbors (KNN) imputer, with the number of neighbors set to 3, that was fit on the training dataset and applied to both the training and testing datasets
Variance Thresholding: The top N most variable columns of each training dataset were selected and used to filter the corresponding testing dataset
- RNA: n=5000
- Protein: n=6000
- Phosphosite: n=5000
Scaling: A standard scaler was fit to the training dataset
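
The preprocessing steps above can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `preprocess`, its arguments, and the random seed are hypothetical. It also assumes that the 75% coverage filter is computed on the training split and that the fitted scaler is applied to both splits.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess(X, y, top_n, min_measured=0.75, seed=0):
    """Split, filter, impute, select, and scale one omics matrix (samples x features)."""
    # 80-20 split; samples without PFS data are assumed to be dropped beforehand.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )

    # Keep features measured (non-NaN) in at least 75% of samples.
    # Assumption: the fraction is computed on the training split only.
    measured = np.mean(~np.isnan(X_train), axis=0) >= min_measured
    X_train, X_test = X_train[:, measured], X_test[:, measured]

    # Fit the KNN imputer on the training data, then apply it to both splits.
    imputer = KNNImputer(n_neighbors=3)
    X_train = imputer.fit_transform(X_train)
    X_test = imputer.transform(X_test)

    # Keep the top_n most variable training columns and reuse them for the test split.
    top = np.argsort(X_train.var(axis=0))[::-1][:top_n]
    X_train, X_test = X_train[:, top], X_test[:, top]

    # Fit the scaler on the training data (assumed to be applied to both splits).
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test
```

Fitting the imputer, the variance ranking, and the scaler on the training split alone keeps information from the test samples out of the model.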

LassoCV

LassoCV was applied to the training dataset with parameters cv=5 and max_iter=5000. The fitted LassoCV object was then applied to the preprocessed testing dataset to generate root mean square error (RMSE), $R^2$, and c-index statistics
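
A minimal sketch of this fit-and-evaluate step is shown below. The helper names `concordance_index` and `fit_and_score_lasso` are hypothetical, and the c-index here is a simplified pairwise version that treats every sample as an observed event. The notebook may have computed it differently, for example with a survival library that accounts for censoring.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error, r2_score


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the observed order.

    Simplification (assumption): no censoring, so every pair with distinct
    observed values is comparable.
    """
    concordant, comparable = 0.0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # tied outcomes are not comparable
            comparable += 1
            d_true = y_true[i] - y_true[j]
            d_pred = y_pred[i] - y_pred[j]
            if d_pred == 0:
                concordant += 0.5  # tied predictions count as half
            elif (d_true > 0) == (d_pred > 0):
                concordant += 1.0
    return concordant / comparable if comparable else float("nan")


def fit_and_score_lasso(X_train, y_train, X_test, y_test):
    """Fit LassoCV on the training split and score it on the held-out split."""
    model = LassoCV(cv=5, max_iter=5000).fit(X_train, y_train)
    pred = model.predict(X_test)
    return model, {
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "r2": float(r2_score(y_test, pred)),
        "c_index": concordance_index(y_test, pred),
    }
```

A c-index of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so it complements RMSE and $R^2$ by measuring ranking quality rather than absolute error.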

RFECV

An adjusted range of variance cutoffs was used, starting at 20 and increasing in increments of 40 until the results converged on a fixed set of genes.
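
One way to read this sweep is sketched below. It assumes that "converged" means two consecutive cutoffs yield the same selected feature set. The function `sweep_until_converged` and the `select` callback are hypothetical stand-ins for whatever selection routine is run at each cutoff.

```python
def sweep_until_converged(ranked_features, select, start=20, step=40):
    """Grow the top-N cutoff until select() returns the same set twice in a row.

    ranked_features: feature names sorted from most to least variable.
    select: callable taking a list of candidate names and returning the kept names.
    Returns (cutoff, selected_set), or (None, last_set) if no convergence occurred.
    """
    previous = None
    # The upper bound is padded by one step so the final cutoff covers every feature.
    for n in range(start, len(ranked_features) + step, step):
        chosen = frozenset(select(ranked_features[:n]))
        if chosen == previous:
            return n, chosen
        previous = chosen
    return None, previous
```

For example, `sweep_until_converged(names, my_selector)` tries cutoffs of 20, 60, 100, and so on, where `names` and `my_selector` are placeholders for a variance-ranked feature list and a selection function.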

RFECV was applied to the training dataset with the following parameters:

cv=5
estimator=Lasso()
scoring='r2'
step=10

The fitted RFECV object was then applied to the preprocessed testing dataset to generate root mean square error (RMSE), $R^2$, and c-index statistics
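
With the parameters listed above, the RFECV step might look like the following sketch. The wrapper name `rfecv_select` is hypothetical, and `Lasso()` is used with its scikit-learn defaults because no other settings are stated.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Lasso


def rfecv_select(X_train, y_train):
    """Fit RFECV with a default Lasso estimator and return it with the kept column indices."""
    selector = RFECV(estimator=Lasso(), cv=5, scoring="r2", step=10)
    selector.fit(X_train, y_train)
    # support_ is a boolean mask over the input columns; convert it to indices.
    kept = np.flatnonzero(selector.support_)
    return selector, kept
```

The fitted selector exposes `predict`, which reduces the input to the kept columns before calling the underlying Lasso model, so test-set predictions can be scored the same way as for LassoCV.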

Results

Data

Normalized read counts for 24,595 genes (columns) across 135 tumor and 26 normal adjacent tissue (NAT) samples (rows)
Relative abundance (z-statistic) of 9,600 TMT-identified proteins
Relative abundance (adjusted log2-ratio) of 41,448 phosphosites

The dataset for this project can be found in the PRIDE archive under accession PXD055203.

Reference Paper

Yu J, Gui X, Zou Y, Liu Q, Yang Z, An J, Guo X, Wang K, Guo J, Huang M, Zhou S, Zuo J, Chen Y, Deng L, Yuan G, Li N, Song Y, Jia J, Zeng J, Zhao Y, Liu X, Du X, Liu Y, Wang P, Zhang B, Ding L, Robles AI, Rodriguez H, Zhou H, Shao Z, Wu L, Gao D. A proteogenomic analysis of cervical cancer reveals therapeutic and biological insights. Nat Commun. 2024 Nov 22;15(1):10114. doi: 10.1038/s41467-024-53830-0. PMID: 39578447; PMCID: PMC11584810.
