This final project for the NYU Advanced Integrative Omics course used a multi-omic approach to predict EC patient survival from normalized RNA read counts together with the relative abundance of proteins and phosphosites identified by tandem mass tag (TMT) mass spectrometry, measured as z-statistics and adjusted log2-ratios respectively. Dimensionality reduction of the omics data was performed using variance thresholding followed by one of two methods: Lasso with cross-validation (LassoCV) or recursive feature elimination with cross-validation (RFECV).
- Train-test split: Only samples with corresponding progression-free survival (PFS) data were retained, and these samples were split 80-20 into training and testing datasets
- Sparse column removal: Only genes that were measured in at least 75% of samples were retained
- Data imputation: The remaining missing values were imputed with a K-Nearest Neighbors (KNN) imputer (number of neighbors set to 3) that was fit on the training dataset and then applied to both the training and testing datasets
- Variance thresholding: The top N most variable columns of each training dataset were selected, and the testing dataset was filtered to the same columns
  - RNA: N=5000
  - Protein: N=6000
  - Phosphosite: N=5000
- Scaling: A standard scaler was fit to the training dataset
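The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual code: the function name `preprocess`, its arguments, and the random seed are hypothetical, the 75% coverage is computed on the training rows (the text does not say which rows were used), and the fitted scaler is assumed to be applied to the testing dataset as well.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess(X, y, top_n, min_frac=0.75, seed=0):
    """Hypothetical helper: split, drop sparse columns, impute, threshold, scale."""
    # Keep only samples with a PFS value, then split 80-20.
    has_pfs = y.notna()
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[has_pfs], y[has_pfs], test_size=0.2, random_state=seed
    )

    # Sparse column removal (assumption: coverage is measured on training rows).
    coverage = X_tr.notna().mean()
    keep = coverage[coverage >= min_frac].index
    X_tr, X_te = X_tr[keep], X_te[keep]

    # KNN imputation: fit on training data only, apply to both splits.
    imputer = KNNImputer(n_neighbors=3).fit(X_tr)
    X_tr = pd.DataFrame(imputer.transform(X_tr), index=X_tr.index, columns=keep)
    X_te = pd.DataFrame(imputer.transform(X_te), index=X_te.index, columns=keep)

    # Variance thresholding: top_n most variable training columns.
    top = X_tr.var().nlargest(top_n).index
    X_tr, X_te = X_tr[top], X_te[top]

    # Scaling: fit on training data (assumed to be applied to the test data too).
    scaler = StandardScaler().fit(X_tr)
    return scaler.transform(X_tr), scaler.transform(X_te), y_tr, y_te, list(top)


# Synthetic stand-in for one omics table (samples in rows, genes in columns).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(55, 30)), columns=[f"g{i}" for i in range(30)])
X = X * rng.uniform(0.5, 3.0, size=30)      # give columns different variances
X = X.mask(rng.random(X.shape) < 0.1)       # roughly 10% missing values
X["g0"] = np.nan                            # one column that is never measured
y = pd.Series(rng.exponential(size=55), index=X.index)
y.iloc[:5] = np.nan                         # five samples without PFS data

X_tr, X_te, y_tr, y_te, cols = preprocess(X, y, top_n=10)
print(X_tr.shape, X_te.shape)
```

Fitting the imputer, the column filter, and the scaler on the training split only keeps information from the testing samples out of the preprocessing.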
LassoCV was applied to the training dataset with parameters cv=5 and max_iter=5000. The fitted LassoCV object was then applied to the preprocessed testing dataset to compute the root mean square error (RMSE).
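A minimal sketch of this step with scikit-learn is shown here. The data come from `make_regression` as a synthetic stand-in for the preprocessed omics matrix and PFS values (an assumption for illustration only); the `cv=5` and `max_iter=5000` settings are the ones stated above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data (not the study data).
X, y = make_regression(
    n_samples=120, n_features=200, n_informative=10, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit Lasso with 5-fold cross-validation over the regularization path.
model = LassoCV(cv=5, max_iter=5000).fit(X_train, y_train)

# Evaluate on the held-out testing data.
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Features with a non-zero coefficient are the ones Lasso retained.
selected = np.flatnonzero(model.coef_)
print(f"alpha={model.alpha_:.4f}  RMSE={rmse:.3f}  features kept={selected.size}")
```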
A range of variance cutoffs was then tested, starting at 20 and increasing in increments of 40, until the results converged on a fixed set of genes.
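One way such a sweep could look is sketched below. The stopping rule is an assumption: the text does not define convergence, so this version stops when two consecutive cutoffs yield the same set of non-zero Lasso coefficients. The function name `sweep_variance_cutoffs` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler


def sweep_variance_cutoffs(X, y, start=20, step=40):
    """Grow the top-N variance cutoff until the selected feature set repeats."""
    n_cols = X.shape[1]
    order = np.argsort(X.var(axis=0))[::-1]   # column indices, most variable first
    previous = None
    for n in range(start, n_cols + step, step):
        n = min(n, n_cols)                    # last step uses every column
        cols = order[:n]
        X_n = StandardScaler().fit_transform(X[:, cols])
        coef = LassoCV(cv=5, max_iter=5000).fit(X_n, y).coef_
        selected = set(cols[coef != 0].tolist())
        if selected == previous:              # assumed convergence criterion
            break
        previous = selected
    return n, sorted(selected)


# Synthetic training data; the first 10 columns carry the signal.
X, y = make_regression(
    n_samples=120, n_features=200, n_informative=10,
    noise=5.0, shuffle=False, random_state=0,
)
rng = np.random.default_rng(0)
X = X * rng.uniform(0.5, 3.0, size=X.shape[1])  # give columns different variances

n_final, genes = sweep_variance_cutoffs(X, y)
print(f"stopped at N={n_final} with {len(genes)} selected columns")
```

If the selected set never repeats, the loop ends after the cutoff reaches the full column count and returns the set found there.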
RFECV was applied to the training dataset with the following parameters:
- cv=5
- estimator=Lasso()
- scoring='r2'
- step=10
The fitted RFECV object was then applied to the preprocessed testing dataset to compute the root mean square error (RMSE).
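The RFECV configuration listed above can be sketched as follows. The data are again a synthetic stand-in generated with `make_regression` (an assumption for illustration), and `Lasso()` is used with its default settings as in the parameter list.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data (not the study data).
X, y = make_regression(
    n_samples=120, n_features=100, n_informative=10, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Drop 10 features per iteration; pick the feature count with the best CV r2.
selector = RFECV(estimator=Lasso(), step=10, cv=5, scoring="r2")
selector.fit(X_train, y_train)

# predict() reduces X_test to the selected features and uses the refit estimator.
y_pred = selector.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

kept = np.flatnonzero(selector.support_)
print(f"features kept={selector.n_features_}  RMSE={rmse:.3f}")
```

With `step=10`, the candidate feature counts are coarse; a smaller step gives a finer search at the cost of more model fits.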
- Normalized read counts for 24,595 genes (columns) across 135 tumor and 26 normal adjacent tissue (NAT) samples (rows)
- Relative abundance (z-statistic) of 9,600 TMT-identified proteins
- Relative abundance (adjusted log2-ratio) of 41,448 phosphosites
The dataset for this project may be found in the PRIDE archive under accession PXD055203.
Yu J, Gui X, Zou Y, Liu Q, Yang Z, An J, Guo X, Wang K, Guo J, Huang M, Zhou S, Zuo J, Chen Y, Deng L, Yuan G, Li N, Song Y, Jia J, Zeng J, Zhao Y, Liu X, Du X, Liu Y, Wang P, Zhang B, Ding L, Robles AI, Rodriguez H, Zhou H, Shao Z, Wu L, Gao D. A proteogenomic analysis of cervical cancer reveals therapeutic and biological insights. Nat Commun. 2024 Nov 22;15(1):10114. doi: 10.1038/s41467-024-53830-0. PMID: 39578447; PMCID: PMC11584810.