Author
- Anna Chesnokova
Supervisor
- Ivan Valiev
Tobacco Smoking History Status:
- Lifelong Non-smoker (less than 100 cigarettes smoked in Lifetime) = 1
- Current smoker (includes daily smokers and non-daily smokers or occasional smokers) = 2
- Current reformed smoker for > 15 years (greater than 15 years) = 3
- Current reformed smoker for ≤15 years (less than or equal to 15 years) = 4
- Current reformed smoker, duration not specified = 5
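For downstream scripts it can help to keep this encoding in one place. A minimal Python sketch of such a lookup follows; the names `SMOKING_STATUS` and `describe_status` are illustrative and are not taken from the repository, and the labels are shortened versions of the ones listed above.

```python
# Numeric codes for tobacco smoking history, as listed above.
# The dictionary and function names are hypothetical helpers.
SMOKING_STATUS = {
    1: "Lifelong non-smoker (<100 cigarettes in lifetime)",
    2: "Current smoker",
    3: "Reformed smoker for > 15 years",
    4: "Reformed smoker for <= 15 years",
    5: "Reformed smoker, duration not specified",
}


def describe_status(code):
    """Return the label for a smoking-history code, or raise on unknown codes."""
    try:
        return SMOKING_STATUS[int(code)]
    except (KeyError, ValueError, TypeError):
        raise ValueError(f"Unknown smoking status code: {code!r}")


print(describe_status(2))
```

Accepting both integers and numeric strings is a convenience for values read from CSV metadata, where codes often arrive as text.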
Lung cancer remains one of the most prevalent and deadly forms of cancer worldwide [1]. Smoking is a major risk factor that influences both the initiation and progression of lung cancer and leaves characteristic patterns of mutations and gene expression [2]. Recent advances in genomics and transcriptomics have provided valuable insights into the molecular mechanisms underlying smoking-related carcinogenesis [3][4]. This study aims to develop a machine learning model that predicts the smoking status of lung cancer patients from their genetic and transcriptomic profiles. The primary goal is to improve classification accuracy by integrating transcription factors (TFs), differentially expressed genes (DEGs) and mutational signatures.
Gene expression data, smoking status data and mutation annotation files were retrieved from the UCSC Xena database [5].
To run the .ipynb notebooks, download 'requirements.txt' and install the packages it lists.
- Obtain the tumor type and smoking status columns from 'tcga.tcga.LUAD.MD': 'transpose_ID_SmokingHistory.csv' and 'transpose_ID_SmokingHistory.csv'.
- For the read count file tcga.gene_sums.LUAD.R109.gz (compressed for GitHub), the gene ID translation table gene_annotation_table.txt was built from the gene annotation file human.gene_sums.R109.gtf.
Script for data preparation and calculation of differential gene expression: DESeq_LUAD.R.
- Select only primary and recurrent tumors.
- Prepare metadata for DESeq2 [6] and create a table for the machine learning models (data_t_sort_all.txt.gz, compressed for GitHub).
- Obtain train IDs (train_id.csv) for calculating differential gene expression.
- Create the DESeq2 dataset from the counts matrix and metadata.
- Normalize the data using size factors.
- Save the DESeq2 results: DESeq_res.csv.
- Save the top 250 differentially expressed genes (padj < 0.05) to CSV files: DEG_up.csv, DEG_up.csv.
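The size-factor normalization step in the list above is performed by DESeq2 in DESeq_LUAD.R. To illustrate the idea only, here is a small pure-Python sketch of the median-of-ratios method that DESeq2 uses by default; the function names and the toy counts are assumptions made for this example and do not come from the repository.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene -> list of raw counts, one per sample,
    with samples in the same order for every gene.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        # A gene with a zero in any sample has a geometric mean of zero
        # and cannot serve as a reference, so it is skipped.
        if any(c <= 0 for c in gene_counts):
            continue
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has non-zero counts in every sample")
    # One factor per sample: the median ratio to the per-gene reference.
    return [math.exp(median(r)) for r in log_ratios]


def normalize(counts, factors):
    """Divide each sample's counts by that sample's size factor."""
    return {g: [c / f for c, f in zip(row, factors)] for g, row in counts.items()}


# Toy data: the second sample was sequenced twice as deeply as the first.
toy = {"geneA": [10, 20], "geneB": [100, 200], "geneC": [0, 5]}
sf = size_factors(toy)
norm = normalize(toy, sf)
print(sf)
print(norm["geneA"])
```

After normalization, geneA and geneB have equal values in both samples, which is the behavior expected when the only difference between samples is sequencing depth.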
Logistic regression models built on the expression of transcription factors and DEGs, shown as an example (the other feature sets were handled the same way): DF_TF_logreg.ipynb.
- Prepare the data and obtain train IDs for DESeq2.
- Compute TPM for each gene with the custom function read_counts2tpm(), which uses gene lengths (gene_length_1.txt) extracted from the gene annotation file (human.gene_sums.R109.gtf) by the script gene_length.bash.
- Save the table of TPM values: X_tpm_DF.csv.
- Build logistic regression models for the DEGs (on TPM and on log10(TPM + 1) values).
- Build logistic regression models for the TFs obtained from the Enrichr database [7] (on TPM and on log10(TPM + 1) values).
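The TPM and log10(TPM + 1) transformations named in the list above can be sketched as follows. This is an independent illustration of the standard TPM definition, not the repository's read_counts2tpm(); the function names, the dictionary-based inputs and the toy values are assumptions.

```python
import math


def counts_to_tpm(counts, lengths_bp):
    """Convert one sample's read counts to TPM.

    counts, lengths_bp: dicts keyed by gene ID (lengths in base pairs).
    """
    # Step 1: reads per kilobase, which corrects for gene length.
    rpk = {g: counts[g] / (lengths_bp[g] / 1000.0) for g in counts}
    # Step 2: rescale so that the values of one sample sum to one million.
    scale = sum(rpk.values()) / 1e6
    if scale == 0:
        raise ValueError("all counts are zero")
    return {g: v / scale for g, v in rpk.items()}


def log10_tpm(tpm):
    """log10(TPM + 1); the +1 keeps zero expression finite (it maps to 0)."""
    return {g: math.log10(v + 1.0) for g, v in tpm.items()}


# Toy sample: counts proportional to gene length give equal TPM.
sample_counts = {"g1": 100, "g2": 200, "g3": 300}
gene_lengths = {"g1": 1000, "g2": 2000, "g3": 3000}
tpm = counts_to_tpm(sample_counts, gene_lengths)
log_tpm = log10_tpm(tpm)
print(tpm)
print(log_tpm)
```

Because length normalization happens before the per-sample rescaling, TPM values always sum to one million within a sample, which makes them comparable across samples as model features.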
The work with mutation annotation files is located in the /maf folder.
- From the original file (TCGA-LUAD.mutect2_snv.tsv) [8], keep only single-nucleotide variants (TCGA-LUAD.mutect2_sbs.tsv) using the script filter_sbs.sh.
- Using the script count_signatures.R and the deconstructSigs library [9], build a trinucleotide context matrix and calculate the weight of each signature: combined_table_signatures_tobacco.tsv.
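In the repository these two steps are done by filter_sbs.sh and by deconstructSigs in R. As an illustration of what a trinucleotide context matrix counts, the sketch below classifies single-base substitutions into the usual 96 pyrimidine-centered channels. It assumes the three-base reference context is already known for each variant (in practice it is looked up in the reference genome), and all names and toy variants are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def is_sbs(ref, alt):
    """True only for a single-base substitution between two different bases."""
    return ref in COMPLEMENT and alt in COMPLEMENT and ref != alt


def sbs_channel(context, alt):
    """Return a channel label such as 'T[C>A]A'.

    context: three reference bases centered on the mutated base.
    Purine-centered contexts are reverse-complemented so that the
    mutated base is always reported as C or T.
    """
    if len(context) != 3 or any(b not in COMPLEMENT for b in context):
        raise ValueError(f"bad trinucleotide context: {context!r}")
    if not is_sbs(context[1], alt):
        raise ValueError(f"not a single-base substitution: {context[1]}>{alt}")
    if context[1] in "AG":
        context = "".join(COMPLEMENT[b] for b in reversed(context))
        alt = COMPLEMENT[alt]
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def count_channels(variants):
    """variants: iterable of (context, ref, alt); non-SBS records are skipped."""
    counts = Counter()
    for context, ref, alt in variants:
        if is_sbs(ref, alt):
            counts[sbs_channel(context, alt)] += 1
    return counts


toy_variants = [
    ("TCA", "C", "A"),   # C>A on the reported strand
    ("TGA", "G", "T"),   # same event seen from the opposite strand
    ("ACG", "C", "T"),
    ("AAT", "AT", "A"),  # deletion: filtered out
]
channel_counts = count_channels(toy_variants)
print(channel_counts)
```

Folding both strands into one label is what reduces the 192 possible strand-specific substitutions to 96 channels; signature weights are then fitted to each sample's 96-channel profile.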
Models built on the expression of DEGs combined with the weight of each signature, shown as an example (the other feature sets were handled the same way): DEG_SIGS_models.ipynb.
- Replace patient IDs and combine the DEG data with the mutation-signature weights.
- Build machine learning models: Logistic Regression, Decision Tree Classifier, Random Forest Classifier, and LGBM Classifier [10].
- Calculate SHAP values.
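The first step in the list above, matching patient IDs and combining the two feature tables, can be sketched as follows. This assumes that both tables are keyed by TCGA barcodes of different lengths and that the first three dash-separated fields identify the patient; the actual ID handling in DEG_SIGS_models.ipynb may differ. The function names, barcodes and feature names below are hypothetical.

```python
def patient_id(barcode):
    """Trim a TCGA-style barcode to project-site-participant."""
    parts = barcode.split("-")
    if len(parts) < 3:
        raise ValueError(f"unexpected barcode format: {barcode!r}")
    return "-".join(parts[:3])


def combine_features(expression, signatures):
    """Inner-join two {barcode: {feature: value}} tables on patient ID.

    If a patient has several samples in one table, the last one wins;
    a real pipeline would need an explicit rule for that case.
    """
    sig_by_patient = {patient_id(b): feats for b, feats in signatures.items()}
    combined = {}
    for barcode, feats in expression.items():
        pid = patient_id(barcode)
        if pid in sig_by_patient:
            combined[pid] = {**feats, **sig_by_patient[pid]}
    return combined


# Made-up barcodes and feature names, for illustration only.
expression = {
    "TCGA-AA-0001-01A": {"GENE1": 2.1},
    "TCGA-AA-0002-01A": {"GENE1": 0.3},
}
signatures = {
    "TCGA-AA-0001-01A-01D-0000-08": {"sig_tobacco": 0.6},
}
combined = combine_features(expression, signatures)
print(combined)
```

An inner join keeps only patients present in both tables, so every row passed to the classifiers has both expression and signature features.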
Figure 1. PCA plot of normalized read counts.
Figure 2. PCA plot of log2-transformed normalized read counts.
- RHOXF2
- GATA1
- CTCFL
- MAEL
- TFDP3
- ZNF595
The results indicate that DEGs and mutational signatures are powerful predictors of smoking history in lung cancer patients. TFs, while biologically significant, did not provide sufficient discriminatory power on their own, likely due to the complexity of transcriptional regulation and the indirect nature of TF activity. The superior performance of the combined feature set underscores the importance of integrating multiple data types to capture the multifaceted impact of smoking on lung cancer biology. Future studies should focus on validating these findings in larger independent cohorts, improving machine learning models and exploring the potential for clinical application in personalized medicine.
Footnotes
- Siegel, R. L., Miller, K. D., and Jemal, A. "Cancer statistics, 2020." CA: A Cancer Journal for Clinicians, 2020. doi:10.3322/caac.21590.
- Govindan, R., et al. "Genomic landscape of non-small cell lung cancer in smokers and never-smokers." Cell, 2012, 150(6), 1121-1134. doi:10.1016/j.cell.2012.08.024.
- Alexandrov, L. B., Nik-Zainal, S., Wedge, D. C., et al. "Signatures of mutational processes in human cancer." Nature, 2013. doi:10.1038/nature12477.
- Goldman, M. J., Craft, B., Hastie, M., et al. "Visualizing and interpreting cancer genomics data via the Xena platform." Nature Biotechnology, 2020. doi:10.1038/s41587-020-0546-8.
- Love, M. I., Huber, W., and Anders, S. "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2." Genome Biology, 2014, 15, 550. doi:10.1186/s13059-014-0550-8.
- Tate, J. G., et al. "COSMIC: the catalogue of somatic mutations in cancer." Nucleic Acids Research, 2019, 47(D1), D941-D947. doi:10.1093/nar/gky1015.
- Kuleshov, M. V., Jones, M. R., Rouillard, A. D., et al. "Enrichr: a comprehensive gene set enrichment analysis web server 2016 update." Nucleic Acids Research, 2016. doi:10.1093/nar/gkw377.
- Rosenthal, R., McGranahan, N., Herrero, J., et al. "DeconstructSigs: delineating mutational processes in single tumors distinguishes DNA repair deficiencies and patterns of carcinoma evolution." Genome Biology, 2016. doi:10.1186/s13059-016-0893-4.
- Pedregosa, F., Varoquaux, G., Gramfort, A., et al. "Scikit-learn: Machine learning in Python." Journal of Machine Learning Research, 2011.
- Lambert, S. A., et al. "The human transcription factors." Cell, 2018, 172(4), 650-665. doi:10.1016/j.cell.2018.01.029.




