This is a step-by-step guide on how to reproduce exactly the analysis performed in the paper.
Note #1: To run all steps, type `bash run_all_steps.sh` at the command line while inside
this directory. The script presumes that you have already run Fatiscan and FatiGO. If you do not have the `fatiscan_output`
and `fatigo_output` directories, the script currently breaks at step 7 or step 9, respectively.
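A pre-flight check along these lines can catch the missing directories before the pipeline reaches step 7 or step 9. This is only a sketch: `check_precomputed_dirs` is a hypothetical helper, and the directory paths passed to it are assumptions that should be adjusted to wherever your Fatiscan and FatiGO results live.

```shell
# Hypothetical helper: report every directory argument that does not exist.
# Returns 0 when all are present, 1 otherwise.
check_precomputed_dirs() {
    missing=0
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "missing: $dir" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (paths are illustrative assumptions):
# check_precomputed_dirs fatiscan_output fatigo_output && bash run_all_steps.sh
```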
Note #2: Each step's bash script assumes that it is run from its own directory. The scripts will break otherwise.
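If you need to launch a step from somewhere else, one option is to change into the script's directory inside a subshell first. The sketch below assumes nothing about the scripts themselves; `run_from_own_dir` is a hypothetical wrapper.

```shell
# Hypothetical wrapper: run a step script from the directory that contains it.
# The subshell keeps the caller's working directory unchanged.
run_from_own_dir() {
    script=$1
    ( cd "$(dirname "$script")" && bash "$(basename "$script")" )
}

# Example usage (path is illustrative):
# run_from_own_dir /path/to/repository/step_3.sh
```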
Note #3: The annotations and the TCGA data are not included with this repository because of their size (several GB total). The pipeline assumes that you complete steps 1a and 1b to obtain them.
- run `bash step_0.sh` to install the necessary packages (use `sudo bash step_0.sh` to install system-wide)
  - this presumes that R is already installed: to run this particular analysis, you need to install R-3.1.3
  - If you need to install it, follow these instructions:
    - download R-3.1.3 from here: https://cran.r-project.org/src/base/R-3/R-3.1.3.tar.gz
    - run `tar -xzf R-3.1.3.tar.gz` in the directory with the downloaded tarball.
    - navigate into the `R-3.1.3` directory, and then run either of the following sets of commands:
      - if you have no other versions of R, run `./configure` and then `make`, then `make install`
      - if you have other versions of R, run `./configure --prefix=/some/other/path` before running `make` and `make install`
    - be sure to use the right path to Rscript in the following steps
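The R-3.1.3 build instructions above can be collected into one function. This is a sketch rather than part of the repository: `run`, `build_r_3_1_3`, `DRY_RUN`, and `R_PREFIX` are hypothetical names, and using `curl` for the download is an assumption. By default the function only prints the commands so they can be reviewed first.

```shell
R_URL="https://cran.r-project.org/src/base/R-3/R-3.1.3.tar.gz"

# Print the command when DRY_RUN is 1 (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Download, unpack, configure, and build R-3.1.3.
# Set R_PREFIX to install alongside other R versions.
build_r_3_1_3() {
    run curl -L -O "$R_URL"
    run tar -xzf R-3.1.3.tar.gz
    run cd R-3.1.3
    if [ -n "${R_PREFIX:-}" ]; then
        run ./configure --prefix="$R_PREFIX"
    else
        run ./configure
    fi
    run make
    run make install
}

# Example usage:
# R_PREFIX=/some/other/path; build_r_3_1_3             # print the commands
# DRY_RUN=0; R_PREFIX=/some/other/path; build_r_3_1_3  # actually build
```

With a custom prefix, Rscript ends up under that prefix's `bin` directory, which is the path the later steps need.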
Note: you may need to change the top line of the `install_packages.R` script to reflect the proper path to Rscript. If there is an error, you can check the path by typing `which Rscript` at the command line.
Note: to run this on your own machine, you will currently need to change the directory paths at the beginning of each bash script.
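To compare the interpreter named in `install_packages.R` with the Rscript on your PATH, something like the following may help. It assumes the top line of the script is a `#!` line; `shebang_interpreter` is a hypothetical helper.

```shell
# Hypothetical helper: print the interpreter path from a script's "#!" line.
# Prints nothing if the first line is not a "#!" line.
shebang_interpreter() {
    head -n 1 "$1" | sed -n 's/^#![[:space:]]*//p'
}

# Example usage: the two outputs should agree.
# shebang_interpreter install_packages.R
# command -v Rscript
```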
- LUAD: 441 (19 control samples, 422 tumor samples)
  - use `luad_file_manifest.txt` with the GDC data transfer tool to download the same files yourself
- LUSC: 367 (37 control samples, 330 tumor samples)
  - use `lusc_file_manifest.txt` with the GDC data transfer tool to download the same files yourself
The easiest way to do this is to follow the steps in the `tcga_file_manifests` directory using the
GDC data transfer tool.
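For orientation, a manifest download with the GDC data transfer tool usually looks like the sketch below. The manifest locations and destination directories are assumptions, `download_cohort` is a hypothetical wrapper, and the steps in the `tcga_file_manifests` directory remain the authoritative instructions. By default the function only prints the command.

```shell
# Hypothetical wrapper around gdc-client for one cohort.
# Prints the command unless DRY_RUN=0, in which case it runs the download.
download_cohort() {
    manifest=$1   # e.g. luad_file_manifest.txt
    dest=$2       # e.g. LUAD_data
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "gdc-client download -m $manifest -d $dest"
    else
        mkdir -p "$dest" && gdc-client download -m "$manifest" -d "$dest"
    fi
}

# Example usage:
# download_cohort luad_file_manifest.txt LUAD_data
# download_cohort lusc_file_manifest.txt LUSC_data
```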
If you download the data manually, please keep in mind the following note:
NOTE: For the downstream steps to work without modification, you must rename the top level directories for this data
as LUAD_data and LUSC_data, respectively, and place these directories in this repository.
If you wish to use different names, or place directories elsewhere on your computer, you need to modify the
LUAD_DIR and LUSC_DIR variables at the top of the following scripts: step_2c.sh, step_3.sh, step_4.sh, step_5.sh, step_7.sh, and step_9.sh
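Before editing, it can help to list where those variables are set so that every script is changed consistently. The sketch assumes each assignment starts at the beginning of a line (for example `LUAD_DIR=...`); `list_data_dir_vars` is a hypothetical helper.

```shell
# Hypothetical helper: show file, line number, and text of every
# LUAD_DIR / LUSC_DIR assignment in the given scripts.
# /dev/null forces grep to print file names even for a single input file.
list_data_dir_vars() {
    grep -n -E '^(LUAD|LUSC)_DIR=' "$@" /dev/null
}

# Example usage:
# list_data_dir_vars step_2c.sh step_3.sh step_4.sh step_5.sh step_7.sh step_9.sh
```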
- RNA: UCSC, hg19, June 2011; miRNA: miRBase v21 and miRanda 08/2010 release
- run `step_1b.sh` to do two things:
  - download UCSC tables, the miRNA annotations and initial miRanda miRNA-target predictions
  - generate the table that matches pathway IDs to intelligible pathway names

  NOTE: this download takes up approximately 880MB of space
- run `step_2a.sh` to prepare the UCSC annotations to facilitate downstream analyses

  Output: a combined kgXref table to facilitate mapping UCSC IDs to gene names
- run `step_2b.sh`

  Outputs:
  - a list of all miRNAs (and accessions) for both the target matrix and miRBase v21
  - a list of all targets
  - a matrix of interactions (1 if yes, 0 if no)
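To make the last output concrete, the sketch below turns a two-column list of miRNA-target pairs into a 0/1 matrix. It illustrates the data structure only and is not the code in `step_2b.sh`; the input format, the orientation (targets as rows, miRNAs as columns), and the name `build_interaction_matrix` are all assumptions.

```shell
# Hypothetical illustration: read "miRNA<TAB>target" pairs on stdin and
# print a tab-separated matrix with 1 where a pair was seen, 0 otherwise.
build_interaction_matrix() {
    awk '
        {
            # Remember each miRNA and target in first-seen order.
            if (!($1 in mseen)) { mseen[$1] = 1; mirnas[++nm] = $1 }
            if (!($2 in tseen)) { tseen[$2] = 1; targets[++nt] = $2 }
            hit[$1, $2] = 1
        }
        END {
            line = "target"
            for (j = 1; j <= nm; j++) line = line "\t" mirnas[j]
            print line
            for (i = 1; i <= nt; i++) {
                line = targets[i]
                for (j = 1; j <= nm; j++)
                    line = line "\t" (((mirnas[j], targets[i]) in hit) ? 1 : 0)
                print line
            }
        }'
}

# Example usage (toy identifiers):
# printf 'miR-1\tgeneA\nmiR-1\tgeneB\nmiR-2\tgeneB\n' | build_interaction_matrix
```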
- run `step_2c.sh`
  - this automatically runs on both the LUAD and LUSC directories

  Outputs:
  - miRNA count matrix ready for step 3 is in the `(LUAD|LUSC)_data` directory
  - miRNA counts ready for step 4 are in the `(LUAD|LUSC)_data/miRNASeq/BCGSC__50/Level_3/compressed_miRNAcounts` directory
  - RNA normalized isoform counts ready for steps 3 and 4 are in the `(LUAD|LUSC)_data/RNASeqV2/UNC__58/Level_3/TCGA_isoform_normalized_results` directory
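A quick way to confirm that step 2c produced its output directories before moving on is sketched below. The subdirectory paths come from the list above; `check_step2c` is a hypothetical helper, and the data directory is passed as an argument.

```shell
# Hypothetical helper: verify the step 2c output directories for one cohort.
# Returns 0 when both exist, 1 otherwise.
check_step2c() {
    data_dir=$1   # LUAD_data or LUSC_data
    status=0
    for sub in \
        "miRNASeq/BCGSC__50/Level_3/compressed_miRNAcounts" \
        "RNASeqV2/UNC__58/Level_3/TCGA_isoform_normalized_results"
    do
        if [ ! -d "$data_dir/$sub" ]; then
            echo "missing: $data_dir/$sub" >&2
            status=1
        fi
    done
    return "$status"
}

# Example usage:
# check_step2c LUAD_data && check_step2c LUSC_data
```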
- run `step_3.sh`

  Output: DESeq2 results are in the `(LUAD|LUSC)_data/deseq_results` directories. The `*Rank.txt` file is the one used in step 6 for the Fatiscan analysis.
- run `step_4.sh`

  Output: the ProMISe RData will be stored in a directory called `output`
- run `step_5.sh` to process the ProMISe results

  Output: this produces a GMT gene sets file with transcripts, and an extended gene set file. The latter is used for step 6, along with the rank file generated by step 3.
This requires manual manipulation of files
- Make an account on Babelomics v4 (http://v4.babelomics.org)
- Upload the extended miRNA-target gene set file with data type `Annotation > Extended Annotation`
- Upload the DESeq2 rank list of transcripts with data type `ID List > Ranked`
- Choose `Functional Analysis`, then `Gene Set Analysis` under `Set Enrichment Analysis`
- Choose the rank list for input data; choose `fatiscan` for test, and choose `two-tailed`
- Choose `your own annotations` and select the extended list
Precomputed results from Fatiscan are found in the fatiscan_output directory.
- first download the results from the Babelomics website. On the bottom of the results page, click `Download Job`
- move the resulting job folder to the `(LUAD|LUSC)_data` directory and rename it `fatiscan_output`
- run `step_7.sh`

  Output: a `post_babelomics_results` directory with processed results for up-regulated and down-regulated genes separately
  - in these directories (specifically, `downregulatedGenes/up-regulated_miRNAs/` and `upregulatedGenes/down-regulated_miRNAs`), you'll find `*FatigoInput.txt` lists of the genes we'll use as input for FatiGO in step 8.
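To locate the FatiGO input lists after step 7, a `find` over the results directory is usually enough. In this sketch the location of `post_babelomics_results` is passed as an argument, since it depends on your setup; `list_fatigo_inputs` is a hypothetical helper.

```shell
# Hypothetical helper: list every *FatigoInput.txt file under a directory.
list_fatigo_inputs() {
    find "$1" -type f -name '*FatigoInput.txt' | sort
}

# Example usage (path is illustrative):
# list_fatigo_inputs LUAD_data/post_babelomics_results
```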
Like step 6, this also requires manual input of data
- Upload the up-regulated and down-regulated lists of mRNA targets as `ID List > Gene`
- Choose `Functional Analysis` then `FatiGO` under `Single Enrichment Analysis`
- Choose `Id list vs Rest of Genome`
- Choose the up- or down-regulated list (run a separate analysis for each)
- `Remove duplicates?` should be set to `Remove from list2 those appearing in list1 (complementary list)`
- For databases, choose `human` for organism, and check boxes for `GO biological process`, `GO cellular component`, `GO molecular function`, `KEGG pathways`, `Reactome`, and `Biocarta`.

Note: For this step, if you do both LUSC and LUAD, you'll run four separate jobs.
Precomputed results from this step are found in the fatigo_output directory
- first download the results from the Babelomics website. Like before, click `Download job`
- in the `(LUAD|LUSC)_data` directory, create a directory called `fatigo_out`
- move the resulting job folder to `fatigo_out` and rename it either `upGene` or `downGene`, depending on whether the FatiGO job focused on up-regulated or down-regulated genes.
- run `step_9.sh`

  Output: PDF files of the hive plots.
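The create, move, and rename steps for a downloaded FatiGO job can be scripted roughly as follows. The directory and folder names follow the steps above; `stage_fatigo_job` is a hypothetical helper, and the job folder path is whatever Babelomics gave you.

```shell
# Hypothetical helper: place a downloaded FatiGO job folder where step 9
# expects it, under <data_dir>/fatigo_out/<upGene|downGene>.
stage_fatigo_job() {
    job_dir=$1      # folder downloaded from Babelomics
    data_dir=$2     # LUAD_data or LUSC_data
    direction=$3    # upGene or downGene
    case "$direction" in
        upGene|downGene) ;;
        *) echo "direction must be upGene or downGene" >&2; return 1 ;;
    esac
    if [ -e "$data_dir/fatigo_out/$direction" ]; then
        echo "already exists: $data_dir/fatigo_out/$direction" >&2
        return 1
    fi
    mkdir -p "$data_dir/fatigo_out" &&
        mv "$job_dir" "$data_dir/fatigo_out/$direction"
}

# Example usage (job folder name is illustrative):
# stage_fatigo_job ~/Downloads/my_fatigo_job LUAD_data upGene
```

Refusing to overwrite an existing `upGene` or `downGene` folder avoids `mv` silently nesting a second job inside the first.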
Note: You'll have to manually add annotations afterward using the tables and Adobe, Preview, or PowerPoint.
Note 2: If there are any issues with reproducing this analysis, please submit an issue on GitHub.