1. "scripts/HEK293/1_genome_and_annotations"
- Build STAR genome with hg38 for mapping
- Find genomic-A stretches and merge them with the ENCODE blacklist
- Prepare GENCODE v46 annotations for metagenes and for counting reads in the transcription start site (TSS), gene body, gene end and readthrough regions
- Prepare housekeeping genes from RSeQC for counting in quality control steps
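The genomic-A flagging step can be illustrated with a minimal Python sketch. It is not the repository's script: the function name `find_a_stretches`, the minimum run length of 6 and the choice to report T runs as minus-strand A stretches are all assumptions made for illustration.

```python
import re

def find_a_stretches(chrom, seq, min_len=6):
    """Return BED-style (chrom, start, end, strand) tuples for homopolymer runs.

    Runs of A are reported on the plus strand and runs of T on the minus
    strand (a minus-strand A stretch reads as T on the reference).
    min_len is a hypothetical threshold, not the value used in the pipeline.
    """
    hits = []
    for base, strand in (("A", "+"), ("T", "-")):
        pattern = re.compile(f"{base}{{{min_len},}}")
        for match in pattern.finditer(seq.upper()):
            # BED coordinates are 0-based, half-open
            hits.append((chrom, match.start(), match.end(), strand))
    return sorted(hits, key=lambda h: (h[1], h[2]))

# Toy sequence: a 7-nt A run followed by a 6-nt T run
example = "GG" + "A" * 7 + "CC" + "T" * 6 + "G"
for hit in find_a_stretches("chr1", example):
    print(*hit, sep="\t")
```

Intervals produced this way could then be concatenated with the blacklist intervals and merged, for example with bedtools.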
2. "scripts/HEK293/2_preprocessing_and_qualityControl"
- Read quality control via FastQC
- Adapter and quality trimming using bbduk.sh
- Map the reads to the built STAR genome
- Report quality control via MultiQC
- Principal component analysis via deepTools
3. "scripts/HEK293/3_filtering_normalisation_and_merging"
- Filter out blacklisted regions, genomic-A stretches and the top/bottom 0.5% of the signal (treated as outliers that skew the data) using bedtools and awk
- Count the spike-in reads via featureCounts
- Estimate size factors from spike-ins using DESeq2
- Merge replicates of the same library to enhance the signal using bedtools and awk
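As a rough illustration of the size-factor step, the sketch below re-implements the median-of-ratios idea behind DESeq2's `estimateSizeFactors` in plain Python. It is not the pipeline's R code; the function name `size_factors` and the toy spike-in counts are hypothetical.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of spike-in feature counts,
    with features in the same order for every sample.
    """
    samples = list(counts)
    n_features = len(counts[samples[0]])

    # Log geometric mean of each feature across samples;
    # features with a zero count in any sample are skipped.
    log_geo_means = []
    for i in range(n_features):
        values = [counts[s][i] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][i]) - g
            for i, g in enumerate(log_geo_means)
            if g is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors

# Hypothetical spike-in counts: sample "b" is sequenced twice as deeply as "a"
toy = {"a": [10, 20, 30], "b": [20, 40, 60]}
print(size_factors(toy))
```

Dividing each sample's signal by its size factor puts the samples on a common scale, which is how DESeq2-style size factors are normally applied.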
4. "scripts/HEK293/4_subtacting_noPAP_from_bPAP"
- Count the reads at the 3' end region using bedtools to normalise between +bPAP and no bPAP libraries. The two library types differ in complexity but are expected to show a fairly similar profile of polyadenylated transcripts at the 3' end
- Estimate size factors from 3' end regions using DESeq2 and apply them via bedtools and awk
- Subtract no bPAP from +bPAP at single nucleotide resolution using deepTools
- Remove the negative values that arise from regions unmatched between +bPAP and no bPAP, and quantify them using awk, ggplot2 and ggpubr
- Generate bedgraphs for viewing in the Integrative Genomics Viewer (IGV), with negative values for the reverse strand, using awk
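The subtraction, negative-value removal and reverse-strand sign flip described above can be sketched as follows. The pipeline performs these steps with deepTools and awk; this Python version only illustrates the logic, and the function names and the in-memory dictionary representation are assumptions.

```python
def subtract_and_clip(bpap, nopap):
    """Subtract no bPAP from +bPAP signal per position and drop negatives.

    bpap, nopap: dicts mapping (chrom, position) -> normalised signal.
    Returns the positive differences and the number of negative positions.
    """
    diff = {}
    n_negative = 0
    for key in set(bpap) | set(nopap):
        d = bpap.get(key, 0.0) - nopap.get(key, 0.0)
        if d < 0:
            n_negative += 1      # counted for reporting, then discarded
        elif d > 0:
            diff[key] = d
    return diff, n_negative

def to_bedgraph_lines(signal, strand):
    """Format single-nucleotide signal as bedgraph lines.

    Reverse-strand values are negated so that they plot below the axis in IGV.
    """
    sign = -1 if strand == "-" else 1
    return [
        f"{chrom}\t{pos}\t{pos + 1}\t{sign * value:g}"
        for (chrom, pos), value in sorted(signal.items())
    ]

# Toy data: position 11 has more no bPAP than +bPAP signal, position 12 has only no bPAP signal
bpap = {("chr1", 10): 5.0, ("chr1", 11): 1.0}
nopap = {("chr1", 11): 3.0, ("chr1", 12): 2.0}
diff, n_negative = subtract_and_clip(bpap, nopap)
print(n_negative)
print("\n".join(to_bedgraph_lines(diff, "-")))
```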
5. "scripts/HEK293/5_metagene_analysis"
- Compute and plot metagenes using deepTools
6. "scripts/HEK293/6_spliceJunction_analysis"
- Calculate splicing efficiency using SPLICE-q
- Calculate mean splicing efficiency per sample and visualise the distributions using ggplot2 and ggpubr
7. "scripts/HEK293/7_synthesis_decay_pausing_rate_analysis"
- Count the reads at TSS, gene body, gene end and readthrough regions using bedtools map by summing the signal in the processed bedgraph files
- Prepare a counts table covering all regions and all samples, and normalise the summed signal by dividing by the width of each region
- Calculate synthesis and decay rates, pausing index and termination efficiency. Cluster the genes by synthesis rate using k-means and visualise the clusters using ggpubr
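A minimal sketch of the width normalisation and of a pausing index is given below. The pausing index here uses a commonly used definition (TSS signal density divided by gene body signal density), which is an assumption and may differ from the formula in the repository's scripts; the rate and termination efficiency calculations are not reproduced because their definitions are not stated here.

```python
def density(signal_sum, start, end):
    """Summed signal divided by region width (0-based, half-open coordinates)."""
    width = end - start
    if width <= 0:
        raise ValueError("region must have positive width")
    return signal_sum / width

def pausing_index(tss_density, body_density):
    """TSS density over gene body density (assumed definition)."""
    if body_density <= 0:
        return float("nan")
    return tss_density / body_density

# Hypothetical gene: 100-nt TSS window and 5,000-nt gene body
tss = density(200.0, 1000, 1100)
body = density(500.0, 1100, 6100)
print(tss, body, pausing_index(tss, body))
```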
8. "scripts/HEK293/8_TSS_enrichment_in_bPAP_over_noPAP"
- Calculate the median fold change of the signal in +bPAP over no bPAP at the TSS region
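The median fold change could be computed along these lines. The pseudocount, the function name and the per-gene dictionary input are assumptions for illustration, not details taken from the repository.

```python
from statistics import median

def median_fold_change(bpap, nopap, pseudocount=1.0):
    """Median of per-gene +bPAP / no bPAP ratios at the TSS region.

    bpap, nopap: dicts mapping gene ID -> TSS signal.
    The pseudocount (hypothetical) avoids division by zero.
    """
    common = sorted(set(bpap) & set(nopap))
    ratios = [(bpap[g] + pseudocount) / (nopap[g] + pseudocount) for g in common]
    return median(ratios)

# Toy TSS signal for three hypothetical genes
bpap = {"g1": 9, "g2": 3, "g3": 19}
nopap = {"g1": 1, "g2": 1, "g3": 4}
print(median_fold_change(bpap, nopap))
```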
9. "scripts/HEK293/9_comparison_with_TTseq_PROseq_NETseq"
- Put Phil's TT-Seq (GSM5452296) and the publicly available TT-Seq (GSM4730176), PRO-Seq (GSM4730174) and mNET-Seq (GSM7990390) data from HEK293 cells on the same scale
- Compute and plot metagenes of SNU-Seq, TT-Seq, PRO-Seq and mNET-Seq using deepTools
- Generate a bedgraph with negative values for the reverse strand to visualise Phil's TT-Seq data in IGV
10. "scripts/HEP3B/1_prepare_ATACseqPeaks_and_FANTOM5_forMetagenes"
- Prepare Anna's ATAC-Seq (GSE172053) peaks and the FANTOM5 enhancer data in a similar way to the genome annotation preparation in HEK293 cells, to obtain clean, non-intersecting regions for a reliable metagene representation
11. "scripts/HEP3B/2_prepare_SNUseq_ChIPseq_data"
- Prepare Anna's SNU-Seq data and H3K27ac and H3K4me3 ChIP-Seq data from GSE172053 through file type conversion, concatenation and log2 transformation
- Calculate H3K27ac to H3K4me3 ratio using deepTools and summarise the resulting ratios
12. "scripts/HEP3B/3_signal_presence_in_ATACpeak_and_FANTOM5_regions"
- Determine the presence of SNU-Seq, H3K27ac and H3K4me3 signal and quantify its distribution over the ATAC-Seq peaks and FANTOM5 regions
13. "scripts/HEP3B/4_metagene_analysis"
- Sort the ATAC-Seq peaks and FANTOM5 annotations accordingly, and compute and plot metagenes using deepTools
This pipeline uses multiple Conda environments to manage dependencies for the different stages of the analysis. All environment configuration files are located in the envs/ directory.
| Analysis Stage | Environment File | Key Tools |
|---|---|---|
| DeepTools | envs/deeptools_env.yml | DeepTools, Python 3.8 |
| QC aggregation | envs/multiqc_env.yml | MultiQC |
| Genomic-A flagging | envs/py27.yml | Python 2.7 scripts |
| Splicing analysis | envs/spliceQ_env.yml | SPLICE-q |
To replicate a specific environment, use the following command structure:

    # Example: create the DeepTools environment ("mamba" can be replaced directly with "conda")
    mamba env create -f envs/deeptools_env.yml
    # Activate the environment ("mamba" can be replaced directly with "conda")
    mamba activate deeptools_env

R packages are included within the respective conda YAML files where possible. For most of the R and package versions used in the scripts, please refer to envs/R_versions.txt.
- manschmi/MexNab_3seq
- manschmi/MS_Metagene_Tools
- nf-core/rnaseq
- See the R packages and other tools used in the scripts and under the envs/ directory.