ugerlevik/SNU-seq

SNU-Seq Data Processing and Analysis

These scripts are used in the following study:

Gerlevik, U., Lorenz, P., Lamstaes, A., Fischl, H., Xi, S., Saukko-Paavola, A., Murray, S., Brown, T., Welch, A., George, C., Angel, A., Furger, A., Mellor, J. (2025). "Single Nucleotide Resolution 4sU Sequencing (SNU-Seq) reveals the transcriptional responsiveness of an epigenetically primed human genome". bioRxiv.

Scripts in order

1. "scripts/HEK293/1_genome_and_annotations"

2. "scripts/HEK293/2_preprocessing_and_qualityControl"

  • Read quality control via FastQC
  • Adapter and quality trimming using bbduk.sh
  • Map the reads to the built STAR genome
  • Report quality control via MultiQC
  • Principal component analysis via deepTools

3. "scripts/HEK293/3_filtering_normalisation_and_merging"

  • Filter out blacklisted regions, genomic A stretches and the top/bottom 0.5% of the signal (treated as "outliers" that skew the data) using bedtools and awk
  • Count the spike-in reads via featureCounts
  • Estimate size factors from spike-ins using DESeq2
  • Merge replicates of the same libraries to enhance the signal via bedtools and awk
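The outlier step above can be illustrated with a minimal sketch. It is not taken from the repository scripts: the function name `trim_signal_outliers`, the 4-column bedGraph input (chrom, start, end, value) and the rank-based choice of cutoffs are all assumptions made for illustration. The actual scripts may compute the 0.5% thresholds differently.

```shell
# Hypothetical helper (not from the repository): drop bedGraph intervals whose
# signal falls in the lowest or highest fraction of values (default 0.005 = 0.5%).
# Usage: trim_signal_outliers [FRACTION] < in.bedgraph > out.bedgraph
# The function body runs in a subshell, so its variables stay local.
trim_signal_outliers() (
  frac="${1:-0.005}"
  tmp="$(mktemp)"
  cat > "$tmp"                        # buffer stdin so it can be read twice

  # Pass 1: sort the signal column and pick the cutoffs by rank.
  cut -f4 "$tmp" | LC_ALL=C sort -g | awk -v frac="$frac" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit
      k = int(NR * frac)              # number of values to trim at each tail
      print v[k + 1], v[NR - k]       # lower and upper cutoff
    }' > "$tmp.cut"

  # Pass 2: keep only the intervals whose signal lies within the cutoffs.
  awk 'BEGIN { FS = "\t" }
       NR == FNR { split($0, c, " "); lo = c[1]; hi = c[2]; next }
       $4 + 0 >= lo + 0 && $4 + 0 <= hi + 0' "$tmp.cut" "$tmp"

  rm -f "$tmp" "$tmp.cut"
)
```

Intervals whose value equals a cutoff are kept, so with tied values slightly fewer than the requested fraction may be removed. The original row order is preserved.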

4. "scripts/HEK293/4_subtacting_noPAP_from_bPAP"

  • Count the reads in the 3' end regions using bedtools to normalise between the +bPAP and no bPAP libraries. The two library types differ in complexity, but the profile of polyadenylated transcripts at the 3' end is expected to be fairly similar in both
  • Estimate size factors from 3' end regions using DESeq2 and apply them via bedtools and awk
  • Subtract no bPAP from +bPAP at single nucleotide resolution using deepTools
  • Remove the negative values that arise from regions that do not match between +bPAP and no bPAP, and quantify them using awk, ggplot2 and ggpubr
  • Generate bedgraphs for viewing in the Integrative Genomics Viewer (IGV), with negative values for the reverse strand, using awk
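The last two steps of this stage can be sketched with plain awk. This is an illustration under assumptions, not the repository's code: the function names `drop_negatives` and `negate_signal` are hypothetical, the input is assumed to be a tab-separated 4-column bedGraph, and negative intervals are assumed to be dropped rather than set to zero.

```shell
# Hypothetical helper: drop intervals with a negative signal after subtraction
# and report how many were removed on stderr.
drop_negatives() {
  awk 'BEGIN { FS = OFS = "\t" }
       $4 + 0 < 0 { neg++; next }
       { print }
       END { printf "removed %d of %d intervals\n", neg, NR > "/dev/stderr" }'
}

# Hypothetical helper: flip the sign of the signal column so that a
# reverse-strand track is drawn below the axis in IGV.
# The sign is changed on the text of the field, so the original number of
# decimal places is preserved; zero is left as-is to avoid printing "-0".
negate_signal() {
  awk 'BEGIN { FS = OFS = "\t" }
       {
         if ($4 ~ /^-/)        $4 = substr($4, 2)
         else if ($4 + 0 != 0) $4 = "-" $4
         print
       }'
}
```

A possible use, with placeholder file names: `drop_negatives < subtracted_rev.bedgraph | negate_signal > subtracted_rev_igv.bedgraph`.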

5. "scripts/HEK293/5_metagene_analysis"

  • Compute and plot metagenes using deepTools

6. "scripts/HEK293/6_spliceJunction_analysis"

  • Calculate splicing efficiency using SPLICE-q
  • Calculate mean splicing efficiency per sample and visualise the distributions using ggplot2 and ggpubr

7. "scripts/HEK293/7_synthesis_decay_pausing_rate_analysis"

  • Count the reads at TSS, gene body, gene end and readthrough regions using bedtools map by summing the signal in the processed bedgraph files
  • Prepare a counts table covering all regions and all samples, and normalise the summed signal by dividing it by the width of each region
  • Calculate synthesis and decay rates, pausing index and termination efficiency. Cluster the genes by synthesis rate using k-means and visualise the clusters using ggpubr
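As an illustration of the width normalisation, the sketch below computes a pausing index per gene. It rests on assumptions: the README does not state the formula, so the widely used definition (signal density in the TSS region divided by signal density in the gene body) is used here, and the function name `pausing_index` and the 5-column input layout are hypothetical. The repository computes these quantities in its own scripts.

```shell
# Hypothetical helper (not from the repository).
# Input (tab-separated, no header):
#   gene  tss_sum  tss_width  body_sum  body_width
# Output: gene and pausing index, or NA when a width or the gene-body signal
# is not positive (this avoids division by zero).
pausing_index() {
  awk 'BEGIN { FS = OFS = "\t" }
       {
         if ($3 <= 0 || $5 <= 0 || $4 <= 0) { print $1, "NA"; next }
         tss_density  = $2 / $3       # summed signal per base, TSS region
         body_density = $4 / $5       # summed signal per base, gene body
         printf "%s\t%.4f\n", $1, tss_density / body_density
       }'
}
```

For example, a gene with a summed signal of 100 over a 200-nt TSS region and 500 over a 10,000-nt gene body has densities of 0.5 and 0.05, giving a pausing index of 10.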

8. "scripts/HEK293/8_TSS_enrichment_in_bPAP_over_noPAP"

  • Calculate the median fold change of the signal in +bPAP over no bPAP at the TSS region
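A minimal sketch of a median fold-change calculation follows. The function name `median_fold_change`, the 3-column input layout and the choice to skip regions without no bPAP signal are assumptions for illustration. The repository may treat zeros differently, for instance with a pseudocount.

```shell
# Hypothetical helper (not from the repository).
# Input (tab-separated, no header): region_id  bPAP_signal  noPAP_signal
# Prints the median of bPAP/noPAP over all regions with a positive
# no bPAP signal, or NA if no region qualifies.
median_fold_change() {
  awk 'BEGIN { FS = "\t" } $3 + 0 > 0 { printf "%.10g\n", $2 / $3 }' |
    LC_ALL=C sort -g |
    awk '{ v[NR] = $1 }
         END {
           if (NR == 0) { print "NA"; exit }
           if (NR % 2) m = v[(NR + 1) / 2]                # odd count: middle value
           else        m = (v[NR / 2] + v[NR / 2 + 1]) / 2  # even count: mean of the two middle values
           printf "%.4f\n", m
         }'
}
```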

9. "scripts/HEK293/9_comparison_with_TTseq_PROseq_NETseq"

10. "scripts/HEP3B/1_prepare_ATACseqPeaks_and_FANTOM5_forMetagenes"

  • Prepare Anna's ATAC-Seq peaks (GSE172053) and the FANTOM5 enhancer data in the same way as the genome annotation for HEK293 cells, to obtain clean, non-intersecting regions for a reliable metagene representation

11. "scripts/HEP3B/2_prepare_SNUseq_ChIPseq_data"

12. "scripts/HEP3B/3_signal_presence_in_ATACpeak_and_FANTOM5_regions"

  • Determine the SNU-seq, H3K27ac and H3K4me3 signal on the ATAC-Seq peaks and FANTOM5 regions, and quantify its distribution

13. "scripts/HEP3B/4_metagene_analysis"

  • Sort the ATAC-Seq peaks and FANTOM5 annotations accordingly, and compute and plot metagenes using deepTools

Dependencies & Environments

This pipeline utilizes multiple Conda environments to manage dependencies for different stages of the analysis. All environment configuration files are located in the envs/ directory.

Environment List

| Analysis Stage | Environment File | Key Tools |
|---|---|---|
| DeepTools | envs/deeptools_env.yml | DeepTools, Python 3.8 |
| QC aggregation | envs/multiqc_env.yml | MultiQC |
| Genomic-A flagging | envs/py27.yml | Python 2.7 scripts |
| Splicing analysis | envs/spliceQ_env.yml | SPLICE-q |

Installation

To replicate a specific environment, use the following command structure:

# Example: create the DeepTools environment ("mamba" can be replaced directly with "conda")
mamba env create -f envs/deeptools_env.yml

# Activate the environment ("mamba" can be replaced directly with "conda")
mamba activate deeptools_env

R Dependencies

R packages are included in the respective conda YAML files where possible. The R and package versions used in most of the scripts are listed in envs/R_versions.txt.

References

  1. manschmi/MexNab_3seq
  2. manschmi/MS_Metagene_Tools
  3. nf-core/rnaseq
  4. For the R packages and other tools used, see the scripts and the envs/ directory.