Skip to content

A pipeline for aligning and peak calling data from ChIP-seq sequencing

Notifications You must be signed in to change notification settings

kernco/chipseq-pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ChIP-Seq Pipeline

Introduction

This pipeline generates a set of peak calls for histone modification marks from raw sequencing data. It can be run on multiple marks simultaneously.

Input Files

The following files are required for input to the pipeline:

* Histone modification read file (gzipped fastq format)
* Control/Input read file (gzipped fastq format)
* Reference genome assembly (fasta format)

Multiple read files for different histone marks can be used by the pipeline and the same set of control reads will be used. While the FAANG consortium guidelines only require single-end reads for ChIP-seq assays, the pipeline can be run with a paired-end option if you have paired-end data.

The reference genome assembly can be downloaded from Ensembl, NCBI, etc. using the wget command. Below are example commands to download various genomes from Ensembl.

# Chicken
wget ftp://ftp.ensembl.org/pub/release-83/fasta/gallus_gallus/dna/Gallus_gallus.Galgal4.dna.toplevel.fa.gz
# Cattle
wget ftp://ftp.ensembl.org/pub/release-83/fasta/bos_taurus/dna/Bos_taurus.UMD3.1.dna.toplevel.fa.gz
# Pig
wget ftp://ftp.ensembl.org/pub/release-83/fasta/sus_scrofa/dna/Sus_scrofa.Sscrofa10.2.dna.toplevel.fa.gz

The assembly file must be unzipped:

gunzip Gallus_gallus.Galgal4.dna.toplevel.fa.gz
gunzip Bos_taurus.UMD3.1.dna.toplevel.fa.gz
gunzip Sus-scrofa.Sscrofa10.2.dna.toplevel.fa.gz

Installation and Requirements

Requirements

The following tools must be installed on your machine:

If you are running this on a cluster using the SLURM scheduler, a script is included to submit this pipeline as a job, loading the required software modules. You will have to contact your system administrator to ensure they are installed.

Installation

You will need to install the git tool on system if it's not already installed.

While in the directory with your data, download the pipeline scripts:

git clone https://github.com/kernco/chipseq-pipeline.git

Alternatively, you can download a zip file from the Github project page and unzip it.

Running the Pipeline

The pipeline is executed with the command

./chipseq-pipeline/run_pipeline.sh [args]

from the directory containing your data, or if you're running on a SLURM cluster:

sbatch [sbatch-args] ./chipseq-pipeline/slurm_script.sh [args]

The [sbatch-args] will depend on the configuration of your cluster. Contact your system administrator if you're unsure what arguments are required.

The pipeline takes the following arguments:

    --broad             Filenames without the .fq.gz to do broad peak calling
                        on. For example, '--broad H3K27me3' will assume a
                        file H3K27me3.fq.gz exists in the current directory.
                        Multiple files/marks can be specified by separating
                        them by commas.
    --narrow            Same as --broad but does narrow peak calling.
    --input             Similar to --broad and --narrow, specifying the
                        control reads.
    --assembly          The filename of the reference genome assembly.
    --genome_size       The size of the mappable genome. We use 1.047e9 for
                        chicken (Galgal4). Generally, this is the number of
                        bases in non-repeat regions of the genome. A good
                        place to get this number is from the "Golden Path
                        Length" listed on the Ensembl page for the assembly
                        (e.g. http://www.ensembl.org/Gallus_gallus/Info/Annotation).
    --paired_end        Set this flag to run the pipeline with paired-end
                        reads. When this flag is set, the pipeline will expect
                        x.R1.fq.gz and x.R2.fq.gz for all x specified in the
                        --broad, --narrow, and --input arguments.

Output Files

A number of files are generated by programs run by the pipeline. The most useful ones are listed here.

* X_peaks.broadPeak - A BED file containing broad peaks identified.
* X_peaks.narrowPeak - A BED file containing narrow peaks identified.
* X.filtered.bam - The read alignments with duplicates marked and low
                   quality alignments removed.

Example command

We have a directory with four read files from 3 histone modification marks and the control reads from chicken, as well as the chicken genome file:

* H3K4me3.fq.gz
* H3K27me3.fq.gz
* H3K4me1.fq.gz
* control.fq.gz
* Galgal4.fa

We want to do broad peak calling on the H3K27me3 marks and narrow peak calling on the H3K4me1 and H3K4me3 marks. To run the pipeline, we execute the following command:

./chipseq-pipeline/run_pipeline.sh --broad H3K27me3 --narrow H3K4me3,H3K4me1 --input control --assembly Galgal4.fa --genome_size 1.047e9

When the pipeline finishes running, the peak calls will be in these output files:

* H3K4me3_peaks.narrowPeak
* H3K4me1_peaks.narrowPeak
* H3K27me3_peaks.broadPeak

There will also be some additional files that were created during the pipeline's execution, such as BAM files containing read alignments.

About

A pipeline for aligning and peak calling data from ChIP-seq sequencing

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages