IMPORTANT NOTE: MPRAflow will not developed further! Please use MPRAsnakeflow for a more advanced pipeline with lots of improvements and fixes.
This pipeline processes sequencing data from Massively Parallel Reporter Assays (MPRA) to create count tables for candidate sequences tested in the experiment.
Check out MPRAflow's details documentation here
NOTE: MPRAflow cannot analyze STARR-seq data. Have a look at the documentation to see some MPRA examples.
This package contains three utilities:
This utility takes in library association sequencing data (FASTQ) and a design file (FASTA) to assign barcodes to the corresponding elements tested. Functionality includes filtering for quality and coverage of barcodes. This utility must be run before the COUNT utility.
This utility processes sequence data (FASTQ) of barcodes from the DNA and RNA fractions of the MPRA experiment and outputs count tables labeled with the element tested and a label provided in the design file. This utility can process multiple replicates and conditions in a parallelized manner. Based on a user specified flag, the pipeline will either output normalized activity for each tested sequence, or will combine the results into a single count matrix compatible with MPRAnalyze.
This workflow is about assocation variant calls with barcodes. Variants are introduced by an error-prone PCR. The workflow takes the sequencing of the region, with barcodes in index read and the reference sequence and maps the reads to the reference, calls variants and associates them with the corresponding barcode.
This workflow is about getting single variant effect from a target with multiple mutations generated by error-prone PCR. The workflow takes counts (e.g. from the count workflow), combines them with an association file (variants to barcodes) and uses a generalized linear model to to detect single variant effects.
- conda
Download here: https://docs.conda.io/en/latest/miniconda.html
git clone https://github.com/shendurelab/MPRAflow.gitThis pipeline uses python2.7 and python3.6 and is set up to run on a Linux system. Two .yml files are provided to create the appropriate environments. The general environment with nextflow located in the home directory called environment.yml and a specific python 2.7 environment in the conf folder: mpraflow_py27.yml.
The different environments are handled internally by nextflow. Therefore your compute node, where you start MPRAflow, have to have access to the internet.
Install the the conda environment. The general conda environment is called MPRAflow.
cd MPRAflow
conda env create -n MPRAflow -f environment.ymlIf you do not have access to the internet, you have to run the previous command on a node with internet. Afterwards you need to start nextflow too (see Steps to run the pipeline). After creation of the second conda environment by nextflow you can cancel it and start it on your internal node. Be aware that folders must have access on all nodes.
Nextflow has problems using conda 4.7 and highet, because the source activate command is replaced by conda activate. If you get error messages after running you can make a symbolik link of the activate command from you bin folder of the conda or miniconda folder to your MPRAflow environment bin folder. E.g. like:
ln -s ~/miniconda3/bin/activate ~/miniconda3/envs/MPRAflow/bin/activateThis pipeline comes with a conf/cluster.config file set up to run on HPC clusters, allowing each process to be run as a separate qsub, sbatch or similar command. The config contains example code for SGE, LSF, and SLURM architectures. The default is SGE.
Please remove the \\ for the architecture you would like to use and place \\ in front of any architectures not currently in use. A '\' in front of all of them runs the pipeline on your local machine. If you run MPRAflow on a cluster system make sure be that you export all environment variables. E.g. this can be done with the -V option by SGE.
NOTE: Please consult your cluster's wiki page for cluster specific commands and change clusterOptions = to reflect these specifications. Additionally, for large libraries, more memory can be specified in this location.
Please use a submit script for steps 2 and 3. For full details of mandatory and optional arguments run:
conda activate MPRAflow
nextflow run count.nf --help
nextflow run association.nf --helpThis pipeline expects the FASTQ files to be demultiplexed and trimmed to only include sequence from the insert, barcodes, and/or UMIs.
-
Create an 'experiment' csv in the format below, including the header.
DNA_R1orRNA_R1is name of the gzipped fastq of the forward read of the DNA or RNA from the defined condition and replicate.DNA_R2orRNA_R2is the corresponding index read with UMIs (excluding sample barcodes) andDNA_R3orRNA_R3of the reverse read. If you do not have UMIs remove the columnsDNA_R2andRNA_R2or leave them empty.Condition,Replicate,DNA_BC_F,DNA_UMI,DNA_BC_R,RNA_BC_F,RNA_UMI,RNA_BC_R condition1,1,cond1_rep1_DNA_FWD_reads.fastq.gz,cond1_rep1_DNA_IDX_reads.fastq.gz,cond1_rep1_DNA_REV_reads.fastq.gz,cond1_rep1_RNA_FWD_reads.fastq.gz,cond1_rep1_RNA_IDX_reads.fastq.gz,cond1_rep1_RNA_REV_reads.fastq.gz condition1,2,cond1_rep2_DNA_FWD_reads.fastq.gz,cond1_rep2_DNA_IDX_reads.fastq.gz,cond1_rep2_DNA_REV_reads.fastq.gz,cond1_rep2_RNA_FWD_reads.fastq.gz,cond1_rep2_RNA_IDX_reads.fastq.gz,cond1_rep2_RNA_REV_reads.fastq.gz condition2,1,cond2_rep1_DNA_FWD_reads.fastq.gz,cond2_rep1_DNA_IDX_reads.fastq.gz,cond2_rep1_DNA_REV_reads.fastq.gz,cond2_rep1_RNA_FWD_reads.fastq.gz,cond2_rep1_RNA_IDX_reads.fastq.gz,cond2_rep1_RNA_REV_reads.fastq.gz condition2,2,cond2_rep2_DNA_FWD_reads.fastq.gz,cond2_rep2_DNA_IDX_reads.fastq.gz,cond2_rep2_DNA_REV_reads.fastq.gz,cond2_rep2_RNA_FWD_reads.fastq.gz,cond2_rep2_RNA_IDX_reads.fastq.gz,cond2_rep2_RNA_REV_reads.fastq.gz -
If you would like each insert to be colored based on different user-specified categories, such as "positive control", "negative control", "shuffled control", and "putative enhancer", to assess the overall quality the user can create a 'label' tsv in the format below that maps the name to category:
insert1_name insert1_label insert2_name insert2_labelThe insert names must exactly match the names in the design FASTA file.
-
Run Association if using a design with randomly paired candidate sequences and barcodes
conda activate MPRAflow nextflow run association.nf --fastq-insert "${fastq_prefix}_R1_001.fastq.gz" --design "ordered_candidate_sequences.fa" --fastq-bc "${fastq_prefix}_R2_001.fastq.gz"
NOTE: This will run in local mode, please submit this command to your cluster's queue if you would like to run a parallelized version.
-
Run Count
conda activate MPRAflow nextflow run count.nf --dir "bulk_FASTQ_directory" --e "experiment.csv" --design "ordered_candidate_sequences.fa" --association "dictionary_of_candidate_sequences_to_barcodes.p"
Be sure that the
experiment.csvis correct. All fastq files must be in the same folder given by the--diroption. If you do not have UMIs please use the option--no-umi. Please specify your barcode length and umi-length with--bc-lengthand--umi-length. -
Run association saturation mutagenesis
conda activate MPRAflow nextflow run association_saturationMutagenesis.nf --fastq-insert SRR8646911_1.fastq.gz --fastq-insertPE SRR8646911_2.fastq.gz --fastq-bc SRR8646911_3.fastq.gz --design TERT.fa --name TERT --outdir out --bc-length 20
-
Run saturation mutagenesis
conda activate MPRAflow nextflow run saturationMutagenesis.nf --dir "directory_of_DNA/RNA_counts" --e "satMutexperiment.csv" --assignment "yourSpecificAssignmentFile.variants.txt.gz"
Note The experiment file is different from the count workflow. It should contain the condition, replicate and filename of the counts, like:
Condition,Replicate,COUNTS condition1,1,cond1_1_counts.tsv.gz condition1,2,cond1_2_counts.tsv.gz condition1,3,cond1_3_counts.tsv.gz condition2,1,cond2_1_counts.tsv.gz condition2,2,cond2_2_counts.tsv.gz condition2,3,cond2_3_counts.tsv.gzThe count files can be generated by the count workflow, are named:
<condition>_<replicate>_counts.tsv.gzand can be found in theouts/<condition>/<replicate>folder. They have to be copied or linked into the--dirfolder.
