Commit e09d4bf

authored
Merge pull request #5 from YeoLab/containerize
Containerize
2 parents f168e1f + 8ca3218 commit e09d4bf

46 files changed: 822 additions and 859 deletions

README.md

Lines changed: 54 additions & 58 deletions
Original file line number | Diff line number | Diff line change
@@ -1,53 +1,51 @@
1-
# oligoCLIP: Antibody barcoded eCLIP(ABC) processing pipeline from fastq.gz to windows and motifs
2-
- [original ABC paper](https://www.nature.com/articles/s41592-022-01708-8): use `snakeABC_SE.smk`
1+
# Mudskipper: Multiplex CLIP processing pipeline from fastq.gz to binding sites and motifs
2+
- [Link to original ABC paper](https://www.nature.com/articles/s41592-022-01708-8): use `snakeABC_SE.smk`
33
- Yeolab paired-end protocol: use `snakeOligoCLIP_PE.smk`
44

55
# Installation
6-
- You need to have Snakemake:
7-
- snakemake instructions [here](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html)
8-
- install snakemake 7.3.8 from this yaml file `rules/envs/snakemake.yaml`
9-
- Yeolab internal users: `module load snakemake/7.3.8`
6+
- Main environment: Snakemake 7.3.8 and scipy:
7+
- [Snakemake Installation](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html)
8+
- install snakemake 7.3.8 using `rules/envs/snakemake.yaml`.
9+
- Snakemake 8 changed several command-line options, so the settings passed through `--profile` would need to be modified for it.
10+
- Singularity 3.11: [Singularity](https://docs.sylabs.io/guides/3.0/user-guide/build_a_container.html).
11+
- If you are on a shared server, ask the system administrator to install it. Installing it yourself can lead to permission issues.
12+
- Not recommended: [install via conda](https://anaconda.org/conda-forge/singularity)
1013
- Download this repository by `git clone https://github.com/YeoLab/Mudskipper.git`.
11-
- Download depending repository and modify config variables as follow: # TODO: containerize or make to snakemake hub
12-
- Yeolab internal users don't need to.
13-
- Install skipper dependecies and modify the following config variables:`JAVA_PATH`,`UMICOLLAPSE_PATH`, `R_EXE`. # TODO: containerize
14-
- follow [skipper instructions](https://github.com/YeoLab/skipper#prerequisites) to set up
15-
- Most dependencies are already specified in `rules/envs`. When running snakemake, using `--use-conda` should automatically install everything for you.
16-
17-
18-
# How to run.
19-
1. prepare `PATH_TO_YOUR_CONFIG`. See below and `config/preprocess_config/oligope_iter5.yaml`
20-
2. Run snakemake
21-
```
22-
snakemake -s snakeOligoCLIP_PE.smk \
23-
-j 12 \
24-
--cluster "qsub -l walltime={params.run_time} -l nodes=1:ppn={params.cores} -q home-yeo -e {params.error_out_file} -o {params.out_file}" \
25-
--configfile PATH_TO_YOUR_CONFIG \
26-
--use-conda \
27-
--conda-prefix /home/hsher/snakeconda -npk
28-
```
29-
- `-s`: use `snakeOligoCLIP_PE.smk` if you did YeoLab internal pair-end protocol. use `snakeABC_SE.smk` if you did ABC
30-
- `--configfile`: yaml file to specify your inputs, including where are the fastqs, what are the barcode, what reference genome...etc.
31-
- the rest just snakemake command line options. [see documentation](https://snakemake.readthedocs.io/en/stable/executing/cli.html)
32-
- `-j`: number of jobs to run at a same time
33-
- `--cluster`: command to submit jobs to cluster.
34-
- `--use-conda`: ask snakemake to install everything for you using conda
35-
- `--conda-prefix`: specify a fixed location to store conda envs to prevent snakemake installing them multiple times
36-
- `-n`: dry run.
37-
- `-k`: keep going even if something failed
38-
- `-p`: print out command
39-
40-
# Config
41-
42-
## Basic Inputs:
14+
15+
16+
# How to run. (Using ABC as an example)
17+
1. Download the data from GEO/SRA: [GSE205536](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE205536)
18+
2. Prepare config and manifest `PATH_TO_YOUR_CONFIG`. Example inputs:
19+
- config file: `config/preprocess_config/oligose_k562.yaml`
20+
- manifest: `config/fastq_csv/ABC_2rep.csv`
21+
- barcode csv: `config/barcode_csv/ABC_barcode.csv`
22+
3. Adjust profile for your cluster and computing resource:
23+
- see `profiles/tscc2` as an example
24+
- for each option, [see documentation](https://snakemake.readthedocs.io/en/stable/executing/cli.html)
25+
4. Run snakemake
26+
```
27+
snakemake -s snakeABC_SE.smk \
28+
--configfile config/preprocess_config/oligose_k562_noalt_smalltest.yaml \
29+
--profile profiles/tscc2 \
30+
-n
31+
```
32+
- `-s`: use `snakeOligoCLIP_PE.smk` for the Yeo lab internal paired-end protocol, or `snakeABC_SE.smk` for ABC.
33+
- `--configfile`: YAML file specifying your inputs, including the fastq locations, the barcodes, and the reference genome.
34+
- `-n`: dry run.
35+
- the rest of the options are in `--profile`. Adjust as needed. [see documentation](https://snakemake.readthedocs.io/en/stable/executing/cli.html)
36+
37+
38+
The sections below describe what to write in your config.
39+
# Options for Input files
4340
- Multiplex Example:
4441
- Yeo lab internal pair-end protocol: `config/preprocess_config/oligope_iter5.yaml`
4542
- ABC single-end protocol: `config/preprocess_config/ABC_2rep.yaml`
4643
- Singleplex Example:
4744
- ABC single-end protocol: `config/preprocess_config/oligose_single_rbfox2_hek.yaml`
4845
- Yeo lab internal paired-end protocol: `/home/hsher/projects/oligoCLIP/config/preprocess_config/oligope_v5_nanos2.yaml`
4946
- Process 1 type of singleplex per 1 manifest.
50-
### `MANIFEST`: a csv specifying fastq locations, replicates
47+
48+
## `MANIFEST`: a csv specifying fastq locations, replicates
5149
- Example:
5250
- Multiplex Example:
5351
- Yeo lab paired-end: `config/fastq_csv/katie_pe_iteration5.csv`
@@ -61,7 +59,8 @@ snakemake -s snakeOligoCLIP_PE.smk \
6159
- `fastq1`&`fastq2`: *.fastq.gz file for read1 and read 2
6260
- `libname`: unique name for each library. Must not contain spaces or special characters such as `#`, `%`, `*`.
6361
- `experiment`: unique name for each experiment. **Rows with the same `experiment` will be treated as replicates.** Must not contain spaces or special characters such as `#`, `%`, `*`.
64-
### `barcode_csv`: specifies the barcode sequence for each antibody/RBP
62+
63+
## `barcode_csv`: specifies the barcode sequence for each antibody/RBP
6564
- Example: `config/barcode_csv/iter5.csv`
6665
- Notebook to generate this file (Yeolab internal user): `utils/generate barcode-iter5.ipynb`
6766
- delimiter: `:`
@@ -71,54 +70,51 @@ snakemake -s snakeOligoCLIP_PE.smk \
7170
- ABC: read starts with this sequence.
7271
- 2nd column: Antibody/RBP name. Must not contain spaces or special characters such as `#`, `%`, `*`.
7372
74-
### Outputs
73+
# Options to Control Output
7574
- `WORKDIR`: output directory
7675
- `RBP_TO_RUN_MOTIF`: list of RBP names to run motif analysis. Must be one of the rows in `barcode_csv`.
7776
- `run_clipper`: True if you want CLIPper outputs (works, but slow)
7877
- `run_skipper`: True if you want to run Skipper. (usually doesn't work in ABC)
7978
- `run_comparison`: True if you want to run Piranha
8079
- `debug`: True if you want to debug. This tries to BLAST the unmapped reads.
8180
82-
### Choosing backgrounds
81+
# Options to Choose Backgrounds
8382
By default, if the options below are left blank, we run the Dirichlet Multinomial Mixture (DMM) model for multiplex datasets, where RBPs are explicitly compared with each other. DMM is the best model for multiplex datasets.
8483
8584
Unfortunately, DMM doesn't work for singleplex data. Calling singleplex binding sites requires an "external control" (see below). Without one, the pipeline stops at the read-counting stage.
8685
8786
If you want to add a background library, here is how to do it:
88-
#### "Internal control": a barcode that measures the background. They are in the same `fastq.gz`
87+
88+
## "Internal control": a barcode that measures the background. They are in the same `fastq.gz`
8989
- `AS_INPUT`: if you have an IgG antibody that everything will be normalized against, type its name here. Must be one of the rows in `barcode_csv`. This can be the background for skipper, CLIPper, and the beta-binomial mixture model.
90-
#### "External control": a library that is NOT in the same fastq as your oligoCLIP/ABC
90+
91+
## "External control": a library that is NOT in the same fastq as your oligoCLIP/ABC
9192
- specify them in `external_bam` with name of the library (first line, ex `oligoCLIP_ctrlBead_rep2`), followed by `file:` and `INFORMATIVE_READ`
92-
```
93-
# For example:
94-
oligoCLIP_ctrlBead_rep2:
95-
file: /home/hsher/scratch/oligo_PE_iter7/1022-Rep2/bams/ctrlBead.rmDup.Aligned.sortedByCoord.out.bam
96-
INFORMATIVE_READ: 1
97-
```
93+
```
94+
# For example:
95+
oligoCLIP_ctrlBead_rep2:
96+
file: /home/hsher/scratch/oligo_PE_iter7/1022-Rep2/bams/ctrlBead.rmDup.Aligned.sortedByCoord.out.bam
97+
INFORMATIVE_READ: 1
98+
```
9899
- This can be an eCLIP SMInput, total RNA-seq, IgG pull down from another experiment, bead control, spike-ins
99100
- these will also be used as a background in skipper, CLIPper and beta-binomial mixture model
100101
- the bams must be aligned with the exact same STAR index as `STAR_DIR`, and should preferably be processed with the same or similar mapping parameters as this repo or skipper.
101102
102103
103-
104-
## Dependencies:
105-
- `SCRIPT_PATH`: Absolute path to `scripts` folder.
106-
- `JAVA_PATH`,`UMICOLLAPSE_PATH`, `R_EXE`: skipper dependencies. See `Installation`.
107-
108-
## Preprocessing options:
104+
# Preprocessing Options:
109105
- `adaptor_fwd`,`adaptor_rev`: adapter sequence to trim. Do not include barcode
110106
- `tile_length`: we tile adapter sequences into pieces of this length so that indels do not interfere with trimming
111107
- `QUALITY_CUTOFF`: default 15. cutadapt params
112108
- `umi_length`: Length of unique molecular identifier (UMI).
113109
- `STAR_DIR`: directory to STAR index
114110
115-
## Annotations:
111+
# Annotation Options:
116112
- skipper annotations: [follow skipper instructions](https://github.com/YeoLab/skipper#prerequisites) or generate with [skipper_utils](https://github.com/algaebrown/skipper_utils)
117113
- Yeolab internal users: Brian had all sorts of annotations here `/projects/ps-yeolab4/software/skipper/1.0.0/bin/skipper/annotations/`.
118114
- `CHROM_SIZES`
119115
- `GENOMEFA`
120116
121-
# Output Files
117+
# Output files
122118
## Trimmed fastqs, bams, bigwigs:
123119
These are in the `EXPERIMENT_NAME` folders. For example, in your manifest.csv, there are two experiments, "GN_1019" and "GN_1020", then, under the `GN_1019/` folder you would see the following:
124120
1. `fastqs`: The trimmed and the demultiplexed fastqs.
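
The README's manifest rules can be checked before launching a run: `libname` must be unique, names must avoid spaces and `#`, `%`, `*`, and rows sharing an `experiment` are treated as replicates and end up in one experiment folder. The following is a minimal sketch, not part of the repository. The function name `check_manifest`, the exact forbidden-character pattern and the example rows are assumptions made for illustration.

```python
import csv
import io
import re
from collections import defaultdict

# Characters the README asks you to avoid in `libname` and `experiment`.
# The exact pattern (any whitespace plus #, %, *) is an assumption.
FORBIDDEN = re.compile(r"[\s#%*]")


def check_manifest(handle):
    """Return (problems, replicates) for a manifest-like CSV file handle."""
    problems = []
    replicates = defaultdict(list)  # experiment -> libnames treated as replicates
    seen = set()
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(csv.DictReader(handle), start=2):
        libname, experiment = row["libname"], row["experiment"]
        for column, value in (("libname", libname), ("experiment", experiment)):
            if FORBIDDEN.search(value):
                problems.append(
                    f"row {row_number}: {column} {value!r} has a forbidden character"
                )
        if libname in seen:
            problems.append(f"row {row_number}: duplicate libname {libname!r}")
        seen.add(libname)
        replicates[experiment].append(libname)
    return problems, dict(replicates)


# Hypothetical manifest content, for illustration only.
example = io.StringIO(
    "fastq1,libname,experiment\n"
    "a.fastq.gz,GN_1019_rep1,GN_1019\n"
    "b.fastq.gz,GN_1019_rep2,GN_1019\n"
    "c.fastq.gz,GN 1020,GN_1020\n"
)
problems, replicates = check_manifest(example)
print(problems)
print(replicates)
```

In this example the two `GN_1019` rows are grouped as replicates, and the library named `GN 1020` is reported because it contains a space.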

generate_output.py

Lines changed: 23 additions & 5 deletions
@@ -61,6 +61,9 @@ def skipper_outputs():
6161
libname = libnames,
6262
sample_label = config['RBP_TO_RUN_MOTIF'],
6363
signal_type = ['CITS', 'COV']
64+
)+expand("skipper_CC/enriched_re/{libname}.{sample_label}.enriched_re.tsv.gz",
65+
libname = libnames,
66+
sample_label = list(set(rbps)-set(config['AS_INPUT']))
6467
)
6568
# normalize to external bams
6669
if external_normalization:
@@ -69,6 +72,11 @@ def skipper_outputs():
6972
external_label = list(external_normalization.keys()),
7073
libname = libnames,
7174
clip_sample_label = list(set(rbps)-set(config['AS_INPUT']))
75+
)+expand("skipper_external/{external_label}/homer/finemapped_results/{signal_type}/{libname}.{clip_sample_label}/homerResults.html",
76+
external_label = list(external_normalization.keys()),
77+
libname = libnames,
78+
clip_sample_label = config['RBP_TO_RUN_MOTIF'],
79+
signal_type = ['CITS', 'COV']
7280
)
7381
return outputs
7482

@@ -146,7 +154,15 @@ def DMN_outputs():
146154
libname = libnames,
147155
)+expand('mask/{libname}.repeat_mask.csv',
148156
libname = libnames,
157+
)+expand("DMM_repeat/{repeat_type}/{libname}.{sample_label}.enriched_windows.tsv",
158+
sample_label = rbps,
159+
repeat_type = ['name'],
160+
libname = libnames
161+
)+expand("DMM_repeat/{repeat_type}/{libname}.megaoutputs.tsv",
162+
libname = libnames,
163+
repeat_type = ['name']
149164
)
165+
150166

151167
return outputs
152168

@@ -161,7 +177,7 @@ def clipper_outputs():
161177
libname = libnames
162178
)
163179
# complementary control
164-
output+=expand("CLIPper_CC/{libname}.{sample_label}.peaks.normed.compressed.annotate.bed",
180+
outputs+=expand("CLIPper_CC/{libname}.{sample_label}.peaks.normed.compressed.annotate.bed",
165181
sample_label = list(set(rbps)-set(config['AS_INPUT'])),
166182
libname = libnames
167183
)+expand("CLIPper_CC/{libname}.{sample_label}.peaks.normed.compressed.motif.svg",
@@ -178,13 +194,15 @@ def clipper_outputs():
178194
return outputs
179195

180196
def comparison_outputs():
181-
outputs = expand("comparison/piranha/CC/{libname}.{sample_label}.bed",
197+
outputs = expand("comparison/piranha/{bg}/{libname}.{sample_label}.bed",
182198
libname = libnames,
183199
sample_label =list(set(rbps)-set(config['AS_INPUT'])),
184-
)+expand("comparison/pureclip/{libname}.{sample_label}.bind.bed",
185-
libname = libnames,
186-
sample_label = list(set(rbps)-set(config['AS_INPUT']))
200+
bg = ['CC', 'nobg']
187201
)
202+
# )+expand("comparison/pureclip/{libname}.{sample_label}.bind.bed",
203+
# libname = libnames,
204+
# sample_label = list(set(rbps)-set(config['AS_INPUT']))
205+
# ) # very slow to run
188206
# )+expand("comparison/omniCLIP/output/{libname}.{sample_label}.omniclip_done.txt",
189207
# libname = libnames,
190208
# sample_label = list(set(rbps)-set(config['AS_INPUT']))
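
The changes above rely on `expand` producing one target path per combination of wildcard values, which is why adding `bg = ['CC', 'nobg']` doubles the Piranha targets. The sketch below imitates that behaviour in plain Python so it runs without Snakemake. `expand_like` and `as_name_set` are hypothetical helpers, and the library and RBP names are made up. How `AS_INPUT` is stored in the config is not visible in this diff, so the string case shown here is only an assumption worth checking.

```python
from itertools import product


def expand_like(pattern, **wildcards):
    """Fill `pattern` with every combination of the wildcard values.

    A plain-Python stand-in for Snakemake's `expand`, which by default also
    takes the product of all wildcard lists.
    """
    names = list(wildcards)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(wildcards[name] for name in names))
    ]


def as_name_set(value):
    """Turn None, a single name, or a list of names into a set of names."""
    if value is None:
        return set()
    if isinstance(value, str):
        return {value}
    return set(value)


# Hypothetical values; in the pipeline these come from the config and manifest.
libnames = ["lib1"]
rbps = ["RBFOX2", "IgG"]
as_input = "IgG"

# set("IgG") is {"I", "g", "G"}, so subtracting it would not drop "IgG".
assert set(rbps) - set(as_input) == {"RBFOX2", "IgG"}

outputs = expand_like(
    "comparison/piranha/{bg}/{libname}.{sample_label}.bed",
    libname=libnames,
    sample_label=sorted(set(rbps) - as_name_set(as_input)),
    bg=["CC", "nobg"],
)
print(outputs)
```

With one library and one non-control RBP, the two `bg` values give two target paths, one under `CC/` and one under `nobg/`.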

profiles/tscc2/config.yaml

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
1+
verbose: true
2+
notemp: true
3+
latency: 60
4+
printshellcmds: true
5+
skip-script-cleanup: true
6+
nolock: true
7+
keep-going: true
8+
cluster: "sbatch -t {params.run_time} -e {params.error_out_file} -o {params.out_file} -p condo -q condo -A csd792 --mem {params.memory} --tasks-per-node {params.cores} -J {rule}"
9+
use-singularity: true
10+
singularity-args: "--bind /tscc"
11+
singularity-prefix: /tscc/nfs/home/hsher/scratch/singularity
12+
use-conda: true
13+
conda-prefix: "/tscc/nfs/home/hsher/snakeconda"
14+
conda-frontend: conda
15+
jobs: 30
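
The `cluster:` entry in this profile is a template whose `{params.…}` and `{rule}` placeholders are filled in per job. The sketch below uses Python's own `str.format`, which supports the same attribute-style placeholders, to show what a rendered submission command might look like. It does not reproduce Snakemake's internal formatting. The template is shortened, and the rule name and parameter values are hypothetical.

```python
from types import SimpleNamespace

# Shortened copy of the `cluster:` template from profiles/tscc2/config.yaml.
template = (
    "sbatch -t {params.run_time} --mem {params.memory} "
    "--tasks-per-node {params.cores} -J {rule}"
)

# Hypothetical per-rule values; real ones come from each rule's `params:`.
params = SimpleNamespace(run_time="02:00:00", memory="16G", cores=4)

command = template.format(params=params, rule="align_reads")
print(command)
```

Every rule submitted through this profile therefore needs `run_time`, `memory`, `cores`, `error_out_file` and `out_file` in its `params`, since the full template references all of them.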

profiles/tscc2_single/config.yaml

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
1+
verbose: true
2+
notemp: true
3+
latency: 60
4+
printshellcmds: true
5+
skip-script-cleanup: true
6+
nolock: true
7+
keep-going: true
8+
use-singularity: true
9+
singularity-args: "--bind /tscc/projects --bind /tscc/nfs/home/hsher/scratch --bind /tscc/nfs/home/hsher/snakeconda"
10+
singularity-prefix: /tscc/nfs/home/hsher/scratch/singularity
11+
use-conda: true
12+
conda-prefix: "/tscc/nfs/home/hsher/snakeconda"
13+
conda-frontend: conda
14+
jobs: 8
