Configuration File
scPipe uses Snakemake as its workflow manager. To configure runs, Snakemake generally requires one configuration file; scPipe requires two. The first configuration file, config.yaml (found in the files subdirectory), points to a run-specific configuration file and specifies the desired output from the current run. This allows users to keep multiple configuration files for different runs without continually rewriting large configuration files or pointing Snakemake to different files. The expected formats for both this "master" configuration file and the run-specific configuration file are detailed below.
The master configuration file for scPipe should be a .yaml file with the name config.yaml. It should be located in the files subdirectory of the top-level scPipe directory. The file expects two entries:
-
config
- File path pointing to a run-specific configuration file. Run-specific configuration files should also be .yaml files; however, as their location is user-specified, they can be located wherever you like.
- Example:
config: "/home/dakota/scPipe/files/pilot_data_config.yaml"
-
output
- Keyword denoting the desired output from the run. The keyword specifies how much of the workflow to execute before stopping. This is useful when you have multiple sets of fastq files from different datasets, wish to process/align them separately, and then combine the data at the count matrix level.
- Example:
output: "count_matrix"
- Keywords:
- alignment: run qc and alignment steps. Final output will be a read count file for each individual sample.
- count_matrix: combine single counts files into a count matrix. Final output will be a .csv file of the matrix.
- multiqc: run multiqc for qc, alignment, and quantification steps. Output is a directory containing multiqc output.
- metadata: extract metadata, using user-provided regular expressions, from sample names in the count matrix.
- combined: combine count matrices from different alignment runs.
- filtered: quality controlled count matrix, with empty cells and lowly expressed genes filtered out.
- normalized: count matrix with values normalized between samples with batch effects removed.
- imputed: count matrix with imputed zero values. This is the last step of the pipeline, and will result in the same output as passing 'final'.
- final: run through the entire pipeline.
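As a rough illustration of how the output keyword limits a run, the sketch below returns every stage up to the requested keyword. It assumes the stages execute in the order the keywords are listed above, which this page does not state explicitly; `STAGES` and `stages_to_run` are hypothetical names, not part of scPipe.

```python
# Keywords in the order they are listed above. Treating this as the
# execution order is an assumption made for illustration only.
STAGES = [
    "alignment", "count_matrix", "multiqc", "metadata",
    "combined", "filtered", "normalized", "imputed",
]


def stages_to_run(output):
    """Return the stages executed for a given `output` keyword."""
    if output == "final":
        # 'final' is documented as producing the same output as 'imputed'.
        output = STAGES[-1]
    if output not in STAGES:
        raise ValueError(f"unknown output keyword: {output!r}")
    return STAGES[: STAGES.index(output) + 1]


# A master configuration, written here as the dict a YAML parser would return.
master_config = {
    "config": "/home/dakota/scPipe/files/pilot_data_config.yaml",
    "output": "count_matrix",
}
print(stages_to_run(master_config["output"]))
```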
The run-specific configuration file contains largely dataset-dependent parameters required by many of the tools used within the pipeline. The file should be a .yaml file, though because the master configuration file points directly to it, you can place it anywhere on your computer. Many parameters are nested within groups to make the file less unwieldy.
The dirs entry in the run-specific configuration file contains all the directories required for scPipe to run.
dirs:
  data:
  output:
  log:
  genome:
  matrices:
-
data
- Directory containing sample sub-directories with fastq files.
- Example
data: "/home/dakota/scPipe/data/fastq"
-
output
- Directory to hold output generated by scPipe.
- Example
output: "/home/dakota/scPipe/output/"
-
log
- Directory to hold log files generated by scPipe.
- Example
log: "/home/dakota/scPipe/logs/"
-
genome
- Directory that either holds, or will hold, genome index files generated by STAR.
- Example
genome: "/home/dakota/data/star"
-
matrices
- List of directories holding previously generated count matrices. Each directory is assumed to contain a single count matrix generated by scPipe with the name 'count_matrix.csv'.
- Example
matrices: ["/home/dakota/scPipeOut/matrix"]
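The dirs group can be sanity-checked before a run. The sketch below is a minimal example, not scPipe's own validation: it assumes data, output, log, and genome are always required while matrices may be omitted, and `check_dirs` is a hypothetical helper. It also builds the 'count_matrix.csv' path expected inside each matrices directory.

```python
import posixpath

# Assumption: these four entries are always needed; 'matrices' is only
# needed when combining with previously generated count matrices.
REQUIRED_DIRS = ("data", "output", "log", "genome")


def check_dirs(dirs):
    """Validate the `dirs` group and return expected prior matrix paths."""
    missing = [key for key in REQUIRED_DIRS if not dirs.get(key)]
    if missing:
        raise KeyError(f"missing dirs entries: {', '.join(missing)}")
    matrices = dirs.get("matrices") or []
    if not isinstance(matrices, list):
        raise TypeError("dirs.matrices should be a list of directories")
    # Each directory is expected to hold a file named 'count_matrix.csv'.
    return [posixpath.join(m, "count_matrix.csv") for m in matrices]


dirs = {
    "data": "/home/dakota/scPipe/data/fastq",
    "output": "/home/dakota/scPipe/output/",
    "log": "/home/dakota/scPipe/logs/",
    "genome": "/home/dakota/data/star",
    "matrices": ["/home/dakota/scPipeOut/matrix"],
}
print(check_dirs(dirs))
```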
Required files for scPipe.
files:
  gtf:
  genome_fasta:
-
gtf
- Genome annotation file in .gtf format.
- Example:
gtf: "/home/dakota/scPipeOut/files/annotations.gtf"
-
genome_fasta
- Genome fasta file containing all genomic sequences.
- Example:
genome_fasta: "/home/dakota/scPipeOut/files/genome.fa"
Dataset-dependent parameters.
dataset:
  read_length:
  id:
  ends:
-
read_length
- Length of reads. If reads are paired-end, the value should be $2 * |seq length|$, where seq length is the length of a read in a single sample .fastq file.
- Example:
read_length: 152
-
id
- Identification string for current dataset.
- Example:
id: 2018-07-01
-
ends
- Nested key-value pairs that denote pair direction in fastq file names.
- Example:
ends:
  R1: forward
  R2: backward
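The two less obvious dataset entries can be illustrated with a short sketch. It assumes that read-direction tags are matched as plain substrings of the file name, which this page does not specify, and the file name used is hypothetical. Neither helper is part of scPipe.

```python
def paired_read_length(seq_length, paired=True):
    """Value for `read_length`: twice the per-read length if paired-end."""
    return 2 * seq_length if paired else seq_length


def pair_direction(filename, ends):
    """Return the direction whose tag occurs in the fastq file name.

    Substring matching is an assumption made for this example.
    """
    for tag, direction in ends.items():
        if tag in filename:
            return direction
    return None


ends = {"R1": "forward", "R2": "backward"}
print(paired_read_length(76))                     # reads of 76 bases, paired-end
print(pair_direction("sample_1_R2.fastq", ends))  # hypothetical file name
```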
Command-line parameters for command-line tools used in scPipe (i.e. fastp, STAR, and featureCounts).
params:
  fastp: ''
  star_genome: ''
  star_align: ''
  featureCount: ''
-
fastp
- Optional arguments to pass to fastp. See the fastp documentation for specifics.
- Example
fastp: '-y -Y 30'
-
star_genome
- Optional arguments during genome indexing with STAR. See the STAR documentation for specifics.
- Example
star_genome: '--genomeChrBinNbits 13'
-
star_align
- Optional arguments during read alignment with STAR. See the STAR documentation for specifics.
- Example
star_align: '--runThreadN 2'
-
featureCount
- Optional arguments during read quantification with featureCounts. See the Subread documentation for specifics.
- Example
featureCount: '' (passing no additional arguments is fine)
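Each params entry is a single string of extra arguments. One plausible way to append such a string to a tool invocation is shown below; this is a sketch under assumptions, not scPipe's actual command construction. `extend_command` and the fastq file names are hypothetical, while `-i` and `-o` are fastp's input and output options.

```python
import shlex


def extend_command(base, extra):
    """Append user-supplied arguments (one string) to a base command list."""
    # shlex.split honours shell-style quoting and returns [] for ''.
    return list(base) + shlex.split(extra)


params = {"fastp": "-y -Y 30", "featureCount": ""}

# Hypothetical base invocation with placeholder file names.
fastp_cmd = extend_command(
    ["fastp", "-i", "sample_R1.fastq", "-o", "trimmed_R1.fastq"],
    params["fastp"],
)
print(fastp_cmd)

# An empty string adds nothing, so leaving an entry as '' is harmless.
print(extend_command(["featureCounts"], params["featureCount"]))
```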
Flags determining whether to run specific parts of the pipeline/calculate specific arguments.
flags:
  star_est_ChrBinsNBits:
  combine_data:
-
star_est_ChrBinsNBits
- Whether to calculate the parameter --genomeChrBinNbits in STAR genome indexing. If yes, do not supply the argument yourself in the params section.
- Example
star_est_ChrBinsNBits: True
-
combine_data
- Whether to combine count matrices from separate runs with the count matrix generated by the current run. If yes, directories should be supplied under the [dirs][matrices] argument. Combined data will be saved in the expected /outputdir/matrix/count_matrix.csv file, with an extra /outputdir/matrix/original_count_matrix.csv generated to hold the un-combined data.
- Example
combine_data: True
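For orientation, the STAR manual suggests scaling its chromosome-bin parameter (spelled `--genomeChrBinNbits` there) as min(18, log2(max(GenomeLength / NumberOfReferences, ReadLength))). The sketch below implements that suggestion and a guard against supplying the argument twice. Whether scPipe computes the value this way is an assumption, as is rounding down to an integer; both function names are hypothetical.

```python
import math


def estimate_chr_bin_nbits(genome_length, n_references, read_length):
    """STAR manual's suggested scaling; rounding down is an assumption."""
    per_reference = max(genome_length / n_references, read_length)
    return min(18, int(math.log2(per_reference)))


def check_no_conflict(flags, params):
    """Reject configs that both estimate and hand-supply the parameter."""
    supplied = "genomeChrBin" in params.get("star_genome", "")
    if flags.get("star_est_ChrBinsNBits") and supplied:
        raise ValueError(
            "star_est_ChrBinsNBits is True; remove the chromosome-bin "
            "argument from params.star_genome"
        )


# A fragmented assembly: 1 Gb spread over 100,000 contigs.
print(estimate_chr_bin_nbits(1_000_000_000, 100_000, 152))
check_no_conflict({"star_est_ChrBinsNBits": True}, {"star_genome": ""})
```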
Regular expressions necessary to extract sample-level information from file names.
sample_regex:
id:
<metadata regex>
batch:
<optional>
-
id
- Regular expression that captures the end of a sample id (e.g. a file with the name <sample_id>_8.fastq could have the regular expression '_[0-9].fastq$'). The sample id can then be extracted by taking the substring from the start of the filename to the first matching position.
- Example:
id: '_[0-9]*.$' (As of August 30, 2018, the single quotes are necessary for pyaml to correctly parse the regular expression.)
-
batch
- Required key-value pairs of metadata regular expressions matching batch ids to sample ids. Assumes that if the pattern is present in the sample filename, the sample is from the matching batch.
- Example
batch:
  1: '-1-'
  2: '-2-'
  3: '-3-'
-
Optional Metadata Regular Expressions
- Following the key-value pattern, users can include any other regular expressions to extract sample-level information from filenames (e.g. treatment, time points, etc.). scPipe can handle an arbitrary number of entries, so long as the patterns exist within the filenames.
- Example
treatment:
  ASW: '^ASW'
  Chlorate: '^Chlorate'
timepoint:
  '18hpf': ''
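The extraction rules described in this section can be sketched with Python's re module. This is an illustrative reading of the text, not scPipe's own code: `sample_id` takes the substring before the first match of the id pattern, and `match_label` returns the first label whose pattern is found in the filename. The file name is hypothetical.

```python
import re


def sample_id(filename, id_regex):
    """Substring from the start of the filename to the first match."""
    match = re.search(id_regex, filename)
    return filename[: match.start()] if match else filename


def match_label(filename, patterns):
    """Return the first label whose pattern occurs in the filename."""
    for label, pattern in patterns.items():
        if re.search(pattern, filename):
            return label
    return None


sample_regex = {
    "id": r"_[0-9].fastq$",
    "batch": {1: "-1-", 2: "-2-", 3: "-3-"},
    "treatment": {"ASW": "^ASW", "Chlorate": "^Chlorate"},
}

name = "ASW-2-cellA_8.fastq"  # hypothetical file name
print(sample_id(name, sample_regex["id"]))
print(match_label(name, sample_regex["batch"]))
print(match_label(name, sample_regex["treatment"]))
```

Note that an empty pattern such as '' matches every filename under `re.search`, so a label defined that way would apply to all samples in this sketch.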
Threshold values used during count matrix quality control.
thresholds:
  bad:
  cpm:
  cells:
-
bad
- Minimum number of mapped reads per cell. Cells below this value will be filtered from the count matrix.
- Example
bad: 50000
-
cpm
- Counts-per-million threshold for a gene's expression in a cell. Expression at or below this value is considered 'extremely low'.
- Example:
cpm: 0.5
-
cells
- The minimum number of cells that must exhibit greater than 'extremely low' expression for a gene to be kept. That is, assuming a threshold of 2, if 3 cells express gene A with counts-per-million greater than the threshold [thresholds][cpm], gene A will be kept, while if only 1 cell expresses gene B above this threshold, gene B will be filtered from the final count matrix.
- Example:
cells: 2
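How the three thresholds interact can be shown on a toy count matrix. The sketch below is one reading of the descriptions above, not scPipe's implementation. It assumes that a cell's mapped reads equal the sum of its gene counts, that counts-per-million are computed per cell after low-read cells are removed, and that a gene is kept when at least `cells` cells exceed the cpm threshold.

```python
def filter_counts(counts, bad, cpm, cells):
    """Filter a {cell: {gene: count}} matrix using the three thresholds."""
    # 1. Drop cells with fewer than `bad` total reads.
    kept = {c: g for c, g in counts.items() if sum(g.values()) >= bad}

    # 2. Count, per gene, the cells where expression exceeds `cpm`.
    expressed_in = {}
    for genes in kept.values():
        total = sum(genes.values())
        for gene, n in genes.items():
            if total and n / total * 1e6 > cpm:
                expressed_in[gene] = expressed_in.get(gene, 0) + 1

    # 3. Keep genes expressed above the threshold in at least `cells` cells.
    keep_genes = {g for g, n in expressed_in.items() if n >= cells}
    return {
        c: {g: n for g, n in genes.items() if g in keep_genes}
        for c, genes in kept.items()
    }


toy = {
    "c1": {"A": 60, "B": 50, "C": 0},
    "c2": {"A": 70, "B": 0, "C": 40},
    "c3": {"A": 80, "B": 0, "C": 30},
    "c4": {"A": 5, "B": 5, "C": 5},  # only 15 reads, below `bad`
}
# Toy thresholds; real values would be far larger (e.g. bad: 50000).
print(filter_counts(toy, bad=100, cpm=0.5, cells=2))
```

Here cell c4 is removed for having too few reads, gene B is removed because only one remaining cell expresses it, and genes A and C are kept.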