
Configuration File

Dakota Hawkins edited this page Aug 30, 2018 · 9 revisions

Introduction

scPipe uses SnakeMake as its workflow manager. SnakeMake generally requires one configuration file to configure runs; scPipe requires two. The first configuration file, config.yaml (found in the files subdirectory), points to a run-specific configuration file and specifies the desired output from the current run. This lets users keep separate configuration files for different runs without continually rewriting large configuration files or pointing SnakeMake to different files. The expected formats for this "master" configuration file and for the run-specific configuration file are detailed below.

Master Configuration File

The master configuration file for scPipe should be a .yaml file named config.yaml. It should be located in the files subdirectory at the top level of the scPipe directory. The file expects two entries:

  • config

    • File path pointing to a run-specific configuration file. Run-specific configuration files should also be .yaml files; because their location is user-specified, they can be located wherever you like.

    • Example:

      • config: "/home/dakota/scPipe/files/pilot_data_config.yaml"
  • output

    • Keyword denoting desired output from run. The keyword specifies how much of the workflow to execute before stopping. This is useful when you have multiple sets of fastq files from different datasets, wish to process/align them separately, and then combine the data at the count matrix level.

    • Example:

      • output: "count_matrix"
    • Keywords:

      • alignment: run qc and alignment steps. Final output will be a read count file for each individual sample.
      • count_matrix: combine the per-sample count files into a count matrix. Final output will be a .csv file of the matrix.
      • multiqc: run multiqc for qc, alignment, and quantification steps. Output is a directory containing multiqc output.
      • metadata: extract metadata, using user-provided regular expressions, from sample names in the count matrix.
      • combined: combine count matrices from different alignment runs.
      • filtered: quality controlled count matrix, with empty cells and lowly expressed genes filtered out.
      • normalized: count matrix with values normalized between samples with batch effects removed.
      • imputed: count matrix with imputed zero values. This is the last step of the pipeline, and will result in the same output as passing 'final'.
      • final: run through the entire pipeline.
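
As a rough illustration of the two entries, the sketch below checks a master configuration that has already been parsed into a dictionary. The function name `check_master_config` and the `OUTPUT_KEYWORDS` tuple are hypothetical and not part of scPipe; only the key names and keywords come from the list above.

```python
# Hypothetical helper, not part of scPipe: sanity-check a parsed master config.
OUTPUT_KEYWORDS = (
    "alignment", "count_matrix", "multiqc", "metadata", "combined",
    "filtered", "normalized", "imputed", "final",
)


def check_master_config(master):
    """Return a list of problems found in a parsed master config dict."""
    problems = []
    # Both entries described above are expected to be present.
    for key in ("config", "output"):
        if key not in master:
            problems.append(f"missing required entry: {key}")
    # The run-specific configuration file should be a .yaml file.
    run_config = master.get("config", "")
    if run_config and not str(run_config).endswith(".yaml"):
        problems.append("config should point to a .yaml file")
    # The output entry should be one of the documented keywords.
    output = master.get("output")
    if output is not None and output not in OUTPUT_KEYWORDS:
        problems.append(f"unknown output keyword: {output}")
    return problems


example = {
    "config": "/home/dakota/scPipe/files/pilot_data_config.yaml",
    "output": "count_matrix",
}
print(check_master_config(example))
```

An empty list means the two entries look plausible; it does not confirm that the run-specific file exists or is itself valid.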

Run-specific Configuration File

The run-specific configuration file contains largely dataset-dependent parameters required by many of the tools used within the pipeline. The file should be a .yaml file, though because the master configuration file points directly to it, you can place it anywhere on your computer. Many parameters are nested within groups to make the file less unwieldy.

dirs

The dirs entry in the run-specific configuration file lists all the directories required for scPipe to run.

dirs:
  data:
  output:
  log:
  genome:
  matrices:
  • data
    • Directory containing sample sub-directories with fastq files.
    • Example
      • data: "/home/dakota/scPipe/data/fastq"
  • output
    • Directory to hold output generated by scPipe.
    • Example
      • output: "/home/dakota/scPipe/output/"
  • log
    • Directory to hold log files generated by scPipe.
    • Example
      • log: "/home/dakota/scPipe/logs/"
  • genome
    • Directory that either holds, or will hold, genome index files generated by STAR.
    • Example
      • genome: "/home/dakota/data/star"
  • matrices
    • List of directories holding previously generated count matrices. Each directory is assumed to contain a single count matrix generated by scPipe with the name 'count_matrix.csv'.
    • Example
      • matrices: ["/home/dakota/scPipeOut/matrix"]
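
To make the naming assumption for `matrices` concrete, the sketch below lists the file that would be expected in each listed directory. The helper `expected_matrix_files` is hypothetical; only the `count_matrix.csv` file name and the example paths come from the text above.

```python
from pathlib import PurePosixPath


def expected_matrix_files(dirs):
    """List the count_matrix.csv path expected in each [dirs][matrices] entry."""
    # An empty or missing `matrices` key is treated as "no extra matrices".
    matrix_dirs = dirs.get("matrices") or []
    return [str(PurePosixPath(d) / "count_matrix.csv") for d in matrix_dirs]


# Values copied from the examples above.
dirs = {
    "data": "/home/dakota/scPipe/data/fastq",
    "output": "/home/dakota/scPipe/output/",
    "log": "/home/dakota/scPipe/logs/",
    "genome": "/home/dakota/data/star",
    "matrices": ["/home/dakota/scPipeOut/matrix"],
}
print(expected_matrix_files(dirs))
```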

files

Required files for scPipe.

files:
  gtf:
  genome_fasta:
  • gtf
    • Genome annotation file in .gtf format.
    • Example:
      • gtf: "/home/dakota/scPipeOut/files/annotations.gtf"
  • genome_fasta
    • Genome fasta file containing all genomic sequences.
    • Example:
      • genome_fasta: "/home/dakota/scPipeOut/files/genome.fa"

dataset

Dataset dependent parameters.

dataset:
  read_length:
  id:
  ends:
  • read_length
    • Length of reads. For paired-end reads, the value should be twice the sequence length of a single read, $2 \times |\text{seq length}|$, as found in a single sample .fastq file.
    • Example:
      • read_length: 152
  • id
    • Identification string for current dataset.
    • Example:
      • id: "2018-07-01"
  • ends
    • Nested key-value pairs that denote pair direction in fastq file names.
    • Example:
      • ends:
          R1: forward
          R2: backward
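
The `read_length` rule can be sketched as follows. Both helper names are hypothetical, and the toy record is invented for illustration; the only facts relied on are that the second line of a four-line FASTQ record holds the sequence and that paired-end data doubles the value.

```python
def first_read_length(fastq_lines):
    """Length of the first read: line 2 of a 4-line FASTQ record is the sequence."""
    return len(fastq_lines[1].strip())


def configured_read_length(seq_length, paired_end=True):
    """Value for [dataset][read_length]: doubled for paired-end reads."""
    return 2 * seq_length if paired_end else seq_length


# Toy FASTQ record (invented): header, sequence, separator, qualities.
record = ["@read1", "ACGTACGTAC", "+", "IIIIIIIIII"]
seq_len = first_read_length(record)
print(configured_read_length(seq_len))
```

Under this rule, the example value of 152 would correspond to paired-end reads of 76 bases each.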
        

params

Command-line parameters for the command-line tools used in scPipe (i.e. fastp, STAR, and featureCounts).

params:
  fastp: ''
  star_genome: ''
  star_align: ''
  featureCount: ''
  • fastp
    • Optional arguments to pass to fastp. See fastp documentation for specifics.
    • Example
      • fastp: '-y -Y 30'
  • star_genome
    • Optional arguments during genome indexing with STAR. See STAR documentation for specifics.
    • Example
      • star_genome: '--genomeChrBinNbits 13'
  • star_align
    • Optional arguments during read alignment with STAR. See STAR documentation for specifics.
    • Example
      • star_align: '--runThreadN 2'
  • featureCount
    • Optional arguments during read quantification with featureCounts. See Subread documentation for specifics.
    • Example
      • featureCount: '' (passing no additional arguments is fine)
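
Each `params` value is a single string of extra arguments. How scPipe assembles its commands internally is not described here, so the sketch below is only one plausible way such a string could be appended to a tool invocation. The helper `build_command` and the input and output file names are hypothetical.

```python
import shlex


def build_command(tool, base_args, extra):
    """Append a params string, split shell-style, to a base argument list."""
    # An empty string (or None) simply adds no extra arguments.
    return [tool, *base_args, *shlex.split(extra or "")]


params = {"fastp": "-y -Y 30", "featureCount": ""}
cmd = build_command(
    "fastp",
    ["-i", "sample.fastq", "-o", "sample.trimmed.fastq"],  # hypothetical files
    params["fastp"],
)
print(cmd)
```

Splitting with `shlex` keeps quoted values intact, which matters if an optional argument contains spaces.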

flags

Flags determining whether to run specific parts of the pipeline/calculate specific arguments.

flags:
  star_est_ChrBinsNBits:
  combine_data:
  • star_est_ChrBinsNBits
    • Whether to calculate the STAR genome-indexing parameter --genomeChrBinNbits automatically. If True, do not supply the argument yourself in the params section.
    • Example
      • star_est_ChrBinsNBits: True
  • combine_data
    • Whether to combine count matrices from separate runs with the count matrix generated by the current run. If yes, directories should be supplied under the [dirs][matrices] argument. Combined data will be saved under the expected /outputdir/matrix/count_matrix.csv file, with an extra /outputdir/matrix/original_count_matrix.csv being generated to hold un-combined data.
    • Example
      • combine_data: True
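
The rule that the bin-size argument should not be both estimated and supplied by hand can be checked mechanically. The sketch below assumes the run-specific file has been parsed into a dictionary; both function names are hypothetical. STAR's manual spells the option `--genomeChrBinNbits` and suggests scaling it as min(18, log2(max(GenomeLength / NumberOfReferences, ReadLength))). Whether scPipe uses exactly this formula, and the rounding down shown here, are assumptions.

```python
import math
import shlex


def chr_bin_conflict(run_config, flag="--genomeChrBinNbits"):
    """True if the flag is both auto-estimated and supplied in params."""
    flags = run_config.get("flags") or {}
    params = run_config.get("params") or {}
    estimate = bool(flags.get("star_est_ChrBinsNBits", False))
    star_args = shlex.split(params.get("star_genome") or "")
    return estimate and flag in star_args


def estimate_chr_bin_nbits(genome_length, n_references, read_length):
    """STAR manual's suggested scaling, rounded down (rounding is an assumption)."""
    scale = max(genome_length / n_references, read_length)
    return min(18, math.floor(math.log2(scale)))


run_config = {
    "flags": {"star_est_ChrBinsNBits": True},
    "params": {"star_genome": "--genomeChrBinNbits 13"},
}
print(chr_bin_conflict(run_config))
# Invented genome: 1 Gb spread over 100,000 contigs, 152-base reads.
print(estimate_chr_bin_nbits(1_000_000_000, 100_000, 152))
```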

sample_regex

Regular expressions necessary to extract sample-level information from file names.

sample_regex:
  id:
  <metadata regex>
    batch:
    <optional>
  • id
    • Regular expression that captures the end of a sample id. For example, a file named <sample_id>_8.fastq could use the regular expression '_[0-9].fastq$'. The sample id is then extracted by taking the substring from the start of the filename to the first matching position.
    • Example:
      • id: '_[0-9]*.$' (As of August 30, 2018, the single quotes are necessary for pyaml to correctly parse the regular expression.)
  • batch
    • Required key-value pairs of metadata regular expressions matching batch ids to sample ids. Assumes that if the pattern is present in the sample filename, the sample is from the matching batch.
    • Example
      • batch:
          1: '-1-'
          2: '-2-'
          3: '-3-'
        
  • Optional Metadata Regular Expressions
    • Following the key-value pattern, users can include any other regular expressions to extract sample-level information from filenames (e.g. treatment, time points, etc.). scPipe can handle an arbitrary number of entries, so long as the patterns exist within the filenames.
    • Example
      • treatment:
          ASW: '^ASW'
          Chlorate: '^Chlorate'
        timepoint:
          '18hpf': ''
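
The extraction rules in this section can be sketched with Python's `re` module. The two helpers and the file name `ASW-2-cellA_8.fastq` are hypothetical; the patterns are taken from the examples above, and scPipe's own implementation may differ in detail.

```python
import re


def sample_id(filename, id_regex):
    """Substring from the start of the filename to the first regex match."""
    match = re.search(id_regex, filename)
    # If the pattern never matches, fall back to the whole filename.
    return filename[:match.start()] if match else filename


def label_from_patterns(filename, patterns):
    """Return the first label whose pattern occurs in the filename, else None."""
    for label, pattern in patterns.items():
        if re.search(pattern, filename):
            return label
    return None


filename = "ASW-2-cellA_8.fastq"  # hypothetical file name
batch = {1: "-1-", 2: "-2-", 3: "-3-"}
treatment = {"ASW": "^ASW", "Chlorate": "^Chlorate"}

print(sample_id(filename, r"_[0-9].fastq$"))
print(label_from_patterns(filename, batch))
print(label_from_patterns(filename, treatment))
```

Returning the first matching label is a design choice made for this sketch; overlapping patterns would need an explicit rule.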
        

thresholds

Threshold values used during count matrix quality control.

thresholds:
  bad:
  cpm:
  cells:
  • bad
    • Minimum number of mapped reads per cell. Cells below this value will be filtered from the count matrix.
    • Example
      • bad: 50000
  • cpm
    • Expression level, in counts-per-million, at or below which a gene's expression in a cell is considered 'extremely low'.
    • Example:
      • cpm: 0.5
  • cells
    • The minimum number of cells that must exhibit greater than 'extremely low' expression for a gene to be kept. That is, assuming a threshold of 2, if 3 cells express gene A with counts-per-million greater than the threshold [thresholds][cpm], gene A will be kept. If only 1 cell expresses gene B above this threshold, gene B will be filtered from the final count matrix.
    • Example:
      • cells: 2
