Two bioinformatics workflow pipelines for NGS data processing: Snakemake workflow for demultiplexing and quality control of raw sequencing data, and Nextflow pipeline for single-cell RNA sequencing analysis.

sumeetg23/Workflows

Workflows

This repository contains two bioinformatics workflow pipelines for NGS data processing:

  1. Snakemake-CreateAndQCRawData - A Snakemake workflow for demultiplexing and quality control of raw sequencing data
  2. nextflow-single-cell-pipeline - A Nextflow pipeline for single-cell RNA sequencing analysis

📁 Project Overview

1. Snakemake-CreateAndQCRawData

A Snakemake-based workflow for the initial processing and quality control of raw NGS runs. This pipeline handles:

  • Demultiplexing: Converts BCL files to FASTQ format using bcl2fastq
  • Quality Control: Performs FastQC and FastQ Screen analysis on demultiplexed FASTQ files
  • Report Generation: Creates MultiQC reports summarizing QC metrics on a per-lane basis

Key Features:

  • Automated sample sheet validation
  • Per-lane QC reporting
  • SLURM cluster support with configurable resource allocation
  • Optional adapter trimming with cutadapt
  • Optional RSeQC analysis for RNA-seq data
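
The sample sheet validation listed above is not documented in detail here, so the following is only a minimal sketch of what such a check could look like. It assumes an Illumina-style CSV sample sheet with a [Data] section containing a Sample_ID column and an optional Lane column. The function names and the specific rules (allowed characters, no duplicate sample per lane) are illustrative assumptions, not the logic in scripts/helper.py.

```python
import csv
import io
import re


def read_data_section(text):
    """Return the rows of the [Data] section as a list of dicts."""
    lines = text.splitlines()
    start = None
    for i, line in enumerate(lines):
        # Section headers are often padded with trailing commas: "[Data],,,"
        if line.strip().rstrip(",").lower() == "[data]":
            start = i + 1
            break
    if start is None:
        raise ValueError("sample sheet has no [Data] section")
    body = []
    for line in lines[start:]:
        if line.startswith("["):  # next section begins
            break
        if line.strip(", \t"):  # skip blank or comma-only lines
            body.append(line)
    return list(csv.DictReader(io.StringIO("\n".join(body))))


def validate_sample_sheet(text):
    """Return a list of problems; an empty list means the sheet passed."""
    try:
        rows = read_data_section(text)
    except ValueError as err:
        return [str(err)]
    if not rows:
        return ["[Data] section contains no samples"]
    if "Sample_ID" not in rows[0]:
        return ["[Data] section has no Sample_ID column"]
    problems = []
    seen = set()
    for n, row in enumerate(rows, start=1):
        sample = (row.get("Sample_ID") or "").strip()
        lane = (row.get("Lane") or "").strip()
        if not sample:
            problems.append(f"row {n}: empty Sample_ID")
        elif not re.fullmatch(r"[A-Za-z0-9_-]+", sample):
            # Assumed rule: letters, digits, underscore, and hyphen only.
            problems.append(f"row {n}: invalid characters in Sample_ID {sample!r}")
        if (lane, sample) in seen:
            problems.append(f"row {n}: duplicate Sample_ID {sample!r} in lane {lane!r}")
        seen.add((lane, sample))
    return problems
```

Calling validate_sample_sheet(open(path).read()) before launching the workflow would surface these problems early, before bcl2fastq is started on a full run folder.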

Workflow Components:

  • bcl2fastq - BCL to FASTQ conversion
  • fastqc - Quality control analysis
  • fastqscreen - Contamination screening
  • multiqc - Aggregate QC reports

2. nextflow-single-cell-pipeline

A Nextflow-based pipeline for single-cell RNA sequencing data generated with the inDrop protocol, covering quality control through STARsolo alignment. This pipeline performs:

  • Quality Control: FastQC and FastQ Screen analysis
  • Read Filtering: Optional FASTQ filtering and preprocessing
  • Genome Indexing: STAR genome index generation
  • Alignment: STARsolo alignment for single-cell data
  • Report Generation: MultiQC reports for quality assessment

Key Features:

  • Containerized execution with Singularity
  • Handles multi-read single-cell data (R1/R2/R4)
  • Configurable filtering parameters
  • Automatic STAR index management
  • Comprehensive execution reports and DAG visualization

Pipeline Modules:

  • fastqc.nf - Quality control analysis
  • fastq_screen.nf - Contamination screening
  • filterfastq.nf - FASTQ filtering and preprocessing
  • multiqc.nf - Aggregate QC reports
  • star_index.nf - STAR genome indexing
  • star_solo.nf - STARsolo single-cell alignment

🚀 Quick Start

Snakemake Workflow

Navigate to the Snakemake directory and configure your run:

cd Snakemake-CreateAndQCRawData

Edit config.yaml to specify:

  • Sample sheet path
  • Run folder path
  • bcl2fastq parameters
  • Optional cutadapt and RSeQC settings

Run the workflow:

snakemake -s rules/bcl2fastq.smk \
  --cluster-config clusterTime.json \
  --cluster "sbatch --mem={cluster.mem} --cpus-per-task={cluster.cpus_per_task} --account={cluster.account} --partition={cluster.slurm_partition} --output %j.out --wrap" \
  --jobs 100 \
  --latency-wait 60
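
The {cluster.mem}-style placeholders in the command above are filled from clusterTime.json. With Snakemake's cluster-config mechanism, values under the __default__ key apply to every rule and a rule-specific entry overrides them. The sketch below mimics that lookup in plain Python to show what the resulting sbatch prefix looks like. The key names come from the command above; the values, the account name, and the rule name bcl2fastq are hypothetical placeholders.

```python
import json


def resolve_cluster_settings(cluster_config, rule_name):
    """Merge __default__ settings with the overrides for one rule."""
    settings = dict(cluster_config.get("__default__", {}))
    settings.update(cluster_config.get(rule_name, {}))
    return settings


def render_sbatch(settings):
    """Fill the same placeholders used in the --cluster string."""
    template = (
        "sbatch --mem={mem} --cpus-per-task={cpus_per_task} "
        "--account={account} --partition={slurm_partition}"
    )
    return template.format(**settings)


# Hypothetical clusterTime.json content; adjust to your cluster.
example = json.loads("""
{
  "__default__": {"mem": "4G", "cpus_per_task": 1,
                  "account": "my_account", "slurm_partition": "general"},
  "bcl2fastq":   {"mem": "32G", "cpus_per_task": 16}
}
""")

print(render_sbatch(resolve_cluster_settings(example, "bcl2fastq")))
```

A rule without its own entry falls back entirely to __default__, so every key referenced in the --cluster string should be defined there.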

Nextflow Pipeline

Navigate to the Nextflow directory:

cd nextflow-single-cell-pipeline

Edit nextflow.config to configure:

  • Input directory path
  • Output directory
  • Sample names
  • Reference genome paths (FASTA and GTF)
  • Resource requirements

Run the pipeline:

nextflow run main.nf -profile singularity

📋 Requirements

Snakemake Workflow

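  • Snakemake (≥5.19.3 and below 8; the --cluster and --cluster-config options used in Quick Start were removed in Snakemake 8)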
  • bcl2fastq
  • FastQC
  • FastQ Screen
  • MultiQC
  • SLURM (for cluster execution)
  • Python 3 with pandas

Nextflow Pipeline

  • Nextflow
  • Singularity (for containerized execution)
  • Reference genome files (FASTA and GTF)
  • STAR genome index (optional; the pipeline can generate it from the FASTA and GTF files)

📂 Directory Structure

Workflows/
├── Snakemake-CreateAndQCRawData/
│   ├── config.yaml              # Main configuration file
│   ├── clusterTime.json         # SLURM cluster configuration
│   ├── rules/
│   │   └── bcl2fastq.smk       # Main Snakemake workflow
│   ├── schemas/
│   │   └── config.schema.yaml  # Configuration schema validation
│   └── scripts/
│       ├── helper.py            # Helper functions
│       ├── PreProcess-Summary.sh # MultiQC summary script
│       └── versions.py         # Version tracking
│
└── nextflow-single-cell-pipeline/
    ├── main.nf                  # Main Nextflow pipeline
    ├── nextflow.config          # Nextflow configuration
    ├── modules/                 # Nextflow modules
    │   ├── fastqc.nf
    │   ├── fastq_screen.nf
    │   ├── filterfastq.nf
    │   ├── multiqc.nf
    │   ├── star_index.nf
    │   └── star_solo.nf
    ├── scripts/                 # Supporting scripts
    │   └── filterfastq.py
    ├── genotype/                # Genotyping-related files
    │   ├── plot.R
    │   └── workflow.sh
    └── make_whitelist.sh        # Whitelist generation script

⚙️ Configuration

Snakemake Configuration

Key parameters in config.yaml:

  • bcl2fastq: BCL conversion settings (sample sheet, run folder, threads)
  • cutadapt: Optional adapter trimming configuration
  • rseqc: Optional RSeQC analysis settings
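
As a rough illustration of how these sections might drive a run, the sketch below takes an already-parsed config (a plain dict) and lists the steps it implies. It assumes that bcl2fastq is the only required section and that the mere presence of a cutadapt or rseqc section switches the optional step on. Both are assumptions for illustration; the actual rules live in rules/bcl2fastq.smk and schemas/config.schema.yaml.

```python
# Steps named in this README that always run.
CORE_STEPS = ["bcl2fastq", "fastqc", "fastqscreen", "multiqc"]
# Optional steps, keyed by the config section assumed to enable them.
OPTIONAL_STEPS = ["cutadapt", "rseqc"]


def planned_steps(config):
    """Return the list of steps implied by a parsed config dict."""
    if not isinstance(config.get("bcl2fastq"), dict):
        raise ValueError("config is missing the required 'bcl2fastq' section")
    steps = list(CORE_STEPS)
    for name in OPTIONAL_STEPS:
        if config.get(name):  # absent or empty section means step is skipped
            steps.append(name)
    return steps


# Hypothetical parsed config; the nested key names are placeholders.
config = {
    "bcl2fastq": {"threads": 16},
    "cutadapt": {"adapter": "AGATCGGAAGAGC"},
}
print(planned_steps(config))
```

For real validation, Snakemake provides snakemake.utils.validate, which checks a config dict against a YAML or JSON schema file such as the one in schemas/.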

Nextflow Configuration

Key parameters in nextflow.config:

  • input_dir: Path to input FASTQ files
  • outdir: Output directory
  • samples: List of sample names
  • fa: Reference genome FASTA file
  • gtf: Annotation GTF file
  • run_filtering: Enable/disable FASTQ filtering
  • Resource settings for each process
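
Nextflow also lets scalar parameters be overridden on the command line with --name value, which is convenient for one-off runs that should not touch nextflow.config. The helper below is a small sketch that assembles such a command; the function name and the example paths are hypothetical, while the parameter names are the ones listed above.

```python
import shlex


def build_nextflow_command(params, profile="singularity", script="main.nf"):
    """Build an argv list that overrides pipeline params on the CLI."""
    argv = ["nextflow", "run", script, "-profile", profile]
    for name, value in params.items():
        if isinstance(value, bool):
            # Nextflow reads the strings true/false as booleans.
            value = "true" if value else "false"
        argv += [f"--{name}", str(value)]
    return argv


cmd = build_nextflow_command(
    {"input_dir": "/path/to/FASTQ", "outdir": "results", "run_filtering": False}
)
print(shlex.join(cmd))
```

List-valued settings such as samples are easier to keep in nextflow.config (or a params file) than to pass this way.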

🔄 Workflow Integration

These workflows are designed to work sequentially:

  1. Snakemake-CreateAndQCRawData: Processes raw BCL files → demultiplexed FASTQ files with QC
  2. nextflow-single-cell-pipeline: Takes the demultiplexed FASTQ files → QC, optional filtering, and STARsolo alignment

The output from the Snakemake workflow (demultiplexed FASTQ files) can serve as input to the Nextflow pipeline for downstream single-cell analysis.
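
One way to bridge the two steps is to derive the sample list for the Nextflow pipeline from the demultiplexed file names. The sketch below assumes the default bcl2fastq naming scheme (SampleName_S1_L001_R1_001.fastq.gz, i.e. lane splitting left on) and assumes that a sample is usable only when reads R1, R2, and R4 are all present. Whether the Nextflow pipeline expects exactly these file names is not stated here, so treat the pattern and the function names as illustrative.

```python
import re
from collections import defaultdict

# Default bcl2fastq output name: <sample>_S<n>_L<lane>_<read>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<index>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>[RI]\d)_001\.fastq\.gz$"
)


def group_reads_by_sample(filenames):
    """Map each sample name to the set of read labels found for it."""
    reads = defaultdict(set)
    for name in filenames:
        match = FASTQ_NAME.match(name)
        # Skip unrelated files and the Undetermined_* catch-all files.
        if match and match.group("sample") != "Undetermined":
            reads[match.group("sample")].add(match.group("read"))
    return dict(reads)


def samples_ready(filenames, required=("R1", "R2", "R4")):
    """Return sorted sample names that have every required read."""
    groups = group_reads_by_sample(filenames)
    return sorted(s for s, found in groups.items() if set(required) <= found)


files = [
    "sampleA_S1_L001_R1_001.fastq.gz",
    "sampleA_S1_L001_R2_001.fastq.gz",
    "sampleA_S1_L001_R4_001.fastq.gz",
    "sampleB_S2_L001_R1_001.fastq.gz",
    "Undetermined_S0_L001_R1_001.fastq.gz",
]
print(samples_ready(files))
```

In practice the file list could come from something like [p.name for p in pathlib.Path("FASTQ").rglob("*.fastq.gz")], and the result could be pasted into the samples parameter.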


📊 Output Files

Snakemake Workflow

  • Demultiplexed FASTQ files in FASTQ/ directory
  • FastQC reports in FASTQC/ directory
  • FastQ Screen reports in FASTQSCREEN/ directory
  • MultiQC reports in MULTIQC/ directory (per-lane and aggregate)

Nextflow Pipeline

  • Filtered FASTQ files (if filtering enabled)
  • STAR genome index (if generated)
  • STARsolo alignment outputs
  • MultiQC reports
  • Pipeline execution reports (timeline, trace, DAG) in pipeline_info/

🛠️ Troubleshooting

Snakemake

  • Ensure sample sheet format is correct (validated by helper scripts)
  • Check SLURM cluster configuration in clusterTime.json
  • Verify all required tools are available in the environment

Nextflow

  • Ensure Singularity is properly configured
  • Verify reference genome files are accessible
  • Check resource requirements match available cluster resources
  • Review execution reports in pipeline_info/ for detailed error information
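
When a run fails, the Nextflow trace report is usually the quickest place to look. It is a tab-separated file whose default columns include name, status, and exit. The sketch below pulls out the tasks that did not finish cleanly; the exact trace file name under pipeline_info/ depends on nextflow.config and is not assumed here.

```python
import csv
import io


def failed_tasks(trace_text):
    """Return (task name, exit code) for tasks that failed or were aborted."""
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    return [
        (row["name"], row["exit"])
        for row in reader
        if row["status"] in ("FAILED", "ABORTED")
    ]


# Tiny hand-written trace excerpt with hypothetical task names.
trace = (
    "task_id\tname\tstatus\texit\n"
    "1\tFASTQC (sampleA)\tCOMPLETED\t0\n"
    "2\tSTAR_SOLO (sampleA)\tFAILED\t137\n"
)
print(failed_tasks(trace))
```

To use it on a real run, pass the contents of the trace file, for example failed_tasks(open(trace_path).read()), and then inspect the work directory of each reported task.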

📝 Notes

  • Both workflows support cluster execution (SLURM)
  • Containerization (Singularity) is recommended for reproducibility
  • Configuration files contain example paths that should be updated for your environment
  • Sample sheets and reference files must be provided by the user

📄 License

[Add your license information here]


👤 Author

Sumeet Gupta
