Workflows

This repository contains two bioinformatics workflow pipelines for NGS data processing:

Snakemake-CreateAndQCRawData - A Snakemake workflow for demultiplexing and quality control of raw sequencing data
nextflow-single-cell-pipeline - A Nextflow pipeline for single-cell RNA sequencing analysis

📁 Project Overview

1. Snakemake-CreateAndQCRawData

A Snakemake-based workflow for the initial processing and quality control of raw NGS runs. This pipeline handles:

Demultiplexing: Converts BCL files to FASTQ format using bcl2fastq
Quality Control: Performs FastQC and FastQ Screen analysis on demultiplexed FASTQ files
Report Generation: Creates MultiQC reports summarizing QC metrics on a per-lane basis

Key Features:

Automated sample sheet validation
Per-lane QC reporting
SLURM cluster support with configurable resource allocation
Optional adapter trimming with cutadapt
Optional RSeQC analysis for RNA-seq data

Workflow Components:

bcl2fastq - BCL to FASTQ conversion
fastqc - Quality control analysis
fastqscreen - Contamination screening
multiqc - Aggregate QC reports
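
The automated sample sheet validation listed under Key Features is not shown in this README, so the following is only a minimal sketch of what such a check might look like, not the logic in scripts/helper.py. It assumes a bcl2fastq-style sample sheet with a [Data] section and the standard Sample_ID, index, and optional Lane columns; the function names and the example sheet are hypothetical.

```python
import csv
import io

# Hypothetical sample sheet used only to illustrate the checks below.
EXAMPLE_SHEET = """[Header]
Date,2024-01-01
[Data]
Lane,Sample_ID,index
1,A,ACGT
1,A,TTGA
1,B,
"""


def read_data_section(text):
    """Return the rows of the [Data] section of a sample sheet as dicts."""
    lines = text.splitlines()
    # The section marker may carry trailing commas when exported from a spreadsheet.
    start = next(
        (i for i, line in enumerate(lines)
         if line.strip().strip(",").lower() == "[data]"),
        None,
    )
    if start is None:
        raise ValueError("sample sheet has no [Data] section")
    # Drop blank lines and lines that consist only of commas.
    body = [line for line in lines[start + 1:] if line.strip().strip(",")]
    return list(csv.DictReader(io.StringIO("\n".join(body))))


def validate_rows(rows, required=("Sample_ID", "index")):
    """Collect readable problems instead of stopping at the first one."""
    if not rows:
        return ["[Data] section contains no samples"]
    missing = [col for col in required if col not in rows[0]]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]
    problems = []
    seen = set()
    for row in rows:
        lane = row.get("Lane") or ""
        key = (lane, row["Sample_ID"])
        if key in seen:
            problems.append(
                f"duplicate Sample_ID {row['Sample_ID']!r} in lane {lane or 'all'}"
            )
        seen.add(key)
        if not row["index"]:
            problems.append(f"sample {row['Sample_ID']!r} has no index sequence")
    return problems


if __name__ == "__main__":
    for problem in validate_rows(read_data_section(EXAMPLE_SHEET)):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets a run report every issue in the sheet at once, which is usually more convenient before a long demultiplexing job.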

2. nextflow-single-cell-pipeline

A Nextflow-based pipeline for comprehensive single-cell RNA sequencing analysis, specifically designed for inDrop protocol data. This pipeline performs:

Quality Control: FastQC and FastQ Screen analysis
Read Filtering: Optional FASTQ filtering and preprocessing
Genome Indexing: STAR genome index generation
Alignment: STARsolo alignment for single-cell data
Report Generation: MultiQC reports for quality assessment

Key Features:

Containerized execution with Singularity
Handles multi-read single-cell data (R1/R2/R4)
Configurable filtering parameters
Automatic STAR index management
Comprehensive execution reports and DAG visualization

Pipeline Modules:

fastqc.nf - Quality control analysis
fastq_screen.nf - Contamination screening
filterfastq.nf - FASTQ filtering and preprocessing
multiqc.nf - Aggregate QC reports
star_index.nf - STAR genome indexing
star_solo.nf - STARsolo single-cell alignment
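
Because the pipeline expects multi-read data (R1/R2/R4), it can help to confirm that every sample has all three read files before launching a run. The sketch below is one possible pre-flight check, not part of the pipeline itself. The file naming pattern `{sample}_{read}.fastq.gz` and the function name are assumptions; adjust the pattern to match the files in your input directory.

```python
from pathlib import Path

# Reads the README says the pipeline handles for each sample.
REQUIRED_READS = ("R1", "R2", "R4")


def missing_reads(input_dir, samples, pattern="{sample}_{read}.fastq.gz"):
    """Map each sample to the required reads whose file is absent from input_dir.

    The default pattern is a hypothetical naming scheme, not one confirmed
    by the pipeline.
    """
    input_dir = Path(input_dir)
    missing = {}
    for sample in samples:
        absent = [
            read
            for read in REQUIRED_READS
            if not (input_dir / pattern.format(sample=sample, read=read)).is_file()
        ]
        if absent:
            missing[sample] = absent
    return missing
```

An empty dictionary means every listed sample has a complete read set under the assumed naming pattern.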

🚀 Quick Start

Snakemake Workflow

Navigate to the Snakemake directory and configure your run:

cd Snakemake-CreateAndQCRawData

Edit config.yaml to specify:

Sample sheet path
Run folder path
BCL2FASTQ parameters
Optional cutadapt and RSeQC settings

Run the workflow:

snakemake -s rules/bcl2fastq.smk \
  --cluster-config clusterTime.json \
  --cluster "sbatch --mem={cluster.mem} --cpus-per-task={cluster.cpus_per_task} --account={cluster.account} --partition={cluster.slurm_partition} --output %j.out --wrap" \
  --jobs 100 \
  --latency-wait 60
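
The {cluster.*} placeholders in the command above are filled from clusterTime.json: with --cluster-config, Snakemake starts from the __default__ entry and overrides it with any entry named after the rule being submitted. The sketch below imitates that lookup in plain Python to make the behaviour concrete. The JSON contents, the account and partition values, and the rule name are all hypothetical; the real clusterTime.json may differ.

```python
import json
from types import SimpleNamespace

# Hypothetical stand-in for clusterTime.json.
CLUSTER_CONFIG = json.loads("""
{
  "__default__": {"mem": "8G", "cpus_per_task": 1,
                  "account": "my_account", "slurm_partition": "general"},
  "bcl2fastq":   {"mem": "64G", "cpus_per_task": 16}
}
""")

# Same placeholders as the --cluster string in the command above.
SBATCH_TEMPLATE = (
    "sbatch --mem={cluster.mem} --cpus-per-task={cluster.cpus_per_task} "
    "--account={cluster.account} --partition={cluster.slurm_partition}"
)


def resolve(rule_name, cluster_config=CLUSTER_CONFIG, template=SBATCH_TEMPLATE):
    """Fill the template for one rule: rule-specific values override __default__."""
    settings = dict(cluster_config.get("__default__", {}))
    settings.update(cluster_config.get(rule_name, {}))
    # str.format supports attribute access, so a namespace mimics {cluster.mem}.
    return template.format(cluster=SimpleNamespace(**settings))


if __name__ == "__main__":
    print(resolve("bcl2fastq"))
    print(resolve("fastqc"))
```

A rule without its own entry falls back entirely to __default__, so every placeholder used in the --cluster string needs a default value.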

Nextflow Pipeline

Navigate to the Nextflow directory:

cd nextflow-single-cell-pipeline

Edit nextflow.config to configure:

Input directory path
Output directory
Sample names
Reference genome paths (FASTA and GTF)
Resource requirements

Run the pipeline:

nextflow run main.nf -profile singularity

📋 Requirements

Snakemake Workflow

Snakemake (≥5.19.3)
bcl2fastq
FastQC
FastQ Screen
MultiQC
SLURM (for cluster execution)
Python 3 with pandas

Nextflow Pipeline

Nextflow
Singularity (for containerized execution)
Reference genome files (FASTA and GTF)
STAR genome index (optional - can be generated by pipeline)

📂 Directory Structure

Workflows/
├── Snakemake-CreateAndQCRawData/
│   ├── config.yaml              # Main configuration file
│   ├── clusterTime.json         # SLURM cluster configuration
│   ├── rules/
│   │   └── bcl2fastq.smk       # Main Snakemake workflow
│   ├── schemas/
│   │   └── config.schema.yaml  # Configuration schema validation
│   └── scripts/
│       ├── helper.py            # Helper functions
│       ├── PreProcess-Summary.sh # MultiQC summary script
│       └── versions.py         # Version tracking
│
└── nextflow-single-cell-pipeline/
    ├── main.nf                  # Main Nextflow pipeline
    ├── nextflow.config          # Nextflow configuration
    ├── modules/                 # Nextflow modules
    │   ├── fastqc.nf
    │   ├── fastq_screen.nf
    │   ├── filterfastq.nf
    │   ├── multiqc.nf
    │   ├── star_index.nf
    │   └── star_solo.nf
    ├── scripts/                 # Supporting scripts
    │   └── filterfastq.py
    ├── genotype/                # Genotyping-related files
    │   ├── plot.R
    │   └── workflow.sh
    └── make_whitelist.sh        # Whitelist generation script

⚙️ Configuration

Snakemake Configuration

Key parameters in config.yaml:

bcl2fastq: BCL conversion settings (sample sheet, run folder, threads)
cutadapt: Optional adapter trimming configuration
rseqc: Optional RSeQC analysis settings

Nextflow Configuration

Key parameters in nextflow.config:

input_dir: Path to input FASTQ files
outdir: Output directory
samples: List of sample names
fa: Reference genome FASTA file
gtf: Annotation GTF file
run_filtering: Enable/disable FASTQ filtering
Resource settings for each process

🔄 Workflow Integration

These workflows are designed to work sequentially:

Snakemake-CreateAndQCRawData: Processes raw BCL files → demultiplexed FASTQ files with QC
nextflow-single-cell-pipeline: Takes processed FASTQ files → single-cell analysis with alignment

The demultiplexed FASTQ files that the Snakemake workflow writes to its FASTQ/ directory can serve as input to the Nextflow pipeline: set input_dir in nextflow.config to their location.
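
When handing demultiplexed files to the Nextflow pipeline, the sample names for its configuration can be derived from the FASTQ file names. The sketch below assumes bcl2fastq's default naming (`<sample>_S<number>_L<lane>_<read>_001.fastq.gz`); it is not taken from either workflow, the function name is hypothetical, and the README does not say whether the Nextflow pipeline expects this naming, so files may need renaming or filtering first.

```python
import re
from collections import defaultdict

# bcl2fastq default output naming, e.g. SampleA_S1_L001_R1_001.fastq.gz.
# Runs made with --no-lane-splitting omit the _L<lane> part and will not match.
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<number>\d+)_L(?P<lane>\d{3})_(?P<read>[RI]\d)_001\.fastq\.gz$"
)


def samples_by_lane(filenames):
    """Return {sample: sorted lane numbers}, skipping Undetermined reads."""
    found = defaultdict(set)
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match is None or match["sample"] == "Undetermined":
            continue
        found[match["sample"]].add(int(match["lane"]))
    return {sample: sorted(lanes) for sample, lanes in sorted(found.items())}


if __name__ == "__main__":
    names = [
        "SampleA_S1_L001_R1_001.fastq.gz",
        "SampleA_S1_L002_R1_001.fastq.gz",
        "SampleB_S2_L001_R1_001.fastq.gz",
        "Undetermined_S0_L001_R1_001.fastq.gz",
    ]
    print(samples_by_lane(names))
```

The keys of the returned dictionary are candidates for the samples parameter, and the lane lists show which samples were split across lanes.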

📊 Output Files

Snakemake Workflow

Demultiplexed FASTQ files in FASTQ/ directory
FastQC reports in FASTQC/ directory
FastQ Screen reports in FASTQSCREEN/ directory
MultiQC reports in MULTIQC/ directory (per-lane and aggregate)

Nextflow Pipeline

Filtered FASTQ files (if filtering enabled)
STAR genome index (if generated)
STARsolo alignment outputs
MultiQC reports
Pipeline execution reports (timeline, trace, DAG) in pipeline_info/

🛠️ Troubleshooting

Snakemake

Ensure sample sheet format is correct (validated by helper scripts)
Check SLURM cluster configuration in clusterTime.json
Verify all required tools are available in the environment

Nextflow

Ensure Singularity is properly configured
Verify reference genome files are accessible
Check resource requirements match available cluster resources
Review execution reports in pipeline_info/ for detailed error information
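
For the last point, the Nextflow trace report is a tab-separated file whose default columns include name, status, and exit, so failed tasks can be listed with a few lines of Python. This is a minimal sketch, assuming the default trace columns are enabled; the function name and the trace file path are hypothetical, since the exact file name under pipeline_info/ depends on nextflow.config.

```python
import csv


def failed_tasks(trace_path):
    """Return (name, status, exit) for every task that did not finish cleanly."""
    with open(trace_path, newline="") as handle:
        rows = csv.DictReader(handle, delimiter="\t")
        return [
            (row["name"], row["status"], row["exit"])
            for row in rows
            # COMPLETED and CACHED are the two successful outcomes.
            if row["status"] not in ("COMPLETED", "CACHED")
        ]


if __name__ == "__main__":
    # Hypothetical path; use the trace file your run actually produced.
    for name, status, exit_code in failed_tasks("pipeline_info/trace.txt"):
        print(f"{name}: {status} (exit {exit_code})")
```

The exit code often points at the cause; for example, 137 usually indicates that a task was killed, which commonly happens when it exceeds its memory allocation.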

📝 Notes

Both workflows support cluster execution (SLURM)
Containerization (Singularity) is recommended for reproducibility
Configuration files contain example paths that should be updated for your environment
Sample sheets and reference files must be provided by the user

📄 License

[Add your license information here]

👤 Author

Sumeet Gupta
