This repository contains two bioinformatics workflow pipelines for NGS data processing:
- Snakemake-CreateAndQCRawData - A Snakemake workflow for demultiplexing and quality control of raw sequencing data
- nextflow-single-cell-pipeline - A Nextflow pipeline for single-cell RNA sequencing analysis
Snakemake-CreateAndQCRawData is a Snakemake-based workflow for the initial processing and quality control of raw NGS sequencing runs. This pipeline handles:
- Demultiplexing: Converts BCL files to FASTQ format using bcl2fastq
- Quality Control: Performs FastQC and FastQ Screen analysis on demultiplexed FASTQ files
- Report Generation: Creates MultiQC reports summarizing QC metrics on a per-lane basis
Key Features:
- Automated sample sheet validation
- Per-lane QC reporting
- SLURM cluster support with configurable resource allocation
- Optional adapter trimming with cutadapt
- Optional RSeQC analysis for RNA-seq data
Workflow Components:
- bcl2fastq - BCL to FASTQ conversion
- fastqc - Quality control analysis
- fastqscreen - Contamination screening
- multiqc - Aggregate QC reports
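The automated sample sheet validation listed above can be pictured with a small standalone check. This is a minimal sketch, not the logic in scripts/helper.py: the required column names (`Sample_ID`, `index`), the optional `Lane` column, and the function name `validate_sample_sheet` are assumptions chosen for illustration.

```python
import csv
import io

# Assumed minimum set of columns; adjust to match your own sample sheets.
REQUIRED_COLUMNS = {"Sample_ID", "index"}


def validate_sample_sheet(text):
    """Return a list of problems found in the [Data] section of a sample sheet."""
    lines = text.splitlines()
    # Sample rows live below the [Data] section header.
    start = None
    for i, line in enumerate(lines):
        if line.strip().startswith("[Data]"):
            start = i
            break
    if start is None:
        return ["missing [Data] section"]

    reader = csv.DictReader(io.StringIO("\n".join(lines[start + 1:])))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return ["missing columns: " + ", ".join(sorted(missing))]

    problems = []
    seen = set()
    for row_number, row in enumerate(reader, start=1):
        sample_id = (row.get("Sample_ID") or "").strip()
        lane = (row.get("Lane") or "").strip()  # empty when there is no Lane column
        if not sample_id:
            problems.append(f"row {row_number}: empty Sample_ID")
        elif (lane, sample_id) in seen:
            problems.append(f"row {row_number}: duplicate Sample_ID {sample_id!r}")
        seen.add((lane, sample_id))
    return problems
```

An empty list means the sheet passed these particular checks. A sample ID repeated within the same lane is reported, while the same ID on different lanes is accepted.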
nextflow-single-cell-pipeline is a Nextflow-based pipeline for comprehensive single-cell RNA sequencing analysis, designed specifically for inDrop protocol data. This pipeline performs:
- Quality Control: FastQC and FastQ Screen analysis
- Read Filtering: Optional FASTQ filtering and preprocessing
- Genome Indexing: STAR genome index generation
- Alignment: STARsolo alignment for single-cell data
- Report Generation: MultiQC reports for quality assessment
Key Features:
- Containerized execution with Singularity
- Handles multi-read single-cell data (R1/R2/R4)
- Configurable filtering parameters
- Automatic STAR index management
- Comprehensive execution reports and DAG visualization
Pipeline Modules:
- fastqc.nf - Quality control analysis
- fastq_screen.nf - Contamination screening
- filterfastq.nf - FASTQ filtering and preprocessing
- multiqc.nf - Aggregate QC reports
- star_index.nf - STAR genome indexing
- star_solo.nf - STARsolo single-cell alignment
Navigate to the Snakemake directory and configure your run:
cd Snakemake-CreateAndQCRawData
Edit config.yaml to specify:
- Sample sheet path
- Run folder path
- BCL2FASTQ parameters
- Optional cutadapt and RSeQC settings
Run the workflow:
snakemake -s rules/bcl2fastq.smk \
--cluster-config clusterTime.json \
--cluster "sbatch --mem={cluster.mem} --cpus-per-task={cluster.cpus_per_task} --account={cluster.account} --partition={cluster.slurm_partition} --output %j.out --wrap" \
--jobs 100 \
--latency-wait 60
Navigate to the Nextflow directory:
cd nextflow-single-cell-pipeline
Edit nextflow.config to configure:
- Input directory path
- Output directory
- Sample names
- Reference genome paths (FASTA and GTF)
- Resource requirements
Run the pipeline:
nextflow run main.nf -profile singularity
Requirements for the Snakemake workflow:
- Snakemake (≥5.19.3)
- bcl2fastq
- FastQC
- FastQ Screen
- MultiQC
- SLURM (for cluster execution)
- Python 3 with pandas
Requirements for the Nextflow pipeline:
- Nextflow
- Singularity (for containerized execution)
- Reference genome files (FASTA and GTF)
- STAR genome index (optional - can be generated by pipeline)
Workflows/
├── Snakemake-CreateAndQCRawData/
│   ├── config.yaml                # Main configuration file
│   ├── clusterTime.json           # SLURM cluster configuration
│   ├── rules/
│   │   └── bcl2fastq.smk          # Main Snakemake workflow
│   ├── schemas/
│   │   └── config.schema.yaml     # Configuration schema validation
│   └── scripts/
│       ├── helper.py              # Helper functions
│       ├── PreProcess-Summary.sh  # MultiQC summary script
│       └── versions.py            # Version tracking
│
└── nextflow-single-cell-pipeline/
    ├── main.nf                    # Main Nextflow pipeline
    ├── nextflow.config            # Nextflow configuration
    ├── modules/                   # Nextflow modules
    │   ├── fastqc.nf
    │   ├── fastq_screen.nf
    │   ├── filterfastq.nf
    │   ├── multiqc.nf
    │   ├── star_index.nf
    │   └── star_solo.nf
    ├── scripts/                   # Supporting scripts
    │   └── filterfastq.py
    ├── genotype/                  # Genotyping-related files
    │   ├── plot.R
    │   └── workflow.sh
    └── make_whitelist.sh          # Whitelist generation script
Key parameters in config.yaml:
- bcl2fastq: BCL conversion settings (sample sheet, run folder, threads)
- cutadapt: Optional adapter trimming configuration
- rseqc: Optional RSeQC analysis settings
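Alongside config.yaml, the `{cluster.*}` placeholders in the `--cluster` string of the run command are filled from clusterTime.json. In Snakemake's cluster-config format, a `__default__` entry applies to every rule and rule-specific entries override it. The sketch below mimics that merge in plain Python so the resulting `sbatch` line can be previewed. The JSON values are hypothetical (the real file may use different keys and numbers), and `resolve_cluster_settings` and `render_sbatch` are illustrative names, not part of Snakemake.

```python
import json
from types import SimpleNamespace

# Same placeholders as the --cluster string in the run command.
CLUSTER_TEMPLATE = (
    "sbatch --mem={cluster.mem} --cpus-per-task={cluster.cpus_per_task} "
    "--account={cluster.account} --partition={cluster.slurm_partition}"
)


def resolve_cluster_settings(cluster_config, rule_name):
    """Start from the '__default__' entry, then apply rule-specific overrides."""
    settings = dict(cluster_config.get("__default__", {}))
    settings.update(cluster_config.get(rule_name, {}))
    return settings


def render_sbatch(cluster_config, rule_name):
    """Fill the template the way the {cluster.<key>} placeholders would be filled."""
    settings = resolve_cluster_settings(cluster_config, rule_name)
    return CLUSTER_TEMPLATE.format(cluster=SimpleNamespace(**settings))


# Hypothetical clusterTime.json content, for illustration only.
example = json.loads("""
{
  "__default__": {"mem": "4G", "cpus_per_task": 1,
                  "account": "lab", "slurm_partition": "general"},
  "bcl2fastq": {"mem": "32G", "cpus_per_task": 8}
}
""")

print(render_sbatch(example, "bcl2fastq"))
# sbatch --mem=32G --cpus-per-task=8 --account=lab --partition=general
```

Previewing the rendered command for each rule is a quick way to catch a missing key before submitting jobs to the cluster.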
Key parameters in nextflow.config:
- input_dir: Path to input FASTQ files
- outdir: Output directory
- samples: List of sample names
- fa: Reference genome FASTA file
- gtf: Annotation GTF file
- run_filtering: Enable/disable FASTQ filtering
- Resource settings for each process
These workflows are designed to work sequentially:
- Snakemake-CreateAndQCRawData: Processes raw BCL files → demultiplexed FASTQ files with QC
- nextflow-single-cell-pipeline: Takes processed FASTQ files → single-cell analysis with alignment
The output from the Snakemake workflow (demultiplexed FASTQ files) can serve as input to the Nextflow pipeline for downstream single-cell analysis.
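When handing FASTQ files from one workflow to the other, it can help to confirm that every sample has the reads the single-cell pipeline expects. The sketch below parses bcl2fastq's default file naming (`<sample>_S<n>_L<lane>_<read>_001.fastq.gz`) and reports samples that lack any of R1, R2, or R4. It assumes the Nextflow pipeline consumes files with exactly these names and that lane splitting is enabled. The function names are hypothetical.

```python
import re
from collections import defaultdict

# bcl2fastq default naming: <sample>_S<number>_L<lane>_<read>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<number>\d+)_L(?P<lane>\d{3})_(?P<read>[RI]\d)_001\.fastq\.gz$"
)


def group_reads_by_sample(filenames):
    """Map each sample name to the set of read labels (R1, R2, ...) found for it."""
    reads = defaultdict(set)
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match:  # files that do not follow the naming scheme are ignored
            reads[match["sample"]].add(match["read"])
    return dict(reads)


def samples_missing_reads(filenames, expected=("R1", "R2", "R4")):
    """Return {sample: [missing read labels]} for samples lacking an expected read."""
    missing = {}
    for sample, found in group_reads_by_sample(filenames).items():
        absent = sorted(set(expected) - found)
        if absent:
            missing[sample] = absent
    return missing
```

The keys returned by `group_reads_by_sample` are also a convenient starting point for the `samples` list in nextflow.config.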
Snakemake workflow outputs:
- Demultiplexed FASTQ files in the FASTQ/ directory
- FastQC reports in the FASTQC/ directory
- FastQ Screen reports in the FASTQSCREEN/ directory
- MultiQC reports in the MULTIQC/ directory (per-lane and aggregate)
Nextflow pipeline outputs:
- Filtered FASTQ files (if filtering enabled)
- STAR genome index (if generated)
- STARsolo alignment outputs
- MultiQC reports
- Pipeline execution reports (timeline, trace, DAG) in pipeline_info/
Troubleshooting the Snakemake workflow:
- Ensure the sample sheet format is correct (validated by helper scripts)
- Check the SLURM cluster configuration in clusterTime.json
- Verify all required tools are available in the environment
Troubleshooting the Nextflow pipeline:
- Ensure Singularity is properly configured
- Verify reference genome files are accessible
- Check that resource requirements match available cluster resources
- Review execution reports in pipeline_info/ for detailed error information
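One way to review the execution reports is to scan the Nextflow trace file for tasks that did not finish. The trace is a tab-separated table whose default columns include `name`, `status`, and `exit`. The sketch below lists every task whose status is neither `COMPLETED` nor `CACHED`. The trace file's exact name inside pipeline_info/ depends on nextflow.config, so the function takes the file contents rather than a path. The name `unfinished_tasks` is hypothetical.

```python
import csv
import io


def unfinished_tasks(trace_text):
    """Return (name, status, exit) for tasks that are not COMPLETED or CACHED."""
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    problems = []
    for row in reader:
        status = row.get("status") or ""
        if status not in ("COMPLETED", "CACHED"):
            problems.append((row.get("name") or "", status, row.get("exit") or ""))
    return problems
```

To use it, read the trace file from pipeline_info/ into a string and pass it to `unfinished_tasks`. Each returned tuple names a task to look up in the Nextflow work directory.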
- Both workflows support cluster execution (SLURM)
- Containerization (Singularity) is recommended for reproducibility
- Configuration files contain example paths that should be updated for your environment
- Sample sheets and reference files must be provided by the user
[Add your license information here]
Sumeet Gupta