This repository contains a modular Snakemake pipeline for single-cell RNA sequencing (scRNA-seq) data analysis. It automates quality control, integration, cell type annotation (using CellTypist), and marker gene identification, ensuring reproducibility and scalability for large datasets.
The workflow performs the following steps:
- Quality control (QC) per sample
  - Generates cleaned .h5ad files
  - Creates UMAP plots
  - Exports per-cluster marker genes
- Integration of all cleaned samples into a single dataset using Harmony
- Cell type annotation using CellTypist
- Marker gene ranking for integrated clusters
Expected project layout:
project/
├── Snakefile
├── samples.csv         # metadata with sample IDs and paths
├── scripts/            # analysis scripts called by Snakemake
│   ├── 02_qc.py
│   ├── 03_integration.py
│   ├── 04_celltyping.py
│   └── 05_markers.py
├── environments/       # conda environments
│   └── sc.yml
├── data/               # optional: raw input files (.h5ad)
├── 02_output/          # cleaned per-sample data
├── 03_output/          # integrated dataset
├── 04_output/          # annotated data + markers
└── qc/                 # QC plots and markers per sample
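As a rough illustration of how this layout ties the workflow steps together, the sketch below lists which script would produce which output. Only the script and directory names come from the layout above. The function name `plan_jobs`, the file names inside each directory (for example `S01_clean.h5ad` and `integrated.h5ad`), and the choice to feed the annotated dataset into the marker step are hypothetical assumptions, not taken from the actual Snakefile.

```python
def plan_jobs(samples):
    """Return (script, inputs, output) tuples in dependency order.

    `samples` maps sample_id -> path of the Cell Ranger .h5 file.
    All output file names below are hypothetical placeholders.
    """
    jobs = []
    cleaned = []
    # Step 1: QC runs once per sample and writes one cleaned file each.
    for sample_id, h5_path in samples.items():
        out = f"02_output/{sample_id}_clean.h5ad"
        jobs.append(("scripts/02_qc.py", [h5_path], out))
        cleaned.append(out)
    # Step 2: integration consumes every cleaned sample at once.
    integrated = "03_output/integrated.h5ad"
    jobs.append(("scripts/03_integration.py", cleaned, integrated))
    # Step 3: annotation works on the integrated dataset.
    annotated = "04_output/annotated.h5ad"
    jobs.append(("scripts/04_celltyping.py", [integrated], annotated))
    # Step 4: marker ranking; reading the annotated file is an assumption.
    markers = "04_output/markers.csv"
    jobs.append(("scripts/05_markers.py", [annotated], markers))
    return jobs


if __name__ == "__main__":
    demo = {"S01": "/path/to/S01_filtered.h5", "S02": "/data/run_A/S02_filtered.h5"}
    for script, inputs, output in plan_jobs(demo):
        print(f"{script}: {inputs} -> {output}")
```

The point of the sketch is the shape of the dependency graph: the QC step fans out over samples, while integration, annotation, and marker ranking each run once on the combined data.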
The pipeline requires a samples.csv file with at least two columns:
- sample_id: short name for the sample (used in file naming)
- filtered_matrix_path: path to the raw/filtered .h5 file (output from Cell Ranger)
Example:
sample_id,filtered_matrix_path
S01,/path/to/S01_filtered.h5
S02,/data/run_A/S02_filtered.h5
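Before launching a run it can help to check that samples.csv has the expected shape. The following is a minimal sketch using only the Python standard library. The helper name `load_samples` and the specific checks (required columns, non-empty values, unique sample IDs) are illustrative assumptions and are not part of the pipeline itself.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample_id", "filtered_matrix_path")


def load_samples(handle):
    """Parse an open samples.csv into {sample_id: filtered_matrix_path}."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"samples.csv is missing column(s): {', '.join(missing)}")
    samples = {}
    for row in reader:
        # Short rows yield None for absent fields, so fall back to "".
        sample_id = (row["sample_id"] or "").strip()
        path = (row["filtered_matrix_path"] or "").strip()
        if not sample_id or not path:
            raise ValueError(f"empty value near line {reader.line_num}")
        if sample_id in samples:
            raise ValueError(f"duplicate sample_id: {sample_id}")
        samples[sample_id] = path
    return samples


# Same content as the example file above, supplied in memory for the demo.
example = io.StringIO(
    "sample_id,filtered_matrix_path\n"
    "S01,/path/to/S01_filtered.h5\n"
    "S02,/data/run_A/S02_filtered.h5\n"
)
samples = load_samples(example)
print(samples)
```

For a real file, pass `open("samples.csv", newline="")` instead of the in-memory example. Extra columns are ignored, which matches the "at least two columns" requirement.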
1. Set up a conda environment and install Snakemake
conda install bioconda::snakemake
2. Run the pipeline
snakemake --use-conda --cores 4
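If the run is launched from a script rather than typed by hand, the same command can be assembled programmatically. The sketch below is one possible wrapper, not part of this repository: the helper names `snakemake_command` and `run` are hypothetical, and the optional `--dry-run` flag is Snakemake's standard way to list planned jobs without executing them.

```python
import shlex
import subprocess


def snakemake_command(cores=4, dry_run=False):
    """Build the argument list for the run command shown above."""
    cmd = ["snakemake", "--use-conda", "--cores", str(cores)]
    if dry_run:
        # Only report what would be executed; nothing is run.
        cmd.append("--dry-run")
    return cmd


def run(cmd):
    """Execute the command in the current directory; raise on failure."""
    subprocess.run(cmd, check=True)


# Print the dry-run variant so it can be inspected or copied into a shell.
print(shlex.join(snakemake_command(dry_run=True)))
```

Calling `run(snakemake_command())` from the project root would then start the actual workflow, assuming Snakemake is installed and on the PATH.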
- Snakemake for workflow management
- Harmony for integration
- CellTypist for cell type annotation
- Developed as part of a project at the Kuppe Lab of Quantitative Cell Dynamics and Translational Systems Biology, UKAachen