LongBench: a multi-platform reference dataset spanning bulk, single-cell, and single-nucleus transcriptomics across eight human lung cancer cell lines with synthetic spike-in controls.

Content

Introduction
Applications
Data Availability
Data Analysis Tutorials and Workflows
Citing the LongBench project

Introduction

LongBench is a comprehensive reference dataset for benchmarking long-read RNA-sequencing technologies. Derived from eight human lung cancer cell lines with synthetic RNA spike-ins, it includes bulk, single-cell, and single-nucleus RNA-seq data from three long-read sequencing platforms (ONT PCR-cDNA, ONT direct RNA, and PacBio Kinnex), alongside Illumina short-read data for cross-platform comparisons.

The LongBench dataset is a resource for benchmarking and improving sequencing protocols and bioinformatics tools. Using it, we systematically evaluate transcript capture, quantification, and differential expression analysis, and examine the strengths and limitations of each sequencing platform in different biological contexts, so that researchers can make more informed decisions about platform and method selection.

More details of the dataset can be found here.

Applications

Some example applications of the LongBench dataset include:

Studying biological questions including isoform diversity, variant calling, allele-specific expression, and RNA modifications across lung cancer subtypes.
Benchmarking analysis tools and developing methods for isoform quantification, differential expression (DE), differential transcript usage (DTU), and variant calling.
Cross-platform comparisons of accuracy, sensitivity, and biases between ONT, PacBio, and Illumina sequencing technologies.

Data Availability

The LongBench dataset is publicly hosted on AWS S3 and includes:

Raw FASTQ files – bulk, single-cell, and single-nucleus RNA-seq
POD5 files – ONT direct RNA reads for RNA modification analysis
Count matrices – processed gene/transcript expression tables
Cell line annotation – metadata for single-cell and single-nucleus datasets

Download Instructions

Install the AWS CLI:

pip install awscli

List the data in the S3 bucket:

aws s3 ls s3://longbench-data/ --no-sign-request

Download data:

aws s3 sync s3://longbench-data/<dir name> <local dir name> --no-sign-request
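
Because some directories may be large, it can help to preview a transfer or restrict it to certain files before downloading. The following is a minimal sketch, not part of the LongBench project: the function names are illustrative, the directory arguments are placeholders for whatever the listing command shows, and any file pattern such as "*.fastq.gz" is an assumption about file naming that should be checked against the bucket listing. It relies only on the standard `--dryrun`, `--exclude`, and `--include` options of `aws s3 sync`.

```shell
# Sketch only: thin wrappers around the documented `aws s3 sync` call.
# Function names are illustrative; they are not provided by LongBench.

LONGBENCH_BUCKET="s3://longbench-data"

# Preview what would be copied; --dryrun lists the operations
# without transferring any data.
longbench_preview() {
  remote_dir="$1"; local_dir="$2"
  aws s3 sync "${LONGBENCH_BUCKET}/${remote_dir}" "${local_dir}" \
    --no-sign-request --dryrun
}

# Download only files matching a pattern: --exclude "*" drops everything,
# then --include re-adds the files that match the pattern.
longbench_download_matching() {
  remote_dir="$1"; local_dir="$2"; pattern="$3"
  aws s3 sync "${LONGBENCH_BUCKET}/${remote_dir}" "${local_dir}" \
    --no-sign-request --exclude "*" --include "${pattern}"
}
```

After sourcing these definitions, `longbench_preview "<dir name>" "<local dir name>"` shows what the sync would fetch, and `longbench_download_matching "<dir name>" "<local dir name>" "<pattern>"` performs the filtered download.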

Data Analysis Tutorials and Workflows

The following tutorials are available that demonstrate how to analyse the LongBench data:

Citing the LongBench project

If you use the LongBench dataset, please cite:

Yupei You, Ashleigh Solano, James Lancaster, Margaux David, Changqing Wang, Shian Su, Kathleen Zeglinski, Reza Ghamsari, Manveer Chauhan, Josie Gleeson, Yair D. J. Prawer, Jin Ng, Benedicte Dubois, Isabelle Cleynen, Marie-Liesse Asselin-Labat, Kate D. Sutherland, Michael B. Clark, Quentin Gouil, Matthew E. Ritchie. Benchmarking long-read RNA-sequencing technologies with LongBench: a cross-platform reference dataset profiling cancer cell lines with bulk and single-cell approaches. bioRxiv 2025.09.11.675724. https://doi.org/10.1101/2025.09.11.675724
