diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 4a682b6..9635086 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -25,6 +25,8 @@ jobs:
with:
mamba-version: "*"
channels: conda-forge,bioconda
+ channel-priority: strict
+ conda-remove-defaults: true
auto-activate-base: false
activate-environment: psqan_venv
environment-file: environment.yml
diff --git a/README.md b/README.md
index 3f6f610..8c2dbed 100644
--- a/README.md
+++ b/README.md
@@ -6,10 +6,10 @@
[](https://github.com/sid-sethi/PSQAN/actions/workflows/main.yml)
-
-
## Table of contents
+- [Overview](#overview)
- [Introduction](#introduction)
+- [Description](#description)
- [Input data](#input-data)
- [Normalising transcript expression per gene](#normalising-transcript-expression-per-gene)
- [Isoform categorisation](#isoform-categorisation)
@@ -18,23 +18,27 @@
- [Dependencies](#dependencies)
- [Installation](#installation)
- [Input](#input)
+ - [Practical Suggestions](#practical-suggestions)
- [Usage](#usage)
- [Output](#output)
- - [Directory structure](#directory-structure)
+ - [Directory structure](#directory-structure)
+ - [Visualisations](#visualisations)
- [Transcript detected vs. expression](#transcript-detected-vs-expression)
- [Transcript count](#transcript-count)
- [Transcript expression](#transcript-expression)
- [Ranked transcripts](#ranked-transcripts)
- - [Other useful commands](#other-useful-commands)
- - [Licence](#licence)
+- [Other useful commands](#other-useful-commands)
+- [Licence](#licence)
+## Overview
+
## Introduction
-Despite the advances in tools to process long-read RNA-seq data, the downstream analysis of transcriptional data remains challenging due to the detection of thousands of novel transcripts. From such a large number of transcripts, it is difficult to distinguish between stable transcripts of potential biological importance, partially processed RNAs and splicing noise. It is important to select only the novel transcript models which are reproducible across the samples with a minimum expression value. However, it is difficult to identify optimal expression thresholds to remove artefacts. Consequently, researchers find it challenging to interpret long-read RNA-seq data effectively and generate relevant hypothesis which could be experimentally validated in the laboratory.
-
-PSQAN (Post Sqanti QC ANalysis) is a Snakemake workflow designed to help researchers identify high-confidence transcripts associated with candidate genes. PSQAN performs a gene-based analysis on characterised transcripts generated by [SQANTI3](https://github.com/ConesaLab/SQANTI3 "SQANTI homepage") and [TALON](https://github.com/mortazavilab/TALON/tree/master "TALON homepage"). PSQAN normalises transcript expression per gene and re-groups transcripts into actionable categories to support transcript prioritisation, hence making the results more interpretable. PSQAN generates visualisations to help users determine optimal expression thresholds for detecting both known and novel transcripts of probable biological importance. Furthermore, PSQAN allows users to apply multiple transcript level expression thresholds, both to per sample and across all samples. Lastly, PSQAN generates visualisations and an HTML report, enabling users to explore the known and novel transcripts expressed by a gene, alongside their transcript categories and transcript expression. An example of the report generated by PSQAN for a single gene can be downloaded [here](example_output/report.html).
+Despite the advances in tools to process long-read RNA-sequencing (lrRNA-seq) data, the downstream analysis of transcriptional data remains challenging due to the detection of thousands of novel transcripts and the lack of tools to prioritise functionally important transcripts. From such a large number of transcripts, it is difficult to distinguish between stable transcripts of potential biological importance, partially processed RNAs and splicing noise. Furthermore, when using lrRNA-seq to identify rare and novel transcripts, the recommendation is to incorporate multiple replicates in the study design and implement transcript-level filters. However, determining optimal expression thresholds for filtering and selecting transcripts which are reproducible across samples remains a significant challenge. Consequently, researchers find it challenging to interpret lrRNA-seq data effectively and generate relevant hypotheses which could be experimentally validated in the laboratory.
+PSQAN (**P**ost-transcriptomic **S**tructural **Q**uality **A**ssessment and **N**ormalisation) is a Snakemake workflow designed to help researchers identify high-confidence transcripts associated with candidate genes. PSQAN performs a gene-based analysis on characterised transcripts generated by [SQANTI3](https://github.com/ConesaLab/SQANTI3 "SQANTI homepage") and [TALON](https://github.com/mortazavilab/TALON/tree/master "TALON homepage"). PSQAN normalises transcript expression per gene and re-groups transcripts into actionable categories to support transcript prioritisation, hence making the results more interpretable. PSQAN generates visualisations to help users determine optimal expression thresholds for detecting both known and novel transcripts of probable biological importance. Furthermore, PSQAN allows users to apply multiple transcript-level expression thresholds, both per sample and across all samples. Lastly, PSQAN generates visualisations and an HTML report, enabling users to explore the known and novel transcripts expressed by a gene, alongside their transcript categories and transcript expression. An example of the report generated by PSQAN for a single gene can be downloaded [here](example_output/report.html).
+## Description
### Input data
PSQAN can be used with the transcript characterisation output of either SQANTI3 or TALON, which are the two most prominently used tools in long-read RNA-seq data analysis. PSQAN takes the output produced by SQANTI3 or TALON as input, along with a list of candidate genes to analyse. For each gene, PSQAN extracts the isoforms associated with the gene from the output generated by SQANTI3/TALON. Since the filtering steps in SQANTI3 and TALON are optional and may be skipped, PSQAN applies its own filtering criteria prior to processing to ensure the removal of potential genomic contamination and rare PCR artifacts. PSQAN removes isoforms with a high percentage of genomic "A"s in their downstream 20 bp window (80% is the default), or if one of its junctions is predicted to be a template switching artifact (tagged as "RTS_stage" by SQANTI3).
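The pre-filtering rule described above can be sketched as follows. This is a minimal illustration, not PSQAN's actual code: it assumes each isoform is a dictionary keyed by the SQANTI3 classification column names `perc_A_downstream_TTS` and `RTS_stage`, and it assumes that an isoform is removed when its percentage of genomic "A"s is greater than or equal to the cut-off. The function name `keep_isoform` and the parameter `max_perc_a` are hypothetical.

```python
def _to_float(value, default=0.0):
    """Convert a table cell to float, falling back to a default for 'NA' or empty cells."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def keep_isoform(row, max_perc_a=80.0):
    """Return True if the isoform survives the pre-filter (assumed rule, see lead-in)."""
    perc_a = _to_float(row.get("perc_A_downstream_TTS"))
    # RTS_stage is treated as a TRUE/FALSE flag stored as text.
    is_rts = str(row.get("RTS_stage", "")).strip().upper() == "TRUE"
    return perc_a < max_perc_a and not is_rts


isoforms = [
    {"isoform": "iso1", "perc_A_downstream_TTS": "35.0", "RTS_stage": "FALSE"},
    {"isoform": "iso2", "perc_A_downstream_TTS": "85.0", "RTS_stage": "FALSE"},
    {"isoform": "iso3", "perc_A_downstream_TTS": "20.0", "RTS_stage": "TRUE"},
]
kept = [row["isoform"] for row in isoforms if keep_isoform(row)]
print(kept)
```

Here only `iso1` is kept: `iso2` exceeds the assumed 80% cut-off and `iso3` is flagged as a template switching artifact.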
@@ -50,17 +54,16 @@ Isoform re-categorisation | Yes | No (missing required data)
Transcript-level filtering | Yes | Yes
Visualisations | Yes | Yes
-
-
### Normalising transcript expression per gene
+PSQAN calculates the normalised full-length reads for each transcript (*NFLR**Ti*), which quantifies transcript expression as the percentage of total gene transcription. This normalisation emphasises transcript usage relative to overall gene output, thereby simplifying interpretation. For instance, a transcript with an NFLR value of 10.0 accounts for 10% of all transcripts generated from the gene locus. PSQAN’s normalisation also removes variation due to overall gene expression differences between samples, hence making comparisons of transcript usage independent of absolute gene expression.
-Given a transcript *T* in sample *i* with *FLR* as the number of full-length reads mapped to the transcript *T*, PSQAN calculates the normalised full-length reads (*NFLR**Ti*) as the **percentage of total gene transcription in the sample**:
+Given a transcript *T* in sample *i* with *FLR* as the number of full-length reads mapped to the transcript *T*, PSQAN calculates the normalised full-length reads (*NFLR**Ti*) as:
@@ -68,25 +71,25 @@ where, *NFLR**Ti* represents the normalised full-length read count of
where, *NFLR**T* represents the mean expression of transcript *T* across all samples and *N* is the total number of samples.
-> **_Example:_** For instance, a transcript with a *NFLR* value of *10.0* would mean that it contributes *10%* towards the total transcription of the gene.
+> **_Note:_** PSQAN's normalisation approach has certain limitations that should be considered. First, because this metric does not account for sequencing depth, it may introduce bias if samples with very low coverage are analysed alongside high-coverage samples. Second, it may not be suitable on its own for differential expression analysis at either the gene or transcript level. Third, transcript usage derived from this metric cannot be compared across different genes, as each gene is normalised independently.
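As a worked illustration of the normalisation, the sketch below computes the per-sample value as a transcript's full-length reads divided by the summed full-length reads of all transcripts of the same gene in that sample, multiplied by 100, and then averages over samples. The exact formula is not shown in this section, so this reading of "percentage of total gene transcription" is an assumption, as are the function names and the choice to count a sample in which a transcript is absent as 0.

```python
from collections import defaultdict


def normalise_per_gene(flr):
    """Map (gene, transcript, sample) -> FLR count to per-sample NFLR percentages."""
    gene_totals = defaultdict(float)
    for (gene, _transcript, sample), count in flr.items():
        gene_totals[(gene, sample)] += count

    nflr = {}
    for (gene, transcript, sample), count in flr.items():
        total = gene_totals[(gene, sample)]
        # Guard against a gene with no reads in a sample.
        nflr[(gene, transcript, sample)] = 100.0 * count / total if total else 0.0
    return nflr


def mean_across_samples(nflr, samples):
    """Mean NFLR per transcript over all N samples (absent samples contribute 0)."""
    sums = defaultdict(float)
    for (gene, transcript, _sample), value in nflr.items():
        sums[(gene, transcript)] += value
    n = len(samples)
    return {key: total / n for key, total in sums.items()}


flr = {
    ("GeneID1", "T1", "S1"): 90, ("GeneID1", "T2", "S1"): 10,
    ("GeneID1", "T1", "S2"): 30, ("GeneID1", "T2", "S2"): 10,
}
nflr = normalise_per_gene(flr)
mean_nflr = mean_across_samples(nflr, samples=["S1", "S2"])
print(nflr[("GeneID1", "T2", "S1")])  # 10.0
print(mean_nflr[("GeneID1", "T2")])   # 17.5
```

Transcript `T2` contributes 10% of the gene's reads in `S1` and 25% in `S2`, giving a mean of 17.5 even though its raw count is 10 in both samples. This shows how the metric tracks transcript usage rather than absolute expression.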
-### Isoform categorisation
-For SQANTI output, PSQAN also performs isoform re-categorisation using SQANTI’s output of ORF prediction, NMD prediction and structural categorisation based on comparison with the reference annotation. PSQAN groups the identified isoforms into the following categories:
+### Isoform category re-grouping
+If PSQAN is used with the output of SQANTI3, it also re-groups isoforms into categories which are easy to interpret and facilitate the prioritisation of potentially relevant transcripts. Using the open reading frame (ORF) prediction, nonsense-mediated decay (NMD) prediction and structural categorisation (based on the comparison with reference annotation) of SQANTI3, PSQAN groups the identified isoforms into the following seven categories:
- Non-coding novel - if predicted to be non-coding and not a full-splice match with the reference
- Non-coding known - if predicted to be non-coding and a full-splice match with the reference
- - NMD novel - if predicted to be coding & NMD, and not a full-splice match with the reference
- - NMD known - if predicted to be coding & NMD, and a full-splice match with the reference
- - Coding novel - if predicted to be coding & not NMD, and not a full-splice match with the reference
- - Coding known (complete match) - if predicted to be coding & not NMD, and a full-splice & UTR match with the reference
- - Coding known (alternate 3'/5' end) - if predicted to be coding & not NMD, and a full-splice match with the reference but with an alternate 3’ end, 5’ end or both 3’ and 5’ end.
+ - NMD novel - if predicted to be coding and NMD, and not a full-splice match with the reference
+ - NMD known - if predicted to be coding and NMD, and a full-splice match with the reference
+ - Coding novel - if predicted to be coding and not NMD, and not a full-splice match with the reference
+ - Coding known (complete match) - if predicted to be coding and not NMD, and a full-splice and untranslated region match with the reference
+ - Coding known (alternate 3'/5' end) - if predicted to be coding and not NMD, and a full-splice match with the reference but with an alternate 3’ end, 5’ end or both.
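
The seven categories above amount to a small decision tree, sketched here for illustration. The function `regroup_isoform` and its boolean arguments are hypothetical. How PSQAN derives these flags from the SQANTI3 columns is not shown in this section.

```python
def regroup_isoform(coding, nmd, full_splice_match, utr_match=False):
    """Assign one of the seven categories from four yes/no properties of an isoform."""
    known = full_splice_match
    if not coding:
        return "Non-coding known" if known else "Non-coding novel"
    if nmd:
        return "NMD known" if known else "NMD novel"
    if not known:
        return "Coding novel"
    # Coding, not NMD, full-splice match: split on whether the UTRs also match.
    if utr_match:
        return "Coding known (complete match)"
    return "Coding known (alternate 3'/5' end)"


print(regroup_isoform(coding=True, nmd=False, full_splice_match=True, utr_match=False))
print(regroup_isoform(coding=False, nmd=False, full_splice_match=False))
```

The first call falls into "Coding known (alternate 3'/5' end)" and the second into "Non-coding novel".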
-### Filtering isoforms
+### Transcript-level filtering
+PSQAN implements three filtering criteria which provide flexible, data-driven refinement:
+1. a minimum threshold on transcript expression per sample (*NFLR**Ti*)
+2. a minimum threshold on the mean expression across all samples (*NFLR**T*) - not applicable if data has only one sample
+3. a minimum percentage of samples in which a transcript must meet the minimum per sample expression threshold - not applicable if data has only one sample
-PSQAN can be first run with default filtering thresholds in order to generate pre-filtering visualisations. After exploring the pre-filtering visualisations, appropriate thresholds can be determined and PSQAN can be run again with the determined thresholds. PSQAN allows filtering of transcripts based on the following three values:
-- Minimum value of normalised expression required for a transcript **PER** sample (or replicate)
-- Minimum value of mean normalised expression **ACROSS** all samples required for a transcript (not applicable if data has only one sample)
-- Minimum % of samples which should pass the minimum per sample threshold (not applicable if data has only one sample)
-
+Furthermore, PSQAN provides a visualisation of the [number of detected transcripts as a function of varying expression thresholds](#transcript-detected-vs-expression), showing the number of transcripts which will be retained at every *NFLR**T* threshold. This plot enables users to visually inspect and determine an appropriate *NFLR**T* threshold for retaining high-confidence transcripts. In datasets with multiple samples, PSQAN generates an *NFLR**Ti* curve per sample, allowing researchers to examine the variability in transcript detection across samples. This visualisation supports informed decision-making regarding: (a) the minimum expression value at which a transcript should be expressed within each sample; and (b) the minimum number of samples in which a transcript must meet the minimum expression threshold to be considered reproducible.
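
The three thresholds and the detection curve can be sketched as below. This is an illustration under stated assumptions rather than PSQAN's implementation: the function and parameter names are hypothetical, the three criteria are assumed to be combined with a logical AND, and a value is assumed to pass when it is greater than or equal to its threshold.

```python
def passes_filters(sample_values, min_per_sample, min_mean, min_pct_samples):
    """Apply the three transcript-level criteria to one transcript's per-sample NFLR values."""
    n = len(sample_values)
    if n == 1:
        # With a single sample only the per-sample threshold applies.
        return sample_values[0] >= min_per_sample
    mean_value = sum(sample_values) / n
    n_passing = sum(value >= min_per_sample for value in sample_values)
    pct_passing = 100.0 * n_passing / n
    return mean_value >= min_mean and pct_passing >= min_pct_samples


def detected_vs_threshold(mean_nflr, thresholds):
    """Count transcripts retained at each mean-NFLR threshold (the data behind the curve)."""
    return [(t, sum(value >= t for value in mean_nflr.values())) for t in thresholds]


values = [12.0, 8.0, 0.5]  # one transcript across three samples
print(passes_filters(values, min_per_sample=1.0, min_mean=5.0, min_pct_samples=60))
print(passes_filters(values, min_per_sample=1.0, min_mean=5.0, min_pct_samples=80))

curve = detected_vs_threshold({"T1": 82.5, "T2": 17.1, "T3": 0.4}, thresholds=[0, 1, 20])
print(curve)
```

The transcript passes when 60% of samples must meet the per-sample threshold (two of three do) and fails when 80% are required. The curve data show how the number of retained transcripts falls as the threshold rises.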
## Getting Started
@@ -113,7 +116,7 @@ conda env create -f environment.yml --prefix psqan_venv
- SQANTI output (classification.txt) or TALON output (read_annot.tsv)
- A file containing Gene IDs of genes of interest to analyse
-Example of gene id file:
+[Example](tests/test_data/sqanti_genes.txt) of gene id file:
```
gene_id
GeneID1
@@ -121,6 +124,9 @@ GeneID2
GeneID3
```
+### Practical Suggestions
+- Please see the note on PSQAN limitations under the [normalisation section](#normalising-transcript-expression-per-gene). We recommend using PSQAN on samples with comparable sequencing depth.
+- PSQAN can first be run using the default filtering thresholds to generate pre-filtering visualisations. After examining these visualisations, users can determine suitable thresholds and rerun PSQAN with the updated parameters.
### Usage
@@ -175,28 +181,29 @@ working directory
|--- Gene_B/
|--- Gene_C/
```
+#### Visualisations
+PSQAN generates multiple visualisations to aid in the interpretation of results. In addition to visualising [transcripts across varying expression thresholds](#transcript-detected-vs-expression), PSQAN plots the [number of transcripts detected in each isoform category](#transcript-count) and their [normalised expression](#transcript-expression), both before and after filtering. For datasets with multiple samples, PSQAN also computes variability across the samples as standard deviation, which is displayed as error bars. PSQAN displays all transcripts associated with a gene [ranked by their normalised expression](#ranked-transcripts) and coloured by transcript category, allowing users to easily identify dominant transcripts of a gene. Lastly, PSQAN provides an option to generate a gene-level [HTML report](example_output/report.html), compiling all visualisations to facilitate result interpretation.
-#### Transcript detected vs. expression
-
+##### Transcript detected vs. expression
Number of transcripts detected as a function of varying expression thresholds. This plot can be used to determine the minimum expression threshold to identify high-confidence transcripts.
diff --git a/images/figure_for_github.png b/images/figure_for_github.png
new file mode 100755
index 0000000..1629d45
Binary files /dev/null and b/images/figure_for_github.png differ
diff --git a/images/psqan_model.png b/images/psqan_model.png
deleted file mode 100755
index b9de3be..0000000
Binary files a/images/psqan_model.png and /dev/null differ