This documentation describes a pipeline to assemble a new human transcriptome, annotate novel lncRNAs and perform downstream analyses. The "Core Evaluation of lncRNAs" is the part of the pipeline that turns the identification of an RNA sequence into functional and biological knowledge.
The pipeline comprises the following steps, each listed with its respective algorithm
(click on the badge of an algorithm to open its GitHub repository or equivalent page):
1. RNA-Seq raw reads download using the NCBI-SRA fasterq-dump tool;
2. Raw reads pre-processing and sequencing quality analyses using the fastp and FastQC tools;
3. Mapping of trimmed reads to the human reference transcriptome and mapping quality analyses;
4. Assembly of human transcripts and of the new human transcriptome;
5. Mapping of trimmed reads to the new human transcriptome assembly;
6. Coding potential analyses and transcript annotation;
7. Read counting and differential expression analysis;
8. Gene co-expression network and gene ontology analyses.
Steps 1 to 6 were run in Bash; steps 7 and 8 were run in R.
Step 1
The fasterq-dump tool was run with its default parameters.
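As a rough illustration of Step 1, the sketch below wraps fasterq-dump in a small shell function. The function name `download_runs`, the output directory and the run accessions are placeholders, not part of the pipeline; `--outdir` only sets where the FASTQ files are written, and every other option is left at its default.

```shell
# Hypothetical helper: download one or more SRA runs into a single directory.
# Usage: download_runs <output_dir> <run_accession> [<run_accession> ...]
download_runs() {
    outdir="$1"
    shift
    mkdir -p "$outdir"
    for acc in "$@"; do
        # Default parameters, apart from the output location.
        fasterq-dump "$acc" --outdir "$outdir"
    done
}

# Example call (placeholder accessions):
# download_runs raw_reads SRR0000001 SRR0000002
```

For paired-end runs, fasterq-dump writes files named `<accession>_1.fastq` and `<accession>_2.fastq` by default.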
Step 2
Aggressive trimming tools or settings can lower the rate of uniquely mapped reads (steps 3 & 5) and impair downstream analyses, so check the mapping rates after trimming. For this reason, fastp was run with its default parameters.
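A minimal sketch of Step 2 could look like the following, assuming paired-end reads named `<sample>_1.fastq` and `<sample>_2.fastq`. The function name `trim_pair` and the output file names are illustrative choices; fastp receives only input, output and report paths, so its trimming and filtering behaviour stays at the defaults.

```shell
# Hypothetical helper: trim one read pair with fastp, then run FastQC on the result.
# Usage: trim_pair <reads_1.fastq> <reads_2.fastq> <output_dir>
trim_pair() {
    r1="$1"
    r2="$2"
    outdir="$3"
    sample=$(basename "$r1" _1.fastq)
    mkdir -p "$outdir"

    # Default fastp behaviour; only file locations are specified.
    fastp -i "$r1" -I "$r2" \
          -o "$outdir/${sample}_1.trimmed.fastq" \
          -O "$outdir/${sample}_2.trimmed.fastq" \
          -j "$outdir/${sample}.fastp.json" \
          -h "$outdir/${sample}.fastp.html"

    # Sequencing quality report for the trimmed reads.
    fastqc -o "$outdir" \
           "$outdir/${sample}_1.trimmed.fastq" \
           "$outdir/${sample}_2.trimmed.fastq"
}

# Example call (placeholder file names):
# trim_pair raw_reads/SRR0000001_1.fastq raw_reads/SRR0000001_2.fastq trimmed
```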
Steps 3 & 5
STAR index generation and mapping were performed using the ENCODE standard options, as described in the STAR manual and by Maciel, L.F. and Verjovski-Almeida, S. (2020).
After performing the second mapping, verify the percentage of uniquely mapped reads (recommended for gene expression analyses).
Mapping rates:
- Low: < 50% (indicative of a problem with library preparation or data processing).
- Good: 80 to 90%.
- Very good: > 90%.
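The check above can be scripted. The sketch below reads the "Uniquely mapped reads %" line that STAR writes to its `Log.final.out` file and assigns one of the categories listed above. The function names are hypothetical, and the label `intermediate` for the 50 to 80% range is an assumption, since that range is not classified in this document.

```shell
# Extract the percentage of uniquely mapped reads from a STAR Log.final.out file.
uniquely_mapped_pct() {
    awk -F'|' '/Uniquely mapped reads %/ { gsub(/[ \t%]/, "", $2); print $2 }' "$1"
}

# Classify a percentage with the thresholds listed above.
# "intermediate" (50 to 80%) is a placeholder label, not defined in this document.
classify_mapping_rate() {
    awk -v p="$1" 'BEGIN {
        if (p < 50)       print "low"
        else if (p < 80)  print "intermediate"
        else if (p <= 90) print "good"
        else              print "very good"
    }'
}

# Example call (placeholder path):
# classify_mapping_rate "$(uniquely_mapped_pct star/2ndMapping/Log.final.out)"
```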
To run the RSeQC modules tin.py and geneBody_coverage.py, first index the sorted BAM files from the second mapping, for example with the samtools index command. The .bai files are not referenced explicitly in the code, but they must be stored alongside the BAM files for these modules to run successfully.
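The indexing step might be scripted as below. This is only a sketch: the function name `index_bams` and the directory passed to it are placeholders, and it assumes the sorted BAM files from the second mapping sit together in one directory.

```shell
# Hypothetical helper: index every BAM file found in a directory.
# samtools index writes <file>.bam.bai next to each BAM file.
index_bams() {
    bam_dir="$1"
    for bam in "$bam_dir"/*.bam; do
        # Skip the unexpanded pattern when the directory holds no BAM files.
        [ -e "$bam" ] || continue
        samtools index "$bam"
    done
}

# Example call (placeholder path):
# index_bams star/2ndMapping
```

Keeping each .bai file next to its BAM file lets the RSeQC modules find the index without any extra argument.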
It is recommended to build one superdirectory for each PRJNA project, containing one directory per analysis and symbolic links to the required files (star/index, star/1stMapping and star/2ndMapping).
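One possible reading of this layout is sketched below: the STAR directories are created once inside the project superdirectory, and each analysis directory receives symbolic links to them. The function name `make_project_layout` and the analysis directory names are assumptions chosen for illustration; adapt them to your own project.

```shell
# Hypothetical helper: create a project superdirectory with linked STAR outputs.
# Usage: make_project_layout <project_dir> <analysis_name> [<analysis_name> ...]
make_project_layout() {
    project_dir="$1"
    shift
    mkdir -p "$project_dir/star/index" \
             "$project_dir/star/1stMapping" \
             "$project_dir/star/2ndMapping"
    for analysis in "$@"; do
        mkdir -p "$project_dir/$analysis"
        for sub in index 1stMapping 2ndMapping; do
            # Relative links keep working if the superdirectory is moved as a whole.
            ln -sfn "../star/$sub" "$project_dir/$analysis/$sub"
        done
    done
}

# Example call (placeholder project and analysis names):
# make_project_layout PRJNA000000 rseqc counts
```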
In alphabetical order:
- Ana Carolina Tahira
- Caio Felipe Freire de Sousa
- Sergio Verjovski-Almeida
- Thalles Souza-Lopes
When using this pipeline, please cite the following reference: Souza-Lopes et al., 2025 | DOI: .
All comments and suggestions are appreciated. Please submit them to thalleslopes.github@gmail.com.
Thank you!