Description:

This documentation describes a pipeline that assembles a new human transcriptome, annotates novel lncRNAs and performs further downstream analyses. The "Core Evaluation of lncRNAs" is the part of the pipeline that turns the identification of an RNA sequence into functional and biological knowledge.

Summary

Introduction
Details
Contributors
Citation
License
Comments and Suggestions?

Introduction

This pipeline comprises the following steps and the respective algorithms:

(Click on the badge of the desired algorithm to open its GitHub repository or equivalent page.)

1. RNA-Seq raw reads download using the NCBI-SRA fasterq-dump tool;
2. Raw reads pre-processing and sequencing quality analyses using the fastp and FastQC tools;
3. Mapping of trimmed reads to the human reference transcriptome and mapping quality analyses;
4. Assembly of human transcripts and new human transcriptome;
5. Mapping of trimmed reads to the new human transcriptome assembly;
6. Coding potential analyses and transcript annotation;
7. Reads counting and differential expression analysis;
8. Gene co-expression network and gene ontology analyses.

Details

Steps 1 to 6 were run in Bash; steps 7 and 8 were run in R.

Step 1
The fasterq-dump algorithm was run using the default parameters.
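A minimal sketch of this step, assuming the run accessions are listed one per line in a text file. The helper names (`run`, `download_all`), the `DRY_RUN` switch and the file name `accessions.txt` are hypothetical and not part of the pipeline; `fasterq-dump` is called without options, in line with the default parameters mentioned above.

```shell
#!/usr/bin/env bash
# Sketch only: download SRA runs with fasterq-dump defaults.
# DRY_RUN=1 (the default here) prints each command instead of running it.

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: read accessions (one per line) from a file,
# skipping blank lines and lines starting with '#'.
download_all() {
    local list="$1" acc
    while IFS= read -r acc || [ -n "${acc}" ]; do
        case "${acc}" in
            ''|'#'*) continue ;;
        esac
        run fasterq-dump "${acc}"
    done < "${list}"
}
```

For a real download, something like `DRY_RUN=0; download_all accessions.txt` could be used once the printed commands look right.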
Step 2
Aggressive trimming tools or settings can lower the rate of uniquely mapped reads (steps 3 & 5), which may impair downstream analyses, so check the mapping rates after trimming. For this reason, fastp was run with default parameters.
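As an illustration, the sketch below trims one sample with fastp and then runs FastQC on the result. It assumes paired-end reads named `<sample>_1.fastq` and `<sample>_2.fastq`; the output names, the `trimmed` directory, running FastQC on the trimmed reads, and the `run`/`trim_and_qc` helpers are all assumptions made for the example. Only the input and output options of fastp are set, so trimming itself stays at the defaults.

```shell
#!/usr/bin/env bash
# Sketch only: fastp with default trimming settings, followed by FastQC.
# DRY_RUN=1 (the default here) prints each command instead of running it.

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: trim one paired-end sample and run FastQC on the output.
trim_and_qc() {
    local sample="$1" outdir="${2:-trimmed}"
    local r1="${outdir}/${sample}_1.trimmed.fastq"
    local r2="${outdir}/${sample}_2.trimmed.fastq"

    # FastQC expects its output directory to exist already.
    run mkdir -p "${outdir}/fastqc"

    # -i/-I are the paired inputs, -o/-O the paired outputs.
    run fastp -i "${sample}_1.fastq" -I "${sample}_2.fastq" -o "${r1}" -O "${r2}"

    run fastqc -o "${outdir}/fastqc" "${r1}" "${r2}"
}
```

Calling `trim_and_qc SRR0000001` in dry-run mode prints the three commands, which makes it easy to confirm that no extra trimming options slipped in.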
Steps 3 & 5
STAR index generation and mapping were performed with the ENCODE standard options, as described in the STAR manual and by Maciel, L.F. and Verjovski-Almeida, S. (2020).
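The sketch below shows what index generation and mapping with these options might look like. The option values are the ENCODE options as listed in the STAR manual and should be checked against the manual version in use. The input layout (genome FASTA plus GTF annotation, paired-end reads), the coordinate-sorted BAM output and the helper names are assumptions for the example, not the pipeline's actual scripts.

```shell
#!/usr/bin/env bash
# Sketch only: STAR index generation and mapping with ENCODE options.
# DRY_RUN=1 (the default here) prints each command instead of running it.

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Build a STAR index from a FASTA file and a GTF annotation (assumed inputs).
star_index() {
    local fasta="$1" gtf="$2" index_dir="$3"
    run STAR --runMode genomeGenerate \
        --genomeDir "${index_dir}" \
        --genomeFastaFiles "${fasta}" \
        --sjdbGTFfile "${gtf}"
}

# Map one paired-end sample; the filtering and alignment options are the ENCODE set.
star_map() {
    local index_dir="$1" r1="$2" r2="$3" prefix="$4"
    run STAR --genomeDir "${index_dir}" \
        --readFilesIn "${r1}" "${r2}" \
        --outFileNamePrefix "${prefix}" \
        --outSAMtype BAM SortedByCoordinate \
        --outFilterType BySJout \
        --outSAMattributes NH HI AS NM MD \
        --outFilterMultimapNmax 20 \
        --outFilterMismatchNmax 999 \
        --outFilterMismatchNoverReadLmax 0.04 \
        --alignIntronMin 20 \
        --alignIntronMax 1000000 \
        --alignMatesGapMax 1000000 \
        --alignSJoverhangMin 8 \
        --alignSJDBoverhangMin 1 \
        --sjdbScoreRange 1
}
```

The same `star_map` call can serve both mappings by pointing `index_dir` at the index for the reference or at the index built for the new assembly.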

After performing the second mapping, verify the percentage of uniquely mapped reads (recommended for gene expression analyses).

Mapping rates:
Low mapping rate | < 50%, indicative of a problem with library preparation or data processing.
Good mapping rate | 80 to 90%.
Very good mapping rate | above 90%.
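One way to check these thresholds is to read the "Uniquely mapped reads %" line from STAR's `Log.final.out` and classify the value. The sketch below does that with awk; the function names are hypothetical, and since the ranges above say nothing about rates between 50% and 80%, the sketch labels that band as unclassified rather than guessing.

```shell
#!/usr/bin/env bash
# Sketch only: extract and classify the uniquely mapped read rate from STAR's Log.final.out.

# Print the uniquely mapped percentage as a plain number (for example 85.32).
unique_rate() {
    awk -F'|' '/Uniquely mapped reads %/ {
        sub(/%.*/, "", $2)   # drop the trailing percent sign
        print $2 + 0         # numeric conversion also strips leading whitespace
        exit
    }' "$1"
}

# Classify a percentage using the ranges listed above.
classify_rate() {
    awk -v r="$1" 'BEGIN {
        if (r < 50)       print "low"
        else if (r < 80)  print "intermediate (not classified in the source)"
        else if (r <= 90) print "good"
        else              print "very good"
    }'
}
```

For example, `classify_rate "$(unique_rate Log.final.out)"` prints the label for one sample; the path to `Log.final.out` depends on the output prefix used for the mapping.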

To run the RSeQC modules tin.py and geneBody_coverage.py, first index the sorted BAM files from the 2nd mapping, for example with the samtools index command. The .bai files are not declared in the code, but they must be stored in the same directory as the BAM files for these modules to run successfully.
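A possible way to handle the indexing is sketched below: it indexes every BAM file in a directory that does not yet have a `.bai` next to it, then calls the two RSeQC modules. The helper names, the BED gene model file and the output prefix are assumptions for the example.

```shell
#!/usr/bin/env bash
# Sketch only: index sorted BAM files, then run tin.py and geneBody_coverage.py.
# DRY_RUN=1 (the default here) prints each command instead of running it.

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Index every BAM in a directory that lacks a .bai companion.
# samtools index writes <file>.bam.bai next to the BAM by default.
index_bams() {
    local dir="$1" bam
    for bam in "${dir}"/*.bam; do
        [ -e "${bam}" ] || continue        # the glob matched nothing
        [ -e "${bam}.bai" ] && continue    # already indexed
        run samtools index "${bam}"
    done
}

# Run both RSeQC modules on one BAM with a BED gene model (assumed input).
run_rseqc() {
    local bam="$1" bed="$2" prefix="$3"
    run tin.py -i "${bam}" -r "${bed}"
    run geneBody_coverage.py -i "${bam}" -r "${bed}" -o "${prefix}"
}
```

Running `index_bams` in dry-run mode first gives a quick list of the BAM files that still need an index.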

We recommend creating one top-level directory for each PRJNA project, with a subdirectory for each analysis and symbolic links to the files needed (star/index, star/1stMapping and star/2ndMapping).
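The layout could be set up along these lines. This is one reading of the recommendation, in which the `star` subdirectories hold the real files and each analysis directory links to the BAM files it needs. The project name `PRJNA000000`, the analysis directory name `rseqc` and both function names are placeholders.

```shell
#!/usr/bin/env bash
# Sketch only: one directory per PRJNA project, with symbolic links into analysis directories.

# Create the project skeleton: the star subdirectories plus one directory per analysis.
make_project() {
    local project="$1" analysis
    mkdir -p "${project}/star/index" \
             "${project}/star/1stMapping" \
             "${project}/star/2ndMapping"
    shift
    for analysis in "$@"; do
        mkdir -p "${project}/${analysis}"
    done
}

# Link every BAM and .bai from the 2nd mapping into an analysis directory.
link_bams() {
    local project="$1" analysis="$2" src f
    # Resolve an absolute path so the links work from any working directory.
    src="$(cd "${project}/star/2ndMapping" && pwd)"
    for f in "${src}"/*.bam "${src}"/*.bam.bai; do
        [ -e "${f}" ] || continue          # the glob matched nothing
        ln -sf "${f}" "${project}/${analysis}/"
    done
}
```

For instance, `make_project PRJNA000000 rseqc` followed by `link_bams PRJNA000000 rseqc` leaves the mapping output in one place while the analysis directory only holds links.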

Contributors

In alphabetical order:

Ana Carolina Tahira
Caio Felipe Freire de Sousa
Sergio Verjovski-Almeida
Thalles Souza-Lopes

Citation

When using this pipeline, please cite the following reference: Souza-Lopes et al., 2025 | DOI: .

License

GNU General Public License v3.0

Comments and Suggestions?

All comments and suggestions are appreciated. Please submit them to thalleslopes.github@gmail.com.

Thank you!

TluckDucky/TraceLNC
