-
Notifications
You must be signed in to change notification settings - Fork 0
Description
The typical behavior of TranD 1 GTF geneMode has keepIR set to False. This removes any transcripts with exons overlapping intron areas (removing intron retention transcripts).
This can result in a case where every transcript of a multi-transcript gene is removed, except for 1, essentially making the gene a single-transcript gene.
TranD ER IDs are meant to look like this:
GENE:ER1, GENE:ER2,...
However, when the stated situation occurs, they instead look like this:
TRANSCRIPT_exon_1, TRANSCRIPT_exon_2,....
This is because what is essentially now a single-transcript gene is going through the multi-transcript gene process.
This error occurs in event_analysis.py.
The method do_ea_gene (line 504) in handles single-transcript and multi-transcript genes separately to assure they get the correct ER identifiers. However, the IR transcript removal using remove_ir_transcripts only occurs in the multi-transcript gene processing (lines 528-559), meaning the newly single-transcript genes continue through the multi-transcript gene process without any checks for single-transcript genes.
Potential solution:
The method get_strand (line 262) is where exon regions are created using BedTool.cat on every transcript (tx) in dictionary of transcripts (tx_data) for a gene. The variable all_tx is supposed to merge the exons of every transcript in a gene into an exon region using BedTool.cat. If there is only one transcript in the gene, no merge occurs. This creates a formatting issue where all_tx is is not in the correct format for ER IDs to be properly created. To fix this:
Add all_tx = all_tx.cat(tx, postmerge=True) after line 269.
This simply merges the transcript's bedtool information with itself to get it into the correct format for creating ER IDs. However, this is just an untested bandaid fix.