Metatranscriptome performance #56

Hey @bede,

Thanks for the great tool! I've included it in my pipeline for host-microbiome dual transcriptomics, and our lab has been using it on a variety of projects.

I've been finding that it performs well on metagenomic-like simulated datasets, but on metatranscriptome-like simulated datasets, not all host reads are removed.


I wonder if this is due to alternative splicing, and whether including a splice-aware aligner as one of the options could address it. What do you think? Alternatively, can you think of a way to use a different reference so that a splice-aware aligner isn't needed?



Details on my simulated data can be found [here](https://github.com/sterrettJD/HoMi/tree/main/benchmarking). To summarize the results:
1. If I naively simulate genomic-like community data with human reads from the human pangenome project, then run a linear regression on `percent_removed_reads ~ true_percent_host`:
    - $\beta$ = 0.999 (so almost flawless performance)
2. If I simulate transcriptome-like community data using Polyester, where human reads come from the rna.fna.gz file found [here](https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers), then run a linear regression on `percent_removed_reads ~ true_percent_host`:
    - $\beta$ = 0.856 (about 14% of host reads are missed)
3. If I create semisynthetic communities, where real RNA-seq reads from bacterial isolates + human colon chip samples are subsampled and combined in known quantities, then run a linear regression on `percent_removed_reads ~ true_percent_host`:
    - $\beta$ = 0.932 (about 7% of host reads are missed)


Thanks!
