Vertical integration of datasets #69

@johnne

Thanks for a great package and very useful publications!

I'm working with a time series of samples for which metagenomic (mg) and metatranscriptomic (mt) data have been generated using Illumina sequencing. Reads from all samples have been mapped to the same features (a collection of genes predicted in metagenome-assembled genomes (MAGs)), and these features have been functionally annotated with Prokka and eggNOG-mapper.

There are 84 samples in total, taken at the same location on different dates over three years; for 19 of the sampling dates there are both mg and mt data.

In one part of the project I'm trying to identify MAGs that are differentially abundant in the mt dataset compared to the mg dataset, i.e. whose transcriptional activity is significantly higher or lower. My first idea was to use ALDEx2 for this purpose, which as I understand it is what you refer to in your Quinn et al. 2019 paper under "Vertical data integration":

This allows us to use ALDEx2 to find features where mRNA abundance changes more than protein abundance, relative to a common reference (and vice versa).

which in my case would translate to something like "find MAGs where mRNA abundance changes more than DNA abundance, relative to a common reference".

What I've done so far is:

  1. subset both datasets to the 19 dates with paired omics data (all samples combined into one counts table)
  2. sum raw counts for each MAG
  3. use the omics type as conditions ('mg' and 'mt')
  4. run a modular ALDEx2 analysis with paired t-test and effect size estimation
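A minimal sketch of the four steps above, using simulated counts as a stand-in for the real tables. The object names (`mg`, `mt`, `gene_to_mag`) and the toy dimensions are assumptions made for illustration. The ALDEx2 calls are only run if the package is installed, and the `paired.test` argument of `aldex.effect` is assumed to be available in the installed version. As far as I understand, the paired test matches samples by their order within each condition, so the columns are arranged so that the i-th mg and the i-th mt sample come from the same date.

```r
# Toy stand-in for the real data: gene-level counts (rows) x samples (columns)
set.seed(1)
n_dates <- 19
n_genes <- 60
gene_to_mag <- rep(paste0("MAG", 1:6), each = 10)  # hypothetical gene -> MAG map

mg <- matrix(rpois(n_genes * n_dates, lambda = 50), nrow = n_genes,
             dimnames = list(paste0("gene", seq_len(n_genes)),
                             paste0("mg_d", seq_len(n_dates))))
mt <- matrix(rpois(n_genes * n_dates, lambda = 20), nrow = n_genes,
             dimnames = list(paste0("gene", seq_len(n_genes)),
                             paste0("mt_d", seq_len(n_dates))))

# Step 1: one counts table; the i-th mg and i-th mt column share a sampling date
counts <- cbind(mg, mt)

# Step 2: sum raw counts for each MAG
mag_counts <- rowsum(counts, group = gene_to_mag)

# Step 3: omics type as the condition, in the same order as the columns
conds <- c(rep("mg", n_dates), rep("mt", n_dates))

# Step 4: modular ALDEx2 run with paired t-test and effect sizes
if (requireNamespace("ALDEx2", quietly = TRUE)) {
  x   <- ALDEx2::aldex.clr(as.data.frame(mag_counts), conds,
                           mc.samples = 128, denom = "all")
  tt  <- ALDEx2::aldex.ttest(x, paired.test = TRUE)
  eff <- ALDEx2::aldex.effect(x, paired.test = TRUE)
  res <- data.frame(tt, eff)
}
```

If the real sample names encode the date, it is worth sorting both tables by date before `cbind()` so that the pairing is explicit rather than implied.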

To me this seems like a reasonable approach, but I'm not sure whether I'm missing something. Is this more or less what you intended in the paper? Also, the results will be relative to the geometric mean of the full dataset; is there a better reference to choose in this case?
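To make the reference question concrete, here is a small base-R sketch on simulated counts that contrasts the centred log-ratio (geometric mean of all features as reference) with a log-ratio taken against the geometric mean of a chosen subset of features. The subset `ref_rows` is purely hypothetical. In ALDEx2 the corresponding knob is, as far as I know, the `denom` argument of `aldex.clr`, which accepts options such as `"all"` and `"iqlr"` or a numeric vector of row indices.

```r
# Simulated MAG-level counts: 8 MAGs x 6 samples (3 mg, 3 mt); +1 avoids zeros
set.seed(2)
counts <- matrix(rpois(8 * 6, lambda = 100) + 1L, nrow = 8,
                 dimnames = list(paste0("MAG", 1:8),
                                 c(paste0("mg", 1:3), paste0("mt", 1:3))))
log_counts <- log2(counts)

# Reference = geometric mean of all features in each sample (CLR)
clr <- sweep(log_counts, 2, colMeans(log_counts))

# Reference = geometric mean of a chosen subset of features in each sample
ref_rows <- c(1, 2, 3)  # hypothetical set of "stable" MAGs
sub_ref <- sweep(log_counts, 2,
                 colMeans(log_counts[ref_rows, , drop = FALSE]))

# The same idea in ALDEx2 (not run here):
# x <- ALDEx2::aldex.clr(counts, conds, denom = ref_rows)
```

The two transforms differ only by a per-sample constant, so the choice of reference shifts every feature in a sample by the same amount; what changes is which features are treated as "unchanged" on average.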

I'm also wondering if the 'differential proportionality analysis' could be useful in this context? I've followed along the 1-pipeline.R script from the Supplementary information of your field-guide paper and have tried to understand how (if) it can be applied to my data.

The section of the script that runs:

# Get LPS-treated cells only
rna <- rnaseq.no0[rnaseq.annot$Treatment == "LPS",]
pro <- masshl.no0[masshl.annot$Treatment == "LPS",]

# Join as single matrix
merge <- rbind(rna, pro)
group <- c(rep("RNA", 14), rep("Protein", 14))

# Run propd analysis
pd.ms <- propd(merge, group)

and the downstream processing identifies a reference to which the features are compared. I'm not sure that makes sense for my data, because I would have to choose one MAG as a reference and compare the other MAGs to it. Maybe it does make sense; I just have a hard time understanding how to interpret such a result.

As an alternative, could I perhaps compare the gene-level abundances for each MAG in the mg and mt datasets and choose a house-keeping gene as a reference for each comparison? So the procedure would be something like:

  1. for each MAG (or a subset of MAGs) sum the counts in the mg and mt samples to the feature level (e.g. COGs or KEGG orthologs)
  2. run propd analysis on the feature-level data for the MAG, setting my mg and mt groups.
  3. explore the results using a house-keeping gene (such as Rpl30 or similar) as reference
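The per-MAG procedure above could be sketched as follows, again on simulated data. The feature names, the choice of `rpl30` as reference and the artificially inflated `COG0003` are all assumptions for illustration. Samples are in rows, matching the orientation of `merge` in the quoted script, and 1 is added to every count so that no zeros remain. The `propd` call mirrors the two-argument form from that script and is only run if propr is installed; the base-R part shows how a single reference feature can be read: each feature's log-ratio to the reference is compared between mg and mt.

```r
# Simulated feature-level counts for ONE MAG: samples in rows, features in columns
set.seed(3)
n_pairs  <- 19
features <- c("rpl30", "COG0001", "COG0002", "COG0003", "COG0004")  # hypothetical
group    <- c(rep("mg", n_pairs), rep("mt", n_pairs))
counts <- matrix(rpois(2 * n_pairs * length(features), lambda = 80) + 1L,
                 nrow = 2 * n_pairs,
                 dimnames = list(paste0(group, "_d", rep(seq_len(n_pairs), 2)),
                                 features))
# Make one feature look "over-transcribed" in the toy mt samples
counts[group == "mt", "COG0003"] <- counts[group == "mt", "COG0003"] * 4L

# Step 2: differential proportionality on the feature-level table
if (requireNamespace("propr", quietly = TRUE)) {
  pd <- propr::propd(counts, group)
}

# Step 3: log-ratio of every feature to the house-keeping reference
ref <- "rpl30"
alr <- log2(counts[, setdiff(features, ref)]) - log2(counts[, ref])

# Mean shift in log-ratio (mt minus mg) and a paired t-test per feature
delta <- colMeans(alr[group == "mt", ]) - colMeans(alr[group == "mg", ])
pvals <- apply(alr, 2, function(v) {
  t.test(v[group == "mt"], v[group == "mg"], paired = TRUE)$p.value
})
```

A positive `delta` here means the feature gains in mt relative to the reference gene, which is only as trustworthy as the assumption that the reference itself is transcribed at a constant rate per gene copy.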

I'm thinking this could be a complementary analysis to the ALDEx2 analysis I'm already doing where I'm looking at the MAGs as a whole.

As you might tell I'm struggling a bit here as I feel I almost grasp the concepts but not quite. I would be very grateful for any input you might have on this.
