To examine the in situ activity of protists, Lambert et al., 2022 developed a machine learning model to predict the trophic mode of marine protist species based on gene expression from metatranscriptomes.
- The Lambert model code can be found here.
Recent studies (Groussman et al., 2023; Lasek-Nesselquist and Johnson, 2019; Van Vlierberghe et al., 2021) identified that a number of the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) transcriptomes used for training the model have a low number of sequences and/or high contamination.
We set out to improve the Lambert model as we expected the inclusion of low-sequence and contaminated transcriptomes used for model training likely added noise to the selection of feature Pfams and reduced the accuracy of trophic predictions.
- Updated model software
- Removed contaminated and low-sequence transcriptomes.
These changes did not increase the accuracy of trophic predictions. However:
- The set of feature Pfams needed for reliable predictions was reduced from 1046 to 183 feature Pfams.
- More info on 183 feature Pfams can be found in
furtherInfo/featurePfams.xlsx.
After removing contaminated and low-sequence transcriptomes, we conducted feature selection using mean decrease in accuracy to identify the feature Pfams essential for model performance.
-
Feature selection
- Training dataset is unbalanced, more phototrophic transcriptomes than heterotrophic and mixotrophic transcriptomes.
- To address this, phototrophic transcriptomes were randomly undersampled to create four more balanced datasets:
- Number of phototrophic transcriptomes = 50, 80, 100, 120
- Mixotrophic and heterotrophic transcriptomes were included in full.
- Datasets with undersampled phototrophic transcriptomes can be found here.
- Script:
modelDevelopmentTesting/mda.py.
-
Hyperparameter optimization
- A grid search was performed to optimize model parameters.
- Used one dataset with 100 phototrophic transcriptomes and all mixotrophic and heterotrophic transcriptomes, located here.
- Script:
modelDevelopmentTesting/parameter_gridsearch.py.
The performance of MarPRISM was estimated using cross-validation:
- Cross-validation can be run in
modelDevelopmentTesting/marPRISM_crossValidation.ipynb. - The model was trained on 83% of the training data and tested on the remaining data.
- Performance was evaluated using F1 score (Mean F1 score ± standard error)
- Overall mean: 0.944 ± 0.0154
- Heterotrophy mean: 0.958 ± 0.0271
- Mixotrophy mean: 0.888 ± 0.0254
- Phototrophy mean: 0.961 ± 0.0124
Mixotrophy was the most difficult trophic mode to predict, likely due to overlapping Pfams with both phototrophy and heterotrophy.
To further validate, MarPRISM was tested on cultured protist transcriptomes excluded from training:
- Accuracy (correct prediction across all replicates): 21/27 (77.78%)
- Accuracy (by replicate): 60/76 (78.95%)
Details of how we process metatranscriptomes can be found here.
- Collect poly(A)-selected metatranscriptomes.
- Trim, quality control, and de novo assemble RNA sequences.
- Map transcripts to de novo assemblies. We used
kallistoestimated counts (est_counts). - Functionally annotate transcripts using the Pfam database with
hmmsearch(E-value < 1e-05). - Taxonomically annotate assemblies. We used
Diamond last common ancestor, using the Marine Functional EukaRyotic Reference Taxa (MarFERReT) reference sequence library (E-value < 1e-05). - Sum the estimated number of reads mapped to each contig by taxonomic annotation and sample (metatranscriptome).
- Normalize sequence reads to transcripts per million (TPM):
- Divide the estimated number of reads mapped to each contig by its nucleotide length (in kilobases) to generate reads per kilobase (RPK).
- Sum the RPK by species and sample, then divide by one million to generate a conversion factor.
- Divide the RPK by the conversion factor to calculate transcripts per million per contig.
- Sum transcripts per million by Pfam for each species and sample.
- Filter species bins: Retain only transcript bins identified as protists and at the species level.
- Create a data frame:
Fill in missing Pfams for a species, sample pair with0.
| Pfam1 | Pfam2 | Pfam3 | ... | |
|---|---|---|---|---|
| Species1_sample1 | TPM | TPM | TPM | TPM |
| Species2_sample1 | TPM | TPM | TPM | TPM |
| Species1_sample2 | TPM | TPM | TPM | TPM |
-
Create the Conda environment for MarPRISM
conda env create -f MarPRISM_environment.mlk.yml
-
Activate the Conda environment for MarPRISM
conda activate MarPRISM
-
Run your dataframe through Jupyter Notebook
MarPRISM.ipynb:- Open the Jupyter Notebook
MarPRISM.ipynband follow the instructions within to process the DataFrame and make predictions.
jupyter-notebook MarPRISM.ipynb
-
MarPRISM will output a warning and not make predictions for species bin sample pairs with <70% of eukaryote core transcribed genes (CTGs) expressed.
- Trophic predictions are not reliable for species bins with low coverage.
- The eukaryote CTGs
MarFERReT.v1.core_genes_eukaryota.csvare from MarFERReT.v1.core_genes.csv, filtering for lineage Eukaryota.
-
Files:
- Training Data:
trainingDataMarPRISM.csv - Features:
MarPRISM_featurePfams.csv - Example DataFrame:
exampleDataset.csv - Eukaryote CTGs:
MarFERReT.v1.core_genes_eukaryota.csv
- Training Data:
-
Output:
- To check that your Jupyter Notebook is working correctly, if
exampleDataset.csvis used as input, your outputexampleDataset_trophicPredictions.csvshould match
exampleDataset_trophicPredictions_toCompare.csv.
- To check that your Jupyter Notebook is working correctly, if
- Open the Jupyter Notebook
-
Deactivate the Conda environment for MarPRISM
conda deactivate
-
Filter predictions based on replicate consistency:
- Exclude phototrophy and heterotrophy predictions that are evenly split between replicate metatranscriptomes for the same species bin.
- For non-diel samples, we excluded instances where >25% of trophic predictions across replicates for one species bin fell into phototrophy and heterotrophy categories.
- For diel samples, we excluded instances where >25% of trophic predictions across samples and timepoints in a day for one species bin fell into phototrophy and heterotrophy categories.
-
Prioritize replicate-supported predictions:
- When interpreting results, put more trust in trophic predictions when multiple replicates give you the same trophic mode prediction.
We ran MarPRISM on processed metatranscriptomes collected across the North Pacific Ocean:
- Surface
- Euphotic depths
- Diel cycle
- Onboard nutrient amendment incubations
Most of these metatranscriptomes came from the Gradients (G) cruises.
G1-G3 surface, ALOHA diel, and G3 diel metatranscriptomes were processed for the North Pacific Eukaryotic Gene Catalog.
Scripts for proceasing G2 incubations and G3 depth profiles are available in processMarineMetatranscriptomes.
Estimated counts outputted by kallisto were converted to transcripts per million as outlined in previous section.
- Transcripts per million for each set of samples can be found here.
- G1PA.tpm_counts.csv (G1 surface transect)
- G2PA.tpm_counts.csv (G2 surface transect)
- G3PA.tpm_counts.csv (G3 surface transect)
- G3PA_depth.tpm_counts.csv (G3 depth profiles)
- D1PA.tpm_counts.csv (ALOHA diel study)
- G3PA_diel.tpm_counts.csv (G3 diel study)
- G2PA_incubations.tpm_counts.csv (G2 onboard nutrient amendment incubations)
conda env create -f MarPRISM_environment.mlk.yml
conda activate MarPRISM
jupyter-notebook MarPRISM.ipynb
#substitute exampleDataset.csv in MarPRISM.ipynb for one of the above datasets
conda deactivateexampleDataset_trophicPredictions.csvwhich contains the trophic predictions for each taxonomic bin and sample pair that have ≥70% eukaryote core transcribed genes expressed.
Then we retained only trophic predictions for transcript bins identified as protists and at the species level.
We excluded predictions split between phototrophy and heterotrophy:
- For non-diel samples, exclude instances where >25% of trophic predictions across replicates for one species bin were in both phototrophy and heterotrophy.
- For diel samples, exclude instances where >25% of trophic predictions across samples and timepoints in a day for one species bin were in both phototrophy and heterotrophy.
- Phototrophic transcriptomes were undersampled to select features
- Datasets used for feature selection can be found here.
trainingData_contamLowSeqsRemoved_50phottrainingData_contamLowSeqsRemoved_80phottrainingData_contamLowSeqsRemoved_100phottrainingData_contamLowSeqsRemoved_120phot
conda env create -f MarPRISM_environment.mlk.yml
conda activate MarPRISM
cd modelDevelopmentTesting
#find feature Pfams for xgboost model
python mda.py -model xgboost -data trainingData_contamLowSeqsRemoved_50phot -o features_contamLowSeqsRemoved_50phot_xg
python mda.py -model xgboost -data trainingData_contamLowSeqsRemoved_80phot -o features_contamLowSeqsRemoved_80phot_xg
python mda.py -model xgboost -data trainingData_contamLowSeqsRemoved_100phot -o features_contamLowSeqsRemoved_100phot_xg
python mda.py -model xgboost -data trainingData_contamLowSeqsRemoved_120phot -o features_contamLowSeqsRemoved_120phot_xg
conda deactivateWe then took the union of Pfams in features_contamLowSeqsRemoved_50phot_xg, features_contamLowSeqsRemoved_80phot_xg, features_contamLowSeqsRemoved_100phot_xg, and features_contamLowSeqsRemoved_120phot_xg that had an importance score greater than 0.
- Phototrophic transcriptomes were undersampled to find best performing hyperparameters
- Dataset with 100 phototrophic transcriptomes and all mixotrophic and heterotrophic transcriptomes, located here.
-
trainingData_contamLowSeqsRemoved_100phot
conda env create -f MarPRISM_environment.mlk.yml
conda activate MarPRISM
cd modelDevelopmentTesting
#find best performing hyperparameters for xgboost model
python parameter_gridsearch.py -model xgboost -data trainingData_contamLowSeqsRemoved_100phot/data_contamLowSeqsRemoved_phot100.csv -labels trainingData_contamLowSeqsRemoved_100phot/labels_contamLowSeqsRemoved_phot100.csv -o hyperparameters_contamLowSeqsRemoved_100phot_xg
conda deactivateWe used cross-validation to quantify the performance of MarPRISM as well as other model versions.
conda env create -f MarPRISM_environment.mlk.yml
conda activate MarPRISM
cd modelDevelopmentTesting
jupyter-notebook crossValidation.ipynb
conda deactivate- 'model_overall_f1_scores.csv': the overall mean and standard error of the F1 score for each model tested.
- 'models_byClass_f1_scores.csv': the mean and standard error of the F1 score by trophic mode for each model tested.
- 'marPRISM_k_train_size_vs_f1_score_by_class.csv': for MarPRISM, the mean and standard error of the F1 score by percentage of training data used.
- 'marPRISM_cumulative_confusion_matrix.csv': for MarPRISM, data for confusion matrix, summarizing performance across folds of cross-validation.
cd modelDevelopmentTesting
conda env create -f environment.mlk.yml
conda activate environment
jupyter-notebook lambertModel_crossValidation.ipynb
conda deactivate- 'lambert_model_overall_f1_score.csv': the overall mean and standard error of the F1 score for the previous version of the model (Lambert et al. 2022).
- 'lambert_model_overall_f1_score_byClass.csv': the mean and standard error of the F1 score by trophic mode for the previous version of the model (Lambert et al. 2022).
- 'lambert_model_cumulative_confusion_matrix.csv': for the previous version of the model (Lambert et al. 2022), data for confusion matrix, summarizing performance across folds of cross-validation. Cross-validation is not changed but confusion matrix predictions are only included for entries in trainingDataMarPRISM.csv so that results can be compared to results using MarPRISM.
testTranscriptomes.csv.gzlocated here- Has transcript per million counts for the test transcriptomes.
- Transcriptomes used for testing are from publicly available sources: accession IDs for transcriptomes used to test MarPRISM are available in
testTranscriptomes.xlsxlocated here.
conda env create -f MarPRISM_environment.mlk.yml
conda activate MarPRISM
jupyter-notebook MarPRISM.ipynb
#substitute exampleDataset.csv for testTranscriptomes.csv
conda deactivateexampleDataset_trophicPredictions.csvhas the trophic mode predictions for the test transcriptomes.
The trophic mode predictions can then be compared to their expected trophic mode: testTranscriptomes.xlsx.