This project predicts mRNA ribosome density (a.k.a. translation efficiency or TE) and decay rates in Drosophila melanogaster using a fine-tuned RNA language model. It integrates sequence information from 3' UTRs with additional features like codon usage and RNA secondary structure stability.
- Preprint: A preprint describing this work is available on bioRxiv: 10.64898/2025.12.04.692303v1
- Goal: Predict ribosome density and mRNA decay from sequence data.
- Model: Extended the pretraining of GenaLM Fly (BERT-based) on 5' & 3' UTR pairs and fine-tuned the model with a regression head.
- Features:
  - 5' & 3' UTR sequences
  - Codon usage metrics
  - Minimum Free Energy (MFE) from RNA folding (LinearFold)
  - GC content and sequence length
- Pipeline: Managed by DVC for reproducibility, covering data download, preprocessing, feature extraction, and model training.
- Decay analysis: Detailed decay-rate analyses are kept on the `decay` branch of this repository (see the `decay` branch for notebooks and results).
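
To make the simpler sequence-derived features listed above concrete, here is a minimal Python sketch of GC content, sequence length, and a basic codon frequency table. The function names (`gc_content`, `codon_frequencies`, `simple_features`) and the choice of relative codon frequency as the codon usage metric are assumptions for illustration. They are not taken from this repository's scripts, which may compute these features differently.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_frequencies(cds: str) -> dict:
    """Relative frequency of each codon in a coding sequence.

    Assumption: the CDS starts in frame; trailing bases that do not
    form a complete codon are ignored.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def simple_features(utr5: str, utr3: str, cds: str) -> dict:
    """Hypothetical feature dictionary for one transcript."""
    return {
        "utr5_len": len(utr5),
        "utr3_len": len(utr3),
        "utr5_gc": gc_content(utr5),
        "utr3_gc": gc_content(utr3),
        "codon_freq": codon_frequencies(cds),
    }
```

For example, `simple_features("ACGT", "GGCC", "ATGAAATAA")` returns the two UTR lengths, their GC fractions, and a frequency table over the three codons. MFE is left out here because it requires running an external folding tool.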
Apptainer, conda, and DVC must be installed on your system and available on your PATH.
This DVC pipeline will build the necessary conda environment using the provided environment.yaml.
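
Before running the pipeline, it can help to confirm that the required tools are discoverable. The following is a small sketch, not part of this repository, that uses Python's standard `shutil.which` to look up each tool. The executable names `apptainer`, `conda`, and `dvc` are assumed to match how the tools are installed on your system.

```python
import shutil

# Executable names assumed for the three prerequisites.
REQUIRED_TOOLS = ("apptainer", "conda", "dvc")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```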
To reproduce the pipeline, run the following command:
dvc repro
- `dvc.yaml`: Pipeline definition.
- `params.yaml`: Configuration parameters.
- `scripts/`: Source code for data processing and training.
- `notebooks/`: Exploratory analysis and visualization.