This repository is designed for detecting transposable element (TE) insertions using short-read sequencing data. It enables both the assessment of annotated TE presence/absence in the reference genome and the detection and genotyping of novel insertions not represented in the reference. The repository includes a Snakemake pipeline used for training, deploying, and benchmarking TE insertion detection models based on random forest classifiers. Please note that this is a beta release, and we are actively working to improve usability.
- Snakemake Pipeline
  - The main pipeline is controlled by workflow/Snakefile.
  - It orchestrates rules for:
    - Data preprocessing
    - Model training
    - Model deployment (prediction)
    - Benchmarking
  - The new user-facing training pipeline lives in workflow/training and is launched via TEforest_train.py. Some legacy training/benchmarking scripts still contain hard-coded paths.
- Scripts
  - Supporting scripts for training and model deployment are located in workflow/scripts.
- Pre-Trained Models
  - Trained Random Forest models are stored in workflow/models.
- Outputs
  - Final output files (predictions) will be generated in your chosen work directory under the paths:
    outputs/{sample}_TEforest_bps_nonredundant.bed
    outputs/{sample}_TEforest_bps_nonredundant.vcf
  - Candidate regions prior to precise breakpoint detection are also available:
    outputs/{sample}_TEforest_nonredundant.bed
A Conda environment YAML file (TEforest.yml) is included for convenience. It defines the base dependencies for running this pipeline. However, note that additional R packages such as GenomicAlignments and GenomicRanges should be installed to ensure full functionality. A full Conda installation is under development and will be released shortly.
# Example environment creation
conda env create -f TEforest.yml
# Activate the environment
conda activate TEforest
# Then install additional R packages within this environment.

A Python script named TEforest.py is provided for launching the pipeline. This script will run the trained models on one or more genomes of your choice.
Note: Ongoing usability developments may change the command-line arguments or environment requirements in the future.
python TEforest.py \
--workflow_dir <path/to/workflow> \
--workdir <path/to/desired_workdir> \
--threads 16 \
--consensusTEs <path/to/consensusTEs.fasta> \
--ref_genome <path/to/reference_genome.fasta> \
--ref_te_locations <path/to/te_locations.bed> \
--euchromatin <path/to/euchromatin.bed> \
--model <path/to/pretrained_model.pkl> \
--ref_model <path/to/reference_model.pkl> \
--fq_base_path <path/to/fastq/files> \
--samples A1 A2 A3

- --workflow_dir: Directory containing the Snakefile (workflow/Snakefile).
- --workdir: Directory to store outputs and logs.
- --threads: Number of CPU threads to use. 16 per sample is recommended.
- --consensusTEs, --ref_genome, --ref_te_locations, --euchromatin: Input reference files for TE detection. All calls outside the regions listed in the euchromatin file are filtered out. Example files used for Drosophila melanogaster are located in example_files/. Be aware that BWA-mem2 treats IUPAC bases as missing, so TEforest may have reduced performance on consensus sequences with high IUPAC content.
- --model: Path to the non-reference random forest model. Select the model that best matches the coverage of your reads. Trained models are available for 5X, 10X, 20X, 30X, 40X, and 50X; alignments are automatically downsampled to the nearest of these coverages that is immediately lower than your average coverage.
- --ref_model: Path to the reference random forest model.
- --fq_base_path: Directory containing FASTQ files. File names must contain the sample identifier, formatted as {sample}_1.fastq.gz and {sample}_2.fastq.gz or {sample}_1.fq.gz.
- --samples: Space-separated list of sample identifiers to process. More than one sample can be run in parallel.
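The coverage rule described for --model can be sketched in a few lines of Python. This is only an illustration of the stated rule, not TEforest's own code: the function name pick_model_coverage is hypothetical, and two behaviors are assumptions, namely that an average coverage exactly equal to a tier selects that tier, and that a sample below 5X has no matching model (the function returns None).

```python
from typing import Optional

# Coverage tiers with trained models, as listed in the --model description.
AVAILABLE_COVERAGES = (5, 10, 20, 30, 40, 50)


def pick_model_coverage(average_coverage: float) -> Optional[int]:
    """Return the highest tier that does not exceed the average coverage.

    Hypothetical helper. Assumptions: a tie selects the matching tier, and
    coverage below the lowest tier yields None.
    """
    eligible = [c for c in AVAILABLE_COVERAGES if c <= average_coverage]
    return max(eligible) if eligible else None


if __name__ == "__main__":
    for cov in (3.2, 17.8, 30.0, 64.5):
        print(cov, "->", pick_model_coverage(cov))
```

Under these assumptions, a sample sequenced to about 18X would be paired with the 10X model, and anything at or above 50X with the 50X model.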
The script will generate:
- A config.yaml in your specified workdir with all parameters.
- Intermediate files used to run the pipeline.
- Output .bed files ({sample}_TEforest_bps_nonredundant.bed) for each sample in output/ within the specified workdir.
- Output .vcf files ({sample}_TEforest_bps_nonredundant.vcf) for each sample in output/ within the specified workdir.
VCF notes:
- Non-reference calls are represented as INS records with ALT <INS:ME:<TEFAM>>.
- Present reference TEs are represented as DEL records with GT=0/0 (absence will be GT=1/1 once deletion calling is added).
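When post-processing the VCF, the TE family can be read from the symbolic ALT allele. The sketch below assumes that <TEFAM> in the notation above is a placeholder that real records replace with the family name (for example <INS:ME:roo>); the function name and the example value are illustrative and not taken from TEforest.

```python
import re
from typing import Optional

# Matches a symbolic ALT such as <INS:ME:roo> and captures the family name.
# Assumption: the family name itself contains no angle brackets.
_INS_ME = re.compile(r"^<INS:ME:([^<>]+)>$")


def te_family_from_alt(alt: str) -> Optional[str]:
    """Return the TE family from an <INS:ME:...> ALT value, or None otherwise."""
    match = _INS_ME.match(alt)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(te_family_from_alt("<INS:ME:roo>"))
    print(te_family_from_alt("<DEL>"))
```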
Use TEforest_train.py with the training Snakefile at workflow/training/Snakefile to train a model on your own data. This pipeline labels candidate regions by overlap with a truth BED and trains a LightGBM classifier. Candidate regions that do not overlap a truth locus are labeled absent (0). To train on only part of the genome, use the --euchromatin flag to specify the regions to include; otherwise candidate regions are drawn from the whole genome.
Truth BED requirements:
- Columns: seqnames, start, end, zygosity, sample (and optional tefamily).
- zygosity values are interpreted as 1, 0.5, or 0; in classifier mode these are mapped to classes 2/1/0.
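The labeling step can be pictured with a small Python sketch. It is a simplified illustration, not the pipeline's implementation: it handles a single sample, treats intervals as half-open, counts any overlap of at least one base, and lets the first overlapping truth locus decide the label. All of these choices, along with the function names, are assumptions.

```python
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (seqnames, start, end), half-open

# Classifier-mode mapping of zygosity to class, as described above.
ZYGOSITY_TO_CLASS: Dict[float, int] = {1.0: 2, 0.5: 1, 0.0: 0}


def overlaps(a: Region, b: Region) -> bool:
    """True if both regions are on the same sequence and share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def label_candidates(
    candidates: List[Region],
    truth: List[Tuple[Region, float]],
) -> List[int]:
    """Assign a class to each candidate region from the truth loci."""
    labels = []
    for cand in candidates:
        label = 0  # no overlapping truth locus: labeled absent
        for locus, zygosity in truth:
            if overlaps(cand, locus):
                label = ZYGOSITY_TO_CLASS[zygosity]
                break  # assumption: first overlapping locus wins
        labels.append(label)
    return labels


if __name__ == "__main__":
    candidates = [("3L", 100, 200), ("3L", 500, 600), ("2R", 100, 200)]
    truth = [(("3L", 150, 160), 1.0), (("2R", 190, 300), 0.5)]
    print(label_candidates(candidates, truth))
```

In regressor mode the numeric zygosity would be kept instead of being passed through the class mapping.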
Example usage:
python TEforest_train.py \
--workflow_dir /path/to/TEforest/workflow/training \
--workdir /path/to/training_workdir \
--threads 32 \
--consensusTEs /path/to/consensusTEs.fasta \
--ref_genome /path/to/reference_genome.fasta \
--ref_te_locations /path/to/ref_te_locations.bed \
--euchromatin /path/to/euchromatin.bed \
--truth_bed /path/to/truth_labels.bed \
--fq_base_path /path/to/fastqs \
--samples SAMPLE1 SAMPLE2 \
--label_mode classifier \
--train_mode classifier \
--test_size 0.1 \
--model_out training/teforest_classifier.pkl

Notes:
- --label_mode classifier maps 1/0.5/0 to 2/1/0; regressor keeps the numeric zygosity values.
- --test_size controls the train/test split (0 disables the split).
- Output model path is relative to the workdir unless you give an absolute path.
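The output-path rule in the last note matches how path joining behaves in Python's pathlib: joining onto an absolute path returns that absolute path unchanged. A minimal sketch follows, assuming POSIX-style paths as in the example command; the function name resolve_model_out is hypothetical and may not reflect how TEforest_train.py resolves the path internally.

```python
from pathlib import PurePosixPath


def resolve_model_out(workdir: str, model_out: str) -> PurePosixPath:
    """Resolve --model_out against the workdir.

    A relative model_out is placed under workdir; an absolute model_out is
    returned as given, because joining with an absolute path discards the
    left-hand side.
    """
    return PurePosixPath(workdir) / model_out


if __name__ == "__main__":
    print(resolve_model_out("/path/to/training_workdir",
                            "training/teforest_classifier.pkl"))
    print(resolve_model_out("/path/to/training_workdir", "/abs/model.pkl"))
```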
A small dataset is bundled under tests/data/simulate_te_insertions so you can sanity-check a new install after activating the TEforest conda environment.
# from the repo root
./tests/run_example.sh

This runs TEforest on RL150IS400_rep0_fwd and compares the breakpoint output to the expected call. The expected breakpoint is:
3L 5001 5005 roo|non-reference|1|RL150IS400_rep0_fwd|TEforest|rp|1 0 .
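If you want to inspect calls like this one programmatically, a record can be split into its BED columns and the pipe-delimited name field. The sketch below is illustrative only: the meaning of each pipe-separated entry is not documented here, so the entries are kept as a plain list rather than given names, and the class and function names are hypothetical.

```python
from typing import List, NamedTuple


class BedCall(NamedTuple):
    chrom: str
    start: int
    end: int
    name_fields: List[str]  # pipe-separated entries of the name column
    rest: List[str]         # any remaining columns, left unparsed


def parse_bed_line(line: str) -> BedCall:
    """Split one whitespace-delimited BED record into its parts."""
    cols = line.split()
    chrom, start, end, name = cols[:4]
    return BedCall(chrom, int(start), int(end), name.split("|"), cols[4:])


if __name__ == "__main__":
    expected = "3L 5001 5005 roo|non-reference|1|RL150IS400_rep0_fwd|TEforest|rp|1 0 ."
    print(parse_bed_line(expected))
```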
You can also override the work directory:
./tests/run_example.sh /tmp/teforest_example

- We are actively working on usability improvements over the next several months (e.g., streamlined CLI arguments, a fully functional conda installation, automatic model selection, faster runtime).
- This pipeline has only been tested in Drosophila melanogaster. Future tests and updates will ensure usability in other species.
- Issues and pull requests are welcome.