Skip to content

SchriderLab/TEforest

Repository files navigation

TEforest

This repository is designed for detecting transposable element (TE) insertions using short-read sequencing data. It enables both the assessment of annotated TE presence/absence in the reference genome and the detection and genotyping of novel insertions not represented in the reference. The repository includes a Snakemake pipeline used for training, deploying, and benchmarking TE insertion detection models based on random forest classifiers. Please note that this is a beta release, and we are actively working to improve usability.

Overview

  • Snakemake Pipeline

    • The main pipeline is controlled by workflow/Snakefile.
    • It orchestrates rules for:
      • Data preprocessing
      • Model training
      • Model deployment (prediction)
      • Benchmarking
    • The new user-facing training pipeline lives in workflow/training and is launched via TEforest_train.py. Some legacy training/benchmarking scripts still contain hard-coded paths.
  • Scripts

    • Supporting scripts for training and model deployment are located in workflow/scripts.
  • Pre-Trained Models

    • Trained Random Forest models are stored in workflow/models.
  • Outputs

    • Final output files (predictions) will be generated in your chosen work directory under the path:

      outputs/{sample}_TEforest_bps_nonredundant.bed
      outputs/{sample}_TEforest_bps_nonredundant.vcf
    • Candidate regions prior to precise breakpoint detection are also avaliable:

      outputs/{sample}_TEforest_nonredundant.bed

Environment Setup

A Conda environment YAML file (TEforest.yml) is included for convenience. It defines the base dependencies for running this pipeline. However, note that additional R packages such as GenomicAlignments and GenomicRanges should be installed to ensure full functionality. A full Conda installation is under development and will be released shortly.

# Example environment creation
conda env create -f TEforest.yml

# Activate the environment
conda activate TEforest

# Then install additional R packages within this environment.

How to Launch

A Python script named TEforest.py is provided for launching the pipeline. This script will run the trained models on one or more genomes of your choice.

Note: Ongoing usability developments may change the command-line arguments or environment requirements in the future.

Basic Usage

python TEforest.py \
    --workflow_dir <path/to/workflow> \
    --workdir <path/to/desired_workdir> \
    --threads 16 \
    --consensusTEs <path/to/consensusTEs.fasta> \
    --ref_genome <path/to/reference_genome.fasta> \
    --ref_te_locations <path/to/te_locations.bed> \
    --euchromatin <path/to/euchromatin.bed> \
    --model <path/to/pretrained_model.pkl> \
    --ref_model <path/to/reference_model.pkl> \
    --fq_base_path <path/to/fastq/files> \
    --samples A1 A2 A3
  • --workflow_dir: Directory containing the Snakefile (workflow/Snakefile).
  • --workdir: Directory to store outputs and logs.
  • --threads: Number of CPU threads to use. 16 per sample is recommended.
  • --consensusTEs, --ref_genome, --ref_te_locations, --euchromatin: Input reference files for TE detection. All calls outside of the regions denoted in euchromatin will be filtered. Example files used for Drosophila melanogaster are located in example_files/. Be aware the BWA-mem2 will treat IUPAC bases as missing, so TEforest may have reduced performance on consensus sequences with high IUPAC content.
  • --model: Path to the non-reference random forest model. Select a model that best matches the coverage of your reads. Alignments are automatically downsampled to the nearest available coverage—whichever is immediately lower than your average—using one of the following trained models: 5X, 10X, 20X, 30X, 40X, or 50X.
  • --ref_model: Path to the reference random forest model.
  • --fq_base_path: Directory containing FASTQ files. Should contain sample in name formatted {sample}_1.fastq.gz and {sample}_2.fastq.gz or {sample}_1.fq.gz.
  • --samples: List of sample identifiers to process (space-separated). Note more than one sample can be run in parallel.

The script will generate:

  • A config.yaml in your specified workdir with all parameters.
  • Intermediate files used to run the pipeline
  • Output .bed files ({sample}_TEforest_bps_nonredundant.bed) for each sample in output/ within the specified workdir.
  • Output .vcf files ({sample}_TEforest_bps_nonredundant.vcf) for each sample in output/ within the specified workdir.

VCF notes:

  • Non-reference calls are represented as INS records with ALT <INS:ME:<TEFAM>>.
  • Present reference TEs are represented as DEL records with GT=0/0 (absence will be GT=1/1 once deletion calling is added).

Training pipeline (user-provided truth BED)

Use TEforest_train.py with the training Snakefile at workflow/training/Snakefile to train a model on your own data. This pipeline labels candidate regions by overlap with a truth BED and trains a LightGBM classifier. Candidate regions that do not overlap a truth locus are labeled absent (0). If you are interested in training on one region of the genome,be sure to use the --euchromatin flag to denote the regions you want to include (otherwise all regions of the genome will be considered for candidate regions).

Truth BED requirements:

  • Columns: seqnames, start, end, zygosity, sample (and optional te family).
  • zygosity values are interpreted as 1, 0.5, or 0; in classifier mode these are mapped to classes 2/1/0.

Example usage:

python TEforest_train.py \
    --workflow_dir /path/to/TEforest/workflow/training \
    --workdir /path/to/training_workdir \
    --threads 32 \
    --consensusTEs /path/to/consensusTEs.fasta \
    --ref_genome /path/to/reference_genome.fasta \
    --ref_te_locations /path/to/ref_te_locations.bed \
    --euchromatin /path/to/euchromatin.bed \
    --truth_bed /path/to/truth_labels.bed \
    --fq_base_path /path/to/fastqs \
    --samples SAMPLE1 SAMPLE2 \
    --label_mode classifier \
    --train_mode classifier \
    --test_size 0.1 \
    --model_out training/teforest_classifier.pkl

Notes:

  • --label_mode classifier maps 1/0.5/0 to 2/1/0; regressor keeps the numeric zygosity values.
  • --test_size controls the train/test split (0 disables the split).
  • Output model path is relative to the workdir unless you give an absolute path.

Quick install test (example dataset)

A small dataset is bundled under tests/data/simulate_te_insertions so you can sanity-check a new install after activating the TEforest conda environment.

# from the repo root
./tests/run_example.sh

This runs TEforest on RL150IS400_rep0_fwd and compares the breakpoint output to the expected call. The expected breakpoint is:

3L    5001    5005    roo|non-reference|1|RL150IS400_rep0_fwd|TEforest|rp|1    0    .

You can also override the work directory:

./tests/run_example.sh /tmp/teforest_example

Contributing

  • We are actively working on usability improvements over the next several months (e.g., streamlined CLI arguments, a fully functional conda installation, automatic model selection, faster runtime).
  • This pipeline has only been tested in Drosophila melanogaster. Future tests and updates will ensure useability in other species.
  • Issues and pull requests are welcome.

About

Finding TEs with short-read data using random forests

Resources

License

Stars

Watchers

Forks

Packages

No packages published