A sequence-based machine-learning framework for predicting genes

AIBreeding/DCGP

DCGP

Domestication Convergence Gene Prediction

DCGP is a sequence-based machine-learning framework for predicting genes under convergent selection during domestication across 43 plant species. Built on the standardized comparative-genomic resource PhytoPop20K, DCGP enables cross-species inference of domestication convergence from sequence-encoded features alone, without requiring population-genetic statistics or allele-frequency contrasts. The model is trained on labeled convergent-selection genes curated in PhytoPop20K and on sequence embeddings generated by Evo 2, which summarize nucleotide-level information into quantitative features suitable for machine learning. Once trained, DCGP can be applied to new genomes by passing gene sequences through the model, with no need for population-level data, demographic modeling, or selection scans.
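As a rough illustration of how nucleotide-level embeddings can be summarized into one fixed-length feature vector per gene, the sketch below averages per-position vectors column by column. This is an assumption for illustration only: the README does not state which pooling strategy DCGP uses, and `mean_pool` and the toy numbers are hypothetical.

```python
from statistics import fmean


def mean_pool(token_embeddings):
    """Average per-position vectors into one fixed-length feature vector.

    Hypothetical helper: DCGP's actual pooling strategy is not documented here.
    """
    if not token_embeddings:
        raise ValueError("need at least one embedding vector")
    dim = len(token_embeddings[0])
    if any(len(vec) != dim for vec in token_embeddings):
        raise ValueError("all embedding vectors must share one dimension")
    # Column-wise mean: one value per embedding dimension.
    return [fmean(column) for column in zip(*token_embeddings)]


# Toy example: three positions, two embedding dimensions (made-up numbers).
gene_features = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(gene_features)  # [3.0, 4.0]
```

Whatever the pooling choice, the result is a vector whose length does not depend on gene length, which is what allows a standard classifier to consume it.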

Related Software and Tools

  • DNNGP – Deep neural network for genomic prediction.
  • EXGEP – A framework for predicting genotype-by-environment interactions using ensembles of explainable machine-learning models.
  • GxEtoolkit – An automated and explainable machine learning framework for Genome Prediction.
  • BDP-identifier – Genomic Language Model-Based Prediction of Bidirectional Promoter Activity.
  • KANMB – A machine learning training and prediction tool based on KAN (Kolmogorov-Arnold Network) for identifying optimal metabolites from metabolite expression data.

Getting started

Requirements

  • python 3.11
  • conda/pip

Installation

Install packages:

  1. Create a Python environment.
conda create -n dcgp python=3.11
conda activate dcgp
  2. Clone this repository and cd into it.
git clone https://github.com/AIBreeding/DCGP.git
cd ./DCGP
conda env create -f environment.yml -n dcgp
  3. If the installation above is unsuccessful, refer to the official Evo2 installation guide (https://github.com/arcinstitute/evo2) to resolve any dependency or installation issues.

Usage

1. Use Evo2 to generate sequence embeddings

python ./embedding.py \
        --fasta ./sequence/sample.fa \
        --step 250 \
        --window 1024 \
        --layer "blocks.24" \
        --save_txt ./sequence/embedding.txt \
        --out ./sequence/sa.pt
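
To make the `--window 1024` and `--step 250` values concrete, here is a minimal sketch of sliding-window tiling. It assumes that `--window` is the window length and `--step` is the stride in bases, and it adds a final window flush with the sequence end; whether `embedding.py` behaves this way is not stated in this README, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(seq_len, window=1024, step=250):
    """Return (start, end) pairs tiling a sequence of length seq_len.

    Assumed semantics only; embedding.py may tile sequences differently.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if seq_len <= window:
        # A short sequence fits in a single window.
        return [(0, seq_len)]
    starts = list(range(0, seq_len - window + 1, step))
    # Add a final window flush with the sequence end so the tail is covered.
    if starts[-1] + window < seq_len:
        starts.append(seq_len - window)
    return [(s, s + window) for s in starts]


spans = sliding_windows(2000)
print(len(spans), spans[0], spans[-1])  # 5 (0, 1024) (976, 2000)
```

Under these assumptions, a step smaller than the window means consecutive windows overlap (here by 774 bases), so most positions are embedded in more than one context.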

2. Use the trained DCGP model for prediction

python ./predict.py \
        --data ./sequence \
        --model ./models/ \
        --out ./results/pred \
        --batch_size 256
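
For readers unfamiliar with batched inference, the sketch below shows what a batch size of 256 typically means: inputs are processed in chunks of at most 256 items, with a smaller final chunk. This is a generic illustration, not the code in `predict.py`; `batched` is a hypothetical helper and the 600 inputs are made up.

```python
from itertools import islice


def batched(items, batch_size=256):
    """Yield consecutive lists of at most batch_size items (illustrative only)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# 600 hypothetical inputs split into batches of up to 256.
sizes = [len(b) for b in batched(range(600))]
print(sizes)  # [256, 256, 88]
```

A smaller batch size generally lowers peak memory use at the cost of more iterations, which may help if prediction runs out of memory.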

📜Copyright and License

This project is free to use for non-commercial purposes - see the LICENSE file for details.

👥Contacts

For more information, please contact Huihui Li (lihuihui@caas.cn).
