DeepFEPS is a high-performance bioinformatics platform for extracting advanced sequence-based features from DNA, RNA, and protein data. It integrates modern machine learning and deep learning techniques to transform raw biological sequences into rich numerical representations suitable for classification, clustering, and predictive modeling.
Each feature extractor below offers an advanced way of representing biological sequences — from sequence embedding models such as Word2Vec, FastText, and Doc2Vec, to Transformer-based architectures, Autoencoder-derived features, and Graph-based embeddings. These deep learning and graph representation techniques can capture complex sequence patterns and relationships beyond simple k-mer counts, enabling more powerful analysis for functional annotation, motif discovery, and predictive modeling.
Simply select the method that best fits your research goals, upload your sequences, configure the parameters, and download your processed features.
deepFEPS is a command‑line toolkit for extracting rich numerical features from DNA/RNA/protein sequences in FASTA format. It bundles five complementary extractors:
- 🧠 Autoencoder features — learned representations from k‑mer bag‑of‑words or fixed one‑hot encodings (PyTorch).
- 📄 Doc2Vec embeddings — PV‑DM / PV‑DBOW document embeddings over k‑mers (Gensim).
- 🕸️ Graph embeddings — DeepWalk / Node2Vec / Graph2Vec on k‑mer graphs (NetworkX + Gensim + KarateClub).
- 🤖 Transformer embeddings — ProtBERT / ESM2 / DNABERT / custom HF models (Hugging Face Transformers).
- 🔤 Word2Vec / FastText — train embeddings on k‑mers, pool to fixed‑size vectors (Gensim).
Outputs are CSV files (one row per sequence).
- 🐍 Python 3.10+ (tested with 3.10/3.11)
- 💻 OS: Linux or macOS recommended; Windows works but may require minor shell changes.
- 📦 Packages (managed via requirements.txt):
- gensim
- huggingface-hub
- karateclub
- networkx
- numpy
- torch
- tqdm
- transformers
- 🧮 CPU: All extractors work on CPU. Doc2Vec/Word2Vec/Graph are CPU‑friendly.
- ⚡ GPU (optional, recommended for Transformers/Autoencoder):
- 🧩 NVIDIA CUDA 11+ compatible GPU.
- 💾 ≥4 GB VRAM works for small transformer models (e.g., ESM2‑t6). 8–16 GB+ recommended for larger models and batched inference.
- 🐢 If you don’t have a GPU, everything runs on CPU; it will just be slower for Transformers.
💡 Tip: for servers without write access to default cache dirs, set a writable HF cache, e.g.:
export HF_HOME=/path/to/hf-cache
export HF_HUB_DISABLE_PROGRESS_BARS=1
conda env create -f environment.yml
conda activate deepfeps
# GPU alternative:
# conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
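After installation, a quick sanity check (a hypothetical snippet, not shipped with the repository) confirms that the core dependencies import and reports whether a GPU is visible:

```python
# Hypothetical post-install check; not part of deepFEPS itself.
import torch, gensim, networkx, transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("gensim:", gensim.__version__)
print("networkx:", networkx.__version__)
print("transformers:", transformers.__version__)
```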
python scripts/transformer_embeddings.py -i examples/sample.fasta -o features.csv -t protein --method esm2 --pool mean
All extractors take FASTA files (-i accepts one or more files). DNA/RNA/protein is controlled by -t {dna,protein}. Compressed FASTA (.gz) is supported where noted by the scripts.
Example (examples/sample.fasta):
>seq1
ACGTACGTACGTACGTACGTACGTACGTACGT
>seq2
ACGTACGTACGTACGTACGTACGTACGTACGT
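Most extractors first tokenize each sequence into overlapping k‑mers. A minimal sketch of that step, assuming a simple sliding window (the helper below is illustrative, not a function exported by the scripts):

```python
def kmer_tokens(seq: str, k: int = 3):
    """Return the overlapping k-mers of one sequence."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# "ACGTAC" with k=3 -> ['ACG', 'CGT', 'GTA', 'TAC']
print(kmer_tokens("ACGTAC", 3))
```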
Learn a compressed representation from either k‑mer bag‑of‑words or a fixed one‑hot encoding.
Basic usage
python scripts/autoencoder_features.py -i my.fasta -o autoenc.csv -t dna --encoding kmer-bow -k 3 --normalize l2 --epochs 10 --batch-size 64 --device auto
Common options
- --encoding kmer-bow|onehot-fixed — representation
- -k — k‑mer size (kmer-bow)
- --normalize none|l1|l2
- --hidden-dims (e.g., 512 128) and --latent-dim
- --act relu|gelu|tanh, --dropout, --batch-norm
- --epochs, --batch-size, --lr, --weight-decay
- --device auto|cpu|cuda
- Misc: --reverse-complement-merge, --uppercase, --minlen, --l2norm, --add-length, --progress
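Conceptually, the extractor compresses each sequence's k‑mer bag‑of‑words vector and keeps the latent layer as the feature vector. A minimal PyTorch sketch of that idea, with illustrative layer sizes (the script's actual architecture, set via --hidden-dims/--latent-dim, may differ):

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, n_features: int, latent_dim: int = 32):
        super().__init__()
        # encoder compresses the k-mer count vector; decoder reconstructs it
        self.encoder = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, n_features))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# x: (n_sequences, 4**k) bag-of-words counts (random here for illustration)
x = torch.rand(16, 64)
model = TinyAutoencoder(n_features=64)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(10):                       # a few epochs, for illustration
    recon, z = model(x)
    loss = nn.functional.mse_loss(recon, x)
    opt.zero_grad(); loss.backward(); opt.step()
features = model(x)[1].detach()           # latent vectors = extracted features
```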
Treat each sequence as a document of k‑mers; learn PV‑DM / PV‑DBOW embeddings.
Basic usage
python scripts/doc2vec_embeddings.py -i my.fasta -o d2v.csv -t dna -k 3 --dm 1 --vector-size 100 --window 5 --epochs 20
Common options
- --dm {1,0}, --dm-mean {1,0}, --dm-concat {1,0}
- --vector-size, --window, --epochs, --min-count
- --negative, --hs {1,0}, --sample, --seed, --workers
- --pretrained, --save-model, --infer-steps, --infer-alpha, --force-infer
- Misc: --reverse-complement-merge, --uppercase, --minlen, --l2norm, --add-length, --progress
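Internally this corresponds to Gensim's Doc2Vec trained on k‑mer "documents"; the sketch below illustrates the idea with a toy corpus and arbitrary parameters (not the script's exact pipeline):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def kmers(seq, k=3):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# each sequence becomes a TaggedDocument whose words are its k-mers
seqs = {"seq1": "ACGTACGTACGT", "seq2": "TTTTACGTACGA"}
corpus = [TaggedDocument(words=kmers(s), tags=[sid]) for sid, s in seqs.items()]

# dm=1 -> PV-DM, dm=0 -> PV-DBOW (mirrors the --dm flag)
model = Doc2Vec(corpus, dm=1, vector_size=50, window=5, min_count=1, epochs=20)
vec = model.dv["seq1"]                          # embedding of a training sequence
new = model.infer_vector(kmers("ACGTACGAAA"))   # embedding of an unseen sequence
```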
Build a k‑mer graph (nodes=k‑mers, edges by co‑occurrence / overlaps), then embed via DeepWalk / Node2Vec / Graph2Vec.
Basic usage
python scripts/graph_embeddings.py -i my.fasta -o graph.csv -t dna -k 3 --method deepwalk --vector-size 128 --window 5 --epochs 5
Common options
- --method deepwalk|node2vec|graph2vec
- Graph2Vec / skip‑gram params: --vector-size, --window, --epochs
- Misc: --reverse-complement-merge, --uppercase, --minlen, --add-length, --progress
Requires: networkx, gensim, karateclub.
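As an illustration of the idea (not the script's exact implementation), the sketch below builds a k‑mer graph with NetworkX, samples uniform random walks DeepWalk‑style, and trains a Gensim skip‑gram model on the walks:

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def kmers(seq, k=3):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# build a k-mer graph: edge between consecutive (overlapping) k-mers
G = nx.Graph()
for seq in ["ACGTACGTACGT", "TTTTACGTACGA"]:
    toks = kmers(seq)
    G.add_edges_from(zip(toks, toks[1:]))

# DeepWalk-style: uniform random walks, then skip-gram on the walk "sentences"
def random_walk(G, start, length=10):
    walk = [start]
    for _ in range(length - 1):
        nbrs = list(G.neighbors(walk[-1]))
        if not nbrs:
            break
        walk.append(random.choice(nbrs))
    return walk

walks = [random_walk(G, n) for n in G.nodes() for _ in range(5)]
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1, epochs=5)
node_vec = model.wv["ACG"]      # embedding of one k-mer node
```

A per‑sequence feature vector can then be obtained by pooling the vectors of the k‑mers that occur in that sequence; Graph2Vec (via KarateClub) instead embeds one graph per sequence directly.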
Extract contextual embeddings using Hugging Face models.
Basic usage (protein)
python scripts/transformer_embeddings.py -i my.fasta -o esm2.csv -t protein --method esm2 --pool mean --batch-size 4 --device cpu
Basic usage (DNA/DNABERT)
python scripts/transformer_embeddings.py -i my.fasta -o dnabert.csv -t dna --method dnabert --k-mer 6 --pool mean --batch-size 2 --device cpu
Common options
- --method protbert|esm2|dnabert|auto
- --model override HF model id/path (when --method auto)
- --k-mer (DNABERT)
- --pool cls|mean|mean+max, --batch-size, --device
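The same mean‑pooling logic, written directly against the Hugging Face API (a minimal sketch using one small ESM2 checkpoint; the script's batching, masking, and pooling details may differ):

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "facebook/esm2_t6_8M_UR50D"          # small ESM2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MEEPQSDPSV"]
batch = tokenizer(seqs, return_tensors="pt", padding=True)

with torch.no_grad():
    hidden = model(**batch).last_hidden_state    # (batch, tokens, dim)

# mean-pool over real tokens only (mask out padding), as with --pool mean
mask = batch["attention_mask"].unsqueeze(-1)
emb = (hidden * mask).sum(1) / mask.sum(1)       # (batch, dim) per-sequence vectors
```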
Train Word2Vec or FastText on k‑mers and pool per sequence.
Basic usage
python scripts/w2v_ft_embeddings.py -i my.fasta -o w2v.csv -t dna -k 3 --algo word2vec --vector-size 100 --window 5 --epochs 10 --pool mean
Common options
- --algo word2vec|fasttext, --vector-size, --window, --epochs
- --pool mean|sum|max|mean+max
- Misc: --reverse-complement-merge, --uppercase, --minlen, --l2norm, --add-length, --progress
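In essence, the script trains a Gensim model on k‑mer "sentences" and pools the k‑mer vectors per sequence. A compact sketch with illustrative parameters (swap in gensim.models.FastText to mirror --algo fasttext):

```python
import numpy as np
from gensim.models import Word2Vec

def kmers(seq, k=3):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

sentences = [kmers(s) for s in ["ACGTACGTACGT", "TTTTACGTACGA"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=10)

# mean-pool the k-mer vectors of one sequence into a fixed-size feature vector
def pool_mean(tokens, model):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

features = [pool_mean(s, model) for s in sentences]
```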
Output is a CSV with an id column (FASTA header), an optional class column, then the feature columns.
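For downstream modeling the CSV can be loaded directly, e.g. with pandas; the snippet below assumes the column layout just described and a file named features.csv:

```python
import pandas as pd

df = pd.read_csv("features.csv")
ids = df["id"].tolist()
# drop the non-feature columns that may be present
X = df.drop(columns=[c for c in ("id", "class") if c in df.columns]).to_numpy()
# X is (n_sequences, n_features), ready for scikit-learn, PyTorch, etc.
```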
- 🎲 Set --seed where available.
- 🧱 For long sequences with BERT‑style models (positional limit ~512), pre‑truncate or run windows externally (see the sketch after this list).
- 🗂️ Configure the HF cache on servers (see the tip above).
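One hypothetical way to pre‑window long sequences before a BERT‑style model (window size and overlap below are arbitrary; embed each window separately, then pool the window embeddings):

```python
def windows(seq, size=510, overlap=50):
    """Split a long sequence into overlapping windows that fit a ~512-token limit."""
    step = size - overlap
    return [seq[i:i + size] for i in range(0, max(len(seq) - overlap, 1), step)]

# e.g. write each window as its own FASTA record, embed, then average per original id
```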
If you use deepFEPS in your research or teaching, please cite it as follows:
@software{deepfeps_2025,
author = {Ismail, Hamid},
title = {deepFEPS: Deep Learning-Oriented Feature Extraction for Biological Sequences},
year = {2025},
url = {https://github.com/hamiddi/deepFEPS}
}
MIT (see LICENSE).