This repository contains scripts for downloading and preparing reference annotations and genomes from the NCBI RefSeq database. The processed data is used to train models in the mach-1.5-savanna repository.
- `scripts/`: Contains all preprocessing and utility scripts, including:
  - Downloading reference genomes and annotations (`download_references.py`)
  - Preparing datasets with or without UTRs (`prepare_refseq*.R`)
  - Generating phylogenetic codes, computing stats, and sampling subsets
  - Splitting data and converting it to `.jsonl.zst` format (`csv_gz_to_jsonl_zst.sh`)
- `data/refseq/`: Directory for storing downloaded and processed RefSeq data, organized into `annotation/`, `genome/`, `metadata/`, `sequences/`, etc.
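
To illustrate the conversion step, here is a minimal Python sketch of turning a gzipped CSV into JSON Lines. It is not the repository's `csv_gz_to_jsonl_zst.sh` script, whose internals are not described here. The function name `csv_gz_to_jsonl` is hypothetical, and the sketch assumes the CSV has a header row and that each row maps to one JSON object.

```python
import csv
import gzip
import json
from pathlib import Path


def csv_gz_to_jsonl(src: Path, dst: Path) -> int:
    """Write each row of a gzipped CSV as one JSON object per line.

    Hypothetical helper for illustration; returns the number of rows written.
    Assumes the first CSV row is a header that names the fields.
    """
    count = 0
    with gzip.open(src, "rt", newline="", encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for row in csv.DictReader(fin):
            # All values stay strings; no type conversion is attempted.
            fout.write(json.dumps(row) + "\n")
            count += 1
    return count
```

The resulting `.jsonl` file can then be compressed with the `zstd` command-line tool (for example `zstd out.jsonl`, which writes `out.jsonl.zst`).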
The final output is a split dataset in .jsonl.zst compressed format compatible with the Savanna training framework.
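
This README does not say how the split is made. As one possible approach, the sketch below assigns each record to a split by hashing its identifier, which keeps the assignment stable across runs. The function name `assign_split`, the split names, and the default fractions are all assumptions for illustration, not the repository's actual settings.

```python
import hashlib


def assign_split(record_id: str, val_frac: float = 0.05, test_frac: float = 0.05) -> str:
    """Map a record ID to 'train', 'val', or 'test' using a stable hash.

    Hypothetical helper: split names and fractions are illustrative only.
    """
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"
```

Because the assignment depends only on the identifier, re-running the pipeline on an updated download would keep existing records in the same split.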
For training with Mach-1.5, see goodarzilab/mach-1.5-savanna.