
Scripts to download, preprocess, and compress NCBI RefSeq annotations and genomes for training Mach-1.5 models.

csglab/mach-1.5-prep


mach-1.5-prep

This repository contains scripts for downloading and preparing reference annotations and genomes from the NCBI RefSeq database. The processed data is used to train models in the mach-1.5-savanna repository.

Overview

  • scripts/: Contains all preprocessing and utility scripts, including:
    • Downloading reference genomes and annotations (download_references.py)
    • Preparing datasets with or without UTRs (prepare_refseq*.R)
    • Generating phylogenetic codes, computing stats, and sampling subsets
    • Splitting data and converting it to .jsonl.zst format (csv_gz_to_jsonl_zst.sh)
  • data/refseq/: Directory for storing downloaded and processed RefSeq data, organized into:
    • annotation/, genome/, metadata/, sequences/, etc.
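The conversion step named above (csv_gz_to_jsonl_zst.sh) is a shell script whose contents are not shown here. As a rough illustration of the idea only, the following Python sketch turns a gzip-compressed CSV into JSON Lines and optionally writes it with Zstandard compression. The function names, the column names in the usage note, and the use of the third-party `zstandard` package are assumptions, not the repository's actual implementation.

```python
import csv
import gzip
import json
from typing import Iterable, Iterator


def csv_gz_to_jsonl_lines(path: str) -> Iterator[str]:
    """Yield one JSON object per CSV row, using the header row as keys."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            yield json.dumps(row)


def write_jsonl_zst(lines: Iterable[str], out_path: str, level: int = 3) -> None:
    """Write lines to a Zstandard-compressed file, one JSON object per line.

    Assumption: the third-party `zstandard` package is installed.
    It is imported lazily so the CSV-to-JSONL step works without it.
    """
    import zstandard

    compressor = zstandard.ZstdCompressor(level=level)
    with open(out_path, "wb") as raw, compressor.stream_writer(raw) as writer:
        for line in lines:
            writer.write((line + "\n").encode("utf-8"))
```

A hypothetical call would be `write_jsonl_zst(csv_gz_to_jsonl_lines("sequences.csv.gz"), "sequences.jsonl.zst")`; the file names are placeholders. Streaming row by row keeps memory use flat, which matters for genome-scale tables.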

Output Format

The final output is a split dataset stored as Zstandard-compressed JSON Lines files (.jsonl.zst), the format expected by the Savanna training framework.
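How the repository partitions records is not described here. One common way to produce a reproducible split is to hash a stable record identifier, as in the sketch below. The split names ("train", "valid", "test"), the default fractions, and the `assign_split` function are all hypothetical; the actual scripts may split differently, for example by species or phylogeny.

```python
import hashlib


def assign_split(record_id: str, valid_frac: float = 0.05, test_frac: float = 0.05) -> str:
    """Map a record identifier to a split name in a deterministic way.

    The split names and default fractions are illustrative assumptions.
    """
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_frac:
        return "test"
    if u < test_frac + valid_frac:
        return "valid"
    return "train"
```

Because the assignment depends only on the identifier, rerunning the pipeline places each record in the same split, and each split can then be written to its own .jsonl.zst file.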


For training with Mach-1.5, see goodarzilab/mach-1.5-savanna.
