mach-1.5-prep

This repository contains scripts for downloading reference genomes and annotations from the NCBI RefSeq database and preparing them as training data. The processed data is used to train models with the code in the mach-1.5-savanna repository.

Overview

- scripts/: Contains all preprocessing and utility scripts, including:
  - Downloading reference genomes and annotations (download_references.py)
  - Preparing datasets with or without UTRs (prepare_refseq*.R)
  - Generating phylogenetic codes, computing stats, and sampling subsets
  - Splitting data and converting it to .jsonl.zst format (csv_gz_to_jsonl_zst.sh)
- data/refseq/: Directory for storing downloaded and processed RefSeq data, organized into:
  - annotation/, genome/, metadata/, sequences/, etc.
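
This README does not describe how the data is split. As an illustrative sketch only, not the repository's actual method, one common approach is to assign each record to a split by hashing a stable identifier, so the assignment stays reproducible across runs. The function name `assign_split`, the split names, and the default fractions are all assumptions made for this example.

```python
import hashlib


def assign_split(record_id: str, val_frac: float = 0.05, test_frac: float = 0.05) -> str:
    """Deterministically map a record identifier to 'train', 'val', or 'test'.

    Hypothetical helper: the identifier format and the fractions are
    assumptions, not values taken from this repository.
    """
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Use 48 bits of the hash so the ratio is exact in floating point
    # and always falls in the half-open interval [0, 1).
    u = int.from_bytes(digest[:6], "big") / 2**48
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


if __name__ == "__main__":
    # Placeholder identifiers; real records would use their own stable IDs.
    for rid in ["example_id_1", "example_id_2", "example_id_3"]:
        print(rid, assign_split(rid))
```

Because the assignment depends only on the identifier, re-running the preparation on an updated download keeps previously seen records in the same split.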

Output Format

The final output is a split dataset in .jsonl.zst compressed format compatible with the Savanna training framework.
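
To make the format concrete, the sketch below converts a gzip-compressed CSV into JSON Lines, with one JSON object per input row. It is a minimal Python illustration under stated assumptions, not a port of csv_gz_to_jsonl_zst.sh: the function name `csv_gz_to_jsonl` is hypothetical, the column names come from whatever header the CSV carries, and every value is kept as a string.

```python
import csv
import gzip
import json


def csv_gz_to_jsonl(src_path: str, dst_path: str) -> int:
    """Write each row of a gzipped CSV as one JSON object per line.

    Hypothetical helper for illustration. Returns the number of rows written.
    """
    count = 0
    with gzip.open(src_path, "rt", newline="", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # DictReader takes the field names from the CSV header row.
        for row in csv.DictReader(src):
            dst.write(json.dumps(row) + "\n")
            count += 1
    return count
```

The resulting .jsonl file can then be compressed with the zstd command-line tool (for example, `zstd train.jsonl` writes `train.jsonl.zst`, where the file name is a placeholder).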

For training with Mach-1.5, see goodarzilab/mach-1.5-savanna.
