
at-cg/HALE

Introduction

HALE (Haplotype-Aware Long-read Error correction) is a haplotype-aware error correction tool for long reads. It works with both ONT Simplex and PacBio HiFi data.

Requirements

  • Linux OS (tested on RHEL 8 and Ubuntu 22.04)

  • Rust toolchain (install via rustup, the recommended way to get Rust)

  • Python 3.1 or above (and conda) for data preprocessing

  • Make sure the following system packages are installed (Linux):

    • build-essential, autoconf, libtool, pkg-config

    Note: On most Linux systems these packages are already installed, especially if development tools have been set up previously.
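
For example, on Ubuntu/Debian the toolchain and packages can be set up as follows (a minimal sketch; package names and the installer differ on RHEL, where the rough equivalent is dnf groupinstall "Development Tools"):

# Install the Rust toolchain via rustup (official installer)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install the required system packages (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install -y build-essential autoconf libtool pkg-config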

Try HALE on Small Test Data

The entire test workflow below takes about 6-8 minutes. Users can either run the commands one by one or copy them into an executable script.

# Install HALE 
git clone https://github.com/at-cg/HALE.git
cd HALE && RUSTFLAGS="-Ctarget-cpu=native" cargo build -q --release

# Create conda env
conda env create --file scripts/hale-env.yml
conda activate hale

mkdir -p test_run && cd test_run/

# download small test dataset
wget -O HG002.chr19_10M_12M.fastq.gz "https://zenodo.org/records/14048797/files/HG002.chr19_10M_12M.fastq.gz?download=1"

# Run hale correction
../target/release/hale correct --reads HG002.chr19_10M_12M.fastq.gz --threads 16 --depth 40 --ploidy 2 --tech ont

For large inputs, users are recommended to increase the thread count according to the number of cores available. For reference, HALE takes about 44 minutes for all-vs-all overlap and 12 minutes for correction, consuming ~90 GB RAM, when processing a 60x HiFi dataset of chr9 from the HG002 human genome with 64 threads on a CPU-based Perlmutter node.

Installation

  1. Clone the repository:
git clone https://github.com/at-cg/HALE.git
  2. Compile the source code:
cd HALE
RUSTFLAGS="-Ctarget-cpu=native" cargo build -q --release
  3. Create the conda environment:
conda env create --file scripts/hale-env.yml
conda activate hale
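
As a quick sanity check after these steps (a minimal sketch; paths assume you are still in the HALE directory):

# Confirm the release binary was built
ls -lh target/release/hale

# Confirm the hale conda environment was created
conda env list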

Usage

$hale_dir/target/release/hale correct --reads <input_fastq/input_fastq.gz> --threads <num_of_threads> --depth <dataset_depth> --ploidy <genome_ploidy> --tech <hifi/ont>

Output file: hale_corrected_<input_fastq/input_fastq.gz> will be created in the same directory.

Note:

  • $hale_dir is the path to the HALE directory
  • --depth specifies the dataset depth (default: 60x) and --ploidy specifies the genome ploidy (default: 2)
  • Temporary alignment files are removed automatically upon successful completion.
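
As a concrete sketch, with sample.fastq.gz as a hypothetical input file:

# Produces hale_corrected_sample.fastq.gz in the same directory as the input
$hale_dir/target/release/hale correct --reads sample.fastq.gz --threads 32 --depth 40 --ploidy 2 --tech ont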

Implementation Notes

Automatic pipeline selection

HALE automatically selects the appropriate correction pipeline based on the sequencing technology:

  • PacBio HiFi

    • Single round of all-vs-all overlap
    • One HALE correction step
  • ONT Simplex

    • Three rounds of all-vs-all overlap
    • Two pre-correction rounds (pih mode)
    • One final correction round (hale mode)

No additional flags are required beyond --tech.
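
For instance, switching between the two pipelines is purely a matter of the --tech flag (a sketch; reads.fastq.gz is a placeholder):

# HiFi: one all-vs-all overlap round, one hale correction step
$hale_dir/target/release/hale correct --reads reads.fastq.gz --threads 32 --depth 60 --ploidy 2 --tech hifi

# ONT Simplex: three overlap rounds, two pih pre-correction rounds, one final hale round
$hale_dir/target/release/hale correct --reads reads.fastq.gz --threads 32 --depth 40 --ploidy 2 --tech ont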

Dependencies (handled via conda)

The conda environment installs:

  • minimap2
  • seqkit
  • samtools
  • Python dependencies for preprocessing
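
Once the environment is active, a quick check that the tools are on PATH (version numbers will vary):

conda activate hale
minimap2 --version
seqkit version
samtools --version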

Acknowledgement

This work leverages components of HERRO, developed by Stanojevic et al. (2024). While we designed a new algorithm independent of HERRO's deep learning approach, we adopted key preprocessing steps, such as Minimap2 alignment, and the post-processing for consensus generation, with a few modifications.
