HALE (Haplotype-Aware Long-read Error correction) is a haplotype-aware error correction tool designed for long reads. It works with both ONT Simplex and PacBio HiFi data.
- Linux OS (tested on RHEL 8 and Ubuntu 22.04)
- Rust (install using rustup, the recommended way to get the Rust toolchain)
- Python 3.1 or above (and conda) for data preprocessing
Make sure the following system packages are installed (Linux):
- build-essential, autoconf, libtool, pkg-config
Note: On most Linux systems these packages are already installed, especially if development tools have been set up previously.
The entire test workflow below takes about 6-8 minutes. You can either run the commands one by one or copy them into an executable script.
```bash
# Install HALE
git clone https://github.com/at-cg/HALE.git
cd HALE && RUSTFLAGS="-Ctarget-cpu=native" cargo build -q --release

# Create conda env
conda env create --file scripts/hale-env.yml
conda activate hale

mkdir -p test_run && cd test_run/

# Download small test dataset (URL quoted so the shell does not interpret the '?')
wget -O HG002.chr19_10M_12M.fastq.gz "https://zenodo.org/records/14048797/files/HG002.chr19_10M_12M.fastq.gz?download=1"

# Run HALE correction
../target/release/hale correct --reads HG002.chr19_10M_12M.fastq.gz --threads 16 --depth 40 --ploidy 2 --tech ont
```
For large inputs, increase the thread count according to the number of cores available. With 64 threads on a multicore Perlmutter CPU-based node, HALE takes about 44 minutes for the all-vs-all overlap and 12 minutes for correction, consuming ~90 GB RAM, to process a 60x HiFi dataset of human chr9 (HG002).
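The `--depth` value passed to HALE is the dataset coverage, i.e. total sequenced bases divided by genome size. A minimal pure-shell sketch of that arithmetic; the byte counts below are illustrative assumptions, not values from this document (in practice, get the total base count from your reads with a tool such as `seqkit stats`):

```bash
# Estimate coverage for --depth: total read bases / genome size.
total_bases=124000000000    # ~124 Gbp of reads (assumed)
genome_size=3100000000      # ~3.1 Gbp human genome (approximate)
depth=$((total_bases / genome_size))
echo "use --depth $depth"   # → use --depth 40
```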
- Clone the repository:

```bash
git clone https://github.com/at-cg/HALE.git
```

- Compile the source code:

```bash
cd HALE
RUSTFLAGS="-Ctarget-cpu=native" cargo build -q --release
```

- Create the conda env:

```bash
conda env create --file scripts/hale-env.yml
conda activate hale
```

- Run correction:

```bash
$hale_dir/target/release/hale correct --reads <input_fastq/input_fastq.gz> --threads <num_of_threads> --depth <dataset_depth> --ploidy <genome_ploidy> --tech <hifi/ont>
```

The output file `hale_corrected_<input_fastq/input_fastq.gz>` will be created in the same directory.
Note:
- `$hale_dir` is the path to the HALE directory.
- `--depth` is the dataset depth (default 60x); `--ploidy` is the ploidy of the genome (default 2).
- Temporary alignment files are removed automatically upon successful completion.
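The invocation above can be wrapped in a short script so the placeholders become variables. This is a sketch, not part of HALE: `hale_dir`, the reads file name, and the parameter values are all assumptions, and the `echo` makes it a dry run.

```bash
#!/usr/bin/env bash
# Dry-run sketch of a parameterized hale invocation; all values are placeholders.
hale_dir="${hale_dir:-$HOME/HALE}"   # path to the HALE checkout (assumed)
reads="sample.fastq.gz"              # input reads (assumed)
cmd="$hale_dir/target/release/hale correct --reads $reads --threads 16 --depth 60 --ploidy 2 --tech hifi"
echo "$cmd"   # prints the command; run it directly once the paths are real
```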
HALE automatically selects the appropriate correction pipeline based on the sequencing technology:
- PacBio HiFi
  - Single round of all-vs-all overlap
  - One HALE correction step
- ONT Simplex
  - Three rounds of all-vs-all overlap
  - Two pre-correction rounds (`pih` mode)
  - One final correction round (`hale` mode)

No additional flags are required beyond `--tech`.
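The dispatch above can be restated as a small shell sketch. This is illustrative only; HALE performs this selection internally from `--tech`, and the round counts are simply the ones listed above.

```bash
# Illustrative restatement of the per-technology pipeline choice.
tech="ont"   # or "hifi"
case "$tech" in
  hifi) plan="1x all-vs-all overlap, 1x hale correction" ;;
  ont)  plan="3x all-vs-all overlap, 2x pih pre-correction, 1x hale correction" ;;
  *)    echo "unknown --tech value: $tech" >&2; exit 1 ;;
esac
echo "$plan"
```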
The conda environment installs:
- minimap2
- seqkit
- samtools
- Python dependencies for preprocessing
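After `conda activate hale`, a quick sanity check (a sketch, not part of HALE) confirms that the tools listed above resolve on `PATH`:

```bash
# Report whether each tool installed by the conda env is on PATH.
report=""
for tool in minimap2 seqkit samtools; do
  if command -v "$tool" >/dev/null 2>&1; then
    report="$report $tool:ok"
  else
    report="$report $tool:missing"
  fi
done
echo "$report"
```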
This work leverages components of HERRO, developed by Stanojevic et al. (2024). While we designed a new algorithm independent of HERRO's deep learning approach, we adopted key preprocessing steps such as Minimap2 alignment, as well as post-processing for consensus generation, with a few modifications.