Skip to content

A general technique to reduce memory consumption and improve the performance of gap-affine and dual gap-affine sequence alignment algorithms

Notifications You must be signed in to change notification settings

LorienLV/singletrack

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Singletrack

Given two biological sequences, a query sequence (q) of length n and a target sequence (t) of length m, the goal of sequence alignment is to find a sequence of operations that maps one sequence to another while minimizing a given cost function. Classical dynamic programming (DP) algorithms solve this problem by computing and storing multiple DP matrices, three for gap-affine and five for dual gap-affine alignments, each of size (n+1) × (m+1). These matrices are later traced back to recover the optimal alignment.

This repository contains implementations of the Singletrack backtrace algorithm presented here. Singletrack enables efficient backtracking in gap-affine and dual gap-affine sequence alignment requiring storing only a single DP matrix while preserving exact alignment results. This reduces memory usage by 3× for gap-affine and 5× for dual gap-affine alignments and improves memory hierarchy utilization on modern hardware, often translating into faster execution.

Implementations

The repository provides three implementations of Singletrack, each with its own README explaining usage:

  • basic: A basic C++ implementation of the classical dynamic programming algorithm with and without using Singletrack.
  • WFA2-lib: Singletrack integrated into WFA2-lib, commit 2ec28919af3a1b545acfb38e9bdefd160a87f266.
  • KSW2: Singletrack integrated into KSW2, commit 289609bd9e5381a13b16239d0a7703f1ff03f9ca.

Datasets

The datasets used in our paper are available in Zenodo.

All aligners in this repository expect a dataset file containing one sequence per line, organized as pairs. The first sequence in each pair (target) starts with >, and the second (query) starts with <. For example:

>AA
<AG
>ACAC
<CTG

Generate a Synthetic Dataset

This repository includes a tool for generating synthetic sequence datasets, located in tools/python.

Usage example:

python3 tools/python/generate_dataset.py -n 1000 -l 100 -e 0.1

This command generates a dataset of 1,000 pairs of DNA sequences, each approximately 100 bases long, with an error rate of 10%. The error rate indicates the proportion of edit operations (insertions, deletions, or substitutions) needed to transform one sequence into its paired sequence.

Cite Us

Lorién López-Villellas, Cristian Iñiguez, Albert Jiménez-Blanco, Quim Aguado-Puig, Miquel Moretó, Jesús Alastruey-Benedé, Pablo Ibáñez, Santiago Marco-Sola. Singletrack: An Algorithm for Improving Memory Consumption and Performance of Gap-Affine Sequence Alignment, bioRXiv, 2025.

@article{LpezVillellas2025,
  title = {Singletrack: An Algorithm for Improving Memory Consumption and Performance of Gap-Affine Sequence Alignment},
  url = {http://dx.doi.org/10.1101/2025.10.31.685625},
  DOI = {10.1101/2025.10.31.685625},
  publisher = {Cold Spring Harbor Laboratory},
  author = {López-Villellas,  Lorién and Iñiguez,  Cristian and Jiménez-Blanco,  Albert and Aguado-Puig,  Quim and Moretó,  Miquel and Alastruey-Benedé,  Jesús and Ibáñez,  Pablo and Marco-Sola,  Santiago},
  year = {2025},
  month = nov
}

About

A general technique to reduce memory consumption and improve the performance of gap-affine and dual gap-affine sequence alignment algorithms

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •