Releases: bede/deacon

0.13.2

21 Nov 14:17

Added

  • deacon filter accepts --rename-random, which anonymises read names using both an incrementing integer and a random 64-bit integer so that names are practically guaranteed to be unique. This mitigates a reported issue in which identical read names across separate files caused problems during upload to the European Nucleotide Archive (#67, @nick297).
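
The renaming idea can be pictured with the minimal sketch below, which pairs an incrementing counter with a random 64-bit integer. The function name `rename_reads` and the `counter_hex` name layout are assumptions made for illustration; the exact format Deacon writes is not described in these notes.

```python
import secrets


def rename_reads(n_reads):
    """Yield one anonymised name per read (hypothetical name layout)."""
    for i in range(1, n_reads + 1):
        # secrets.randbits(64) draws a uniformly random 64-bit integer.
        random_part = secrets.randbits(64)
        # Assumed layout: counter, underscore, 16 hex digits.
        yield f"{i}_{random_part:016x}"


names = list(rename_reads(3))
```

The counter keeps names distinct within one file, while the random part makes a collision between separately processed files very unlikely.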

Changed

  • Groundwork for ensuring x86 Bioconda binaries are always built for the most portable x86-64-v3 (AVX2) target supporting all AMD and Intel CPUs released in the last decade (#71).
  • Use ensure_simd for compilation and runtime checks for SIMD (@RagnarGrootKoerkamp).
  • The crate minreq now uses the rustls backend, enabling native compilation on a wider range of Linux systems. Implements a simple workaround for the previously described rustls & ring issue on macOS ARM runners.

0.13.1

14 Nov 20:39

deacon index fetch now uses minreq rather than ureq to download indexes, removing the dependency on rustls & ring. That dependency caused a curious build error on Bioconda arm64 macOS runners, which is why 0.13.0 was released via Cargo only.

0.13.0

11 Nov 20:23

Added

  • Command deacon index fetch for downloading prebuilt indexes by name. If no index name is specified, the panhuman-1 index is downloaded by default.
  • Parallel gzip compression of output files with automatic ~1:1 thread allocation between filtering and compression tasks if .gz extensions for --output (-o) or --output2 (-O) arguments are detected.
    • Automatic thread allocation can be overridden using the new --compression-threads argument.
    • ~3x faster filtering when reading and writing gzip-compressed Illumina FASTQs.
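
As a rough illustration of the ~1:1 split described above, the sketch below divides a thread budget between filtering and compression. The function `split_threads` and its rounding behaviour are assumptions for illustration, not Deacon's actual allocation logic.

```python
def split_threads(total_threads, compression_threads=None):
    """Return (filter_threads, compression_threads) for a thread budget.

    Hypothetical helper: assumes an even split unless the caller supplies
    an explicit compression thread count.
    """
    if compression_threads is None:
        # Roughly half the budget goes to compression, at least one thread.
        compression_threads = max(1, total_threads // 2)
    # Whatever remains is used for filtering, again at least one thread.
    filter_threads = max(1, total_threads - compression_threads)
    return filter_threads, compression_threads


default_split = split_threads(8)
manual_split = split_threads(8, compression_threads=2)
```

Passing an explicit count mirrors the role of --compression-threads: the caller fixes the compression share and filtering takes the remainder.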

0.12.0

16 Oct 14:40

Added

  • Command deacon index intersect for finding the intersection of two or more minimizer indexes.
  • Command deacon index dump for extracting minimizers from an index as plain text FASTA.
  • Command deacon cite showing citation info.
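
Conceptually, intersecting indexes and dumping minimizers as FASTA can be sketched as below. Indexes are modelled as plain Python sets of minimizer strings, and the numeric FASTA headers are an assumption; neither reflects Deacon's on-disk format or its actual output headers.

```python
def intersect_indexes(indexes):
    """Return the minimizers present in every index (two or more sets)."""
    if len(indexes) < 2:
        raise ValueError("need at least two indexes to intersect")
    return set(indexes[0]).intersection(*indexes[1:])


def dump_fasta(minimizers):
    """Render minimizers as FASTA text with hypothetical numeric headers."""
    lines = []
    # Sort so that the output is deterministic.
    for i, seq in enumerate(sorted(minimizers), start=1):
        lines.append(f">{i}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


a = {"ACGTA", "TTGCA", "GGGAC"}
b = {"TTGCA", "GGGAC", "CCCAT"}
c = {"GGGAC", "TTGCA", "AAAAA"}
shared = intersect_indexes([a, b, c])
fasta_text = dump_fasta(shared)
```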

Changed

  • Graceful handling of empty compressed input files (@pmenzel).
  • Fixes bug where --debug mode incorrectly showed the complement of the hitting minimizer.
  • Uses paraseq 0.4.3, addressing a bug identified in paraseq 0.4.2 causing FASTQ records without a trailing newline byte to be ignored.
  • 2x increase in filtering throughput on arm64 / macOS systems, enabled by a series of optimisations in the latest versions of the libraries packed-seq, seq-hash and simd-minimizers (@RagnarGrootKoerkamp).

0.11.0

07 Oct 16:02

Incorporates new features, fixes and performance optimisations. Includes many PRs from @RagnarGrootKoerkamp, taking advantage of new features in simd-minimizers, packed-seq and paraseq (@imartayan, @noamteyssier) for improved correctness when handling sequences containing non-ACGT nucleotides. Filtering results are otherwise identical to 0.10.0. This release introduces a new index format that is smaller on disk, faster to load, and stores k-mers directly rather than k-mer hashes, providing flexibility for future performance optimisation without breaking index compatibility.

Added

  • Local (socket) server mode, enabling successive filter commands to be handled by a persistent server process for low latency filtering (@RagnarGrootKoerkamp).
  • Support for longer k-mers of up to length 61, where k+w ≤ 96 (packed-seq 4.1.1).
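
The stated limits on k and w can be expressed as a simple check. The function name `within_kmer_limits` is hypothetical, and the sketch encodes only the two bounds given above (k ≤ 61 and k+w ≤ 96); any further constraints Deacon applies are not modelled.

```python
MAX_K = 61          # longest supported k-mer, per the note above
MAX_K_PLUS_W = 96   # upper bound on k + w, per the note above


def within_kmer_limits(k, w):
    """Return True if k and w satisfy the two documented bounds."""
    return 1 <= k <= MAX_K and w >= 1 and k + w <= MAX_K_PLUS_W


largest_k_ok = within_kmer_limits(61, 35)    # k + w == 96
window_too_wide = within_kmer_limits(61, 36)  # k + w == 97
k_too_long = within_kmer_limits(62, 10)
```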

Changed

  • Faster filtering (x86_64)
  • Faster paired read filtering, particularly from separate input files which are now decompressed in parallel.
  • Faster indexing and index loading
  • While minimizers containing non-ACGT nucleotides were already discarded, minimizer selection could still be influenced by non-ACGT nucleotides present in the window, occasionally impacting results. Enabled by changes in simd-minimizers ≥ 2.0, entire windows containing non-ACGT nucleotides are discarded. Records containing non-ACGT nucleotides may therefore be classified differently to prior Deacon versions.
  • Redesigned index format (v3).
    • Index now stores 'concrete' k-mers using 2*k bits rounded up to the nearest byte boundary, rather than always using 64-bit xxh3 k-mer hashes.
    • Eliminates [tiny] risk of false positive matches caused by xxh3 collisions.
    • Serialised k-mers are byte-aligned, balancing efficient storage and deserialisation speed.
    • Index disk footprint reduced by 10%.
    • Paves way for painless future adoption of faster HashSet implementations.
    • Paves way for future index introspection functionality.
  • RapidHashSet (rapidhash::fast) replaces combined use of xxHash (xxh3) and FxHashSet
  • Fails gracefully given empty input files (#46) @pmenzel.
  • Bugfix for paired read I/O (#60) @funnell.
  • Feature gating for reduced compile times @RagnarGrootKoerkamp.
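
The byte-aligned storage of 'concrete' k-mers in the v3 index can be illustrated with the sketch below, which packs each base into 2 bits and rounds the total up to whole bytes. The base-to-bits mapping, the little-endian byte order and the function names are assumptions for illustration; the encoding actually used by packed-seq and Deacon may differ.

```python
# Assumed 2-bit encoding; the real mapping may differ.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def packed_size_bytes(k):
    """Bytes needed for 2*k bits, rounded up to a byte boundary."""
    return (2 * k + 7) // 8


def pack_kmer(kmer):
    """Pack an ACGT-only k-mer into a byte-aligned byte string."""
    value = 0
    for base in kmer:
        # Shift earlier bases left and append this base's two bits.
        value = (value << 2) | ENCODING[base]
    return value.to_bytes(packed_size_bytes(len(kmer)), "little")


packed = pack_kmer("ACGT")
```

Under this scheme a k-mer of length 61 occupies 16 bytes, and shorter k-mers take proportionally fewer, which is the property the byte-boundary rounding describes.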

Removed

  • Removed --capacity argument, which was easily misused for little performance benefit.

0.10.0

01 Sep 15:25

0.9.0

15 Aug 18:19

Performance optimisations (#27) deliver up to 80% faster filtering with unchanged accuracy (@RagnarGrootKoerkamp).

  • >2Gbp/s with uncompressed long read input.
  • >500Mbp/s with gzip-compressed long read input.

0.8.1

14 Aug 15:22

  • Fixes bug handling paired reads introduced in 0.8.0 which could lead to mispaired read output (@KaibondChau)
  • Fixes bug handling multiline FASTA input introduced in 0.8.0 (@RagnarGrootKoerkamp)

0.8.0

11 Aug 15:11

  • Faster filtering on multicore systems through improved work allocation using the Paraseq library (@noamteyssier). Filtering at >1Gbp/s is possible with uncompressed long sequences, and >500Mbp/s is achievable on many systems with Gzip-compressed long reads. Filtering Illumina reads is roughly twice as fast as before in my testing at ~200Mbp/s.
  • Added independent absolute (-a) and relative (-r) match thresholds with respective default values of 2 and 0.01 (1%). The new default relative threshold improves search specificity for long sequences over the previous absolute-only default threshold, without affecting short read accuracy. These replace the previous dual purpose -m parameter which could accept either an absolute (integer) threshold or a relative (float) threshold.
  • Minimizers containing ambiguous nucleotides are now ignored.
  • deacon index now offers the ability to discard minimizers with information content below a specified scaled Shannon --entropy (-e) threshold. This is disabled by default.
  • deacon filter now has a --debug mode which prints all records with minimizer matches to stderr including the matched minimizer sequence(s).
  • The default worst-case hash table capacity preallocation used in deacon index union operations can now be overridden with the new --capacity (-c) argument, in similar fashion to deacon index build.
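
To show why a relative threshold helps with long sequences, the sketch below applies the absolute (-a) and relative (-r) thresholds together. It assumes a record matches only when both thresholds are met, which these notes do not state explicitly; the function name `is_match` is hypothetical.

```python
def is_match(hits, total_minimizers, abs_threshold=2, rel_threshold=0.01):
    """Return True if a record counts as a match.

    Assumption: both the absolute and the relative threshold must be met.
    """
    if total_minimizers == 0:
        return False  # nothing to match against
    enough_hits = hits >= abs_threshold
    enough_fraction = hits / total_minimizers >= rel_threshold
    return enough_hits and enough_fraction


short_read = is_match(hits=2, total_minimizers=50)          # 4% of minimizers hit
long_read_sparse = is_match(hits=2, total_minimizers=1000)  # 0.2% hit
long_read_dense = is_match(hits=20, total_minimizers=1000)  # 2% hit
```

With the defaults, two hits suffice for a short read, whereas a long read with 1000 minimizers needs at least 10 hits, so a couple of chance hits no longer trigger a match.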

0.7.0

08 Jul 13:32
2a973f0

  • Deacon now uses the recently added simd-minimizers::iter_canonical_minimizer_values(), increasing filtering speed by up to 50% on Linux/x86_64 systems. Speeds of 1Gbp/s have been observed with uncompressed FASTA input. Thanks @RagnarGrootKoerkamp for a PR and improvements to simd-minimizers.
    • Index format is now version 2. Existing indexes must be rebuilt for use with this version. A new version of the panhuman-1 index is available from Zenodo and object storage. Attempting to load an incompatible index throws an error.
  • deacon index diff can now accept a fastx file or stream in place of a second index. This enables index masking using massive sequence collections without the need to first index them.
  • Position-dependent IUPAC ambiguous base canonicalisation was replaced with a simpler and faster fixed mapping, meaning that records containing ambiguous IUPAC bases may be classified differently to before.
  • deacon index union now automatically preallocates the required hash table capacity, eliminating slowdowns when combining indexes.
  • Compatibility of the minimizer parameters k and w is now validated (k+w-1 must be odd) prior to indexing.
  • Default index capacity is now 400M (was 500M).
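
The k and w compatibility rule mentioned above amounts to a parity check on the window length k+w-1. The sketch below is an illustrative stand-in with a hypothetical function name and error message, not Deacon's own validation code.

```python
def validate_k_w(k, w):
    """Return the window length k+w-1, raising if it is even."""
    window_length = k + w - 1
    if window_length % 2 == 0:
        raise ValueError(
            f"k+w-1 must be odd, got {window_length} for k={k}, w={w}"
        )
    return window_length


valid_length = validate_k_w(31, 15)  # 31 + 15 - 1 = 45, which is odd
```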