Releases · bede/deacon

21 Nov 14:17

bede

0.13.2

0b34ddf

0.13.2 Latest

Added

deacon filter now accepts --rename-random, which anonymises read names using both an incrementing integer and a random 64-bit integer to practically assure uniqueness. This mitigates a reported issue in which identical read names across separate files caused problems during upload to the European Nucleotide Archive (#67, @nick297).
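
The renaming idea can be sketched as follows. This is a minimal illustration only: the function name and the counter_random name layout are assumptions made for the example, not Deacon's actual output format.

```python
import secrets


def random_read_names(n_reads):
    """Yield one anonymised name per read.

    Each name joins an incrementing counter with a random 64-bit integer.
    The "<counter>_<random>" layout is hypothetical, chosen for illustration.
    """
    for counter in range(1, n_reads + 1):
        # secrets.randbits(64) returns an integer in [0, 2**64)
        yield f"{counter}_{secrets.randbits(64)}"


if __name__ == "__main__":
    for name in random_read_names(3):
        print(name)
```

The counter keeps names unique within a file, while the random component makes a clash between names generated for separate files very unlikely.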

Changed

Groundwork for ensuring x86 Bioconda binaries are always built for the most portable x86-64-v3 (AVX2) target supporting all AMD and Intel CPUs released in the last decade (#71).
Use ensure_simd for compilation and runtime checks for SIMD (@RagnarGrootKoerkamp).
The crate minreq now uses the rustls backend enabling native compilation on a wider range of Linux systems. Implements a simple workaround for the previously described rustls & ring issue on MacOS ARM runners.

Contributors

nick297 and RagnarGrootKoerkamp

14 Nov 20:39

bede

0.13.1

b08d7e6

0.13.1

deacon index fetch now uses minreq rather than ureq to download indexes, removing dependency on rustls & ring, which caused a curious build error on Bioconda arm64 MacOS runners, meaning that 0.13.0 was released via Cargo only.

11 Nov 20:23

bede

0.13.0

04c2774

0.13.0

Added

Command deacon index fetch for downloading prebuilt indexes by name. If no index name is specified, the panhuman-1 index is downloaded by default.
Parallel gzip compression of output files with automatic ~1:1 thread allocation between filtering and compression tasks if .gz extensions for --output (-o) or --output2 (-O) arguments are detected.
- Automatic thread allocation can be overridden using the new --compression-threads argument.
- ~3x faster filtering when reading and writing gzip-compressed Illumina FASTQs.
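
A rough sketch of the ~1:1 allocation described above, assuming the thread budget is simply halved when any output path ends in .gz. The function name and the rounding behaviour are assumptions for illustration and may differ from Deacon's implementation.

```python
def split_threads(total_threads, outputs, compression_threads=None):
    """Return (filter_threads, compression_threads) for a given thread budget.

    outputs holds the --output / --output2 paths (None if not given).
    compression_threads mimics an explicit --compression-threads override.
    """
    wants_gzip = any(str(path).endswith(".gz") for path in outputs if path)
    if not wants_gzip:
        # No compressed output: every thread goes to filtering.
        return total_threads, 0
    if compression_threads is None:
        # Automatic ~1:1 split between filtering and compression.
        compression_threads = max(1, total_threads // 2)
    filter_threads = max(1, total_threads - compression_threads)
    return filter_threads, compression_threads


if __name__ == "__main__":
    print(split_threads(8, ["out_1.fastq.gz", "out_2.fastq.gz"]))  # (4, 4)
    print(split_threads(8, ["out.fastq"]))                         # (8, 0)
```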

16 Oct 14:40

bede

0.12.0

30a3175

0.12.0

Added

Command deacon index intersect for finding the intersection of two or more minimizer indexes.
Command deacon index dump for extracting minimizers from an index as plain text FASTA.
Command deacon cite showing citation info.
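
Conceptually, intersecting indexes is a set intersection over their minimizers, and dumping writes each minimizer as a FASTA record. The sketch below models indexes as plain Python sets of strings; the function names and the numeric record headers are assumptions for illustration, not Deacon's on-disk format or output.

```python
from functools import reduce


def intersect_indexes(*indexes):
    """Return the minimizers present in every supplied index."""
    if len(indexes) < 2:
        raise ValueError("at least two indexes are required")
    return reduce(set.intersection, (set(index) for index in indexes))


def dump_fasta(minimizers):
    """Render minimizers as plain-text FASTA with hypothetical numeric headers."""
    return "".join(
        f">{number}\n{minimizer}\n"
        for number, minimizer in enumerate(sorted(minimizers), start=1)
    )


if __name__ == "__main__":
    a = {"ACGT", "TTTT", "GGCA"}
    b = {"ACGT", "GGCA"}
    c = {"GGCA", "CCCC"}
    print(dump_fasta(intersect_indexes(a, b, c)), end="")
```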

Changed

Graceful handling of empty compressed input files (@pmenzel).
Fixes bug where --debug mode incorrectly showed the complement of the hitting minimizer.
Uses paraseq 0.4.3, addressing a bug identified in paraseq 0.4.2 causing FASTQ records without a trailing newline byte to be ignored.
2x increase in filtering throughput on arm64 / MacOS systems enabled by a series of optimisations in latest versions of libraries packed-seq, seq-hash and simd-minimizers (@RagnarGrootKoerkamp).

Contributors

pmenzel and RagnarGrootKoerkamp

07 Oct 16:02

bede

0.11.0

ab884ee

0.11.0

Incorporates new features, fixes and performance optimisations. Includes many PRs from @RagnarGrootKoerkamp, taking advantage of new features in simd-minimizers, packed-seq and paraseq (@imartayan, @noamteyssier) for improved correctness when handling sequences containing non-ACGT nucleotides. Filtering results are otherwise identical to 0.10.0. This release introduces a new index format that is smaller on disk, faster to load, and stores k-mers directly rather than k-mer hashes, providing flexibility for future performance optimisation without breaking index compatibility.

Added

Local (socket) server mode, enabling successive filter commands to be handled by a persistent server process for low latency filtering (@RagnarGrootKoerkamp).
Support for longer k-mers of up to length 61, where k+w ≤ 96 (packed-seq 4.1.1).

Changed

Faster filtering (x86_64)
Faster paired read filtering, particularly from separate input files which are now decompressed in parallel.
Faster indexing and index loading
While minimizers containing non-ACGT nucleotides were already discarded, minimizer selection could still be influenced by non-ACGT nucleotides present in the window, occasionally impacting results. Enabled by changes in simd-minimizers ≥ 2.0, entire windows containing non-ACGT nucleotides are discarded. Records containing non-ACGT nucleotides may therefore be classified differently to prior Deacon versions.
Redesigned index format (v3).
- Index now stores 'concrete' k-mers using 2*k bits rounded up to the nearest byte boundary, rather than always using 64-bit xxh3 k-mer hashes.
- Eliminates [tiny] risk of false positive matches caused by xxh3 collisions.
- Serialised k-mers are byte-aligned, balancing efficient storage and deserialisation speed.
- Index disk footprint reduced by 10%.
- Paves way for painless future adoption of faster HashSet implementations.
- Paves way for future index introspection functionality.
RapidHashSet (rapidhash::fast) replaces combined use of xxHash (xxh3) and FxHashSet
Fails gracefully given empty input files (#46) @pmenzel.
Bugfix for paired read I/O (#60) @funnell.
Feature gating for reduced compile times (@RagnarGrootKoerkamp).
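
The storage arithmetic behind the v3 index format (2*k bits per k-mer, rounded up to whole bytes) can be illustrated with a small sketch. The base-to-bits mapping and the little-endian byte order used here are assumptions for the example and are not taken from the Deacon or packed-seq sources.

```python
# Hypothetical 2-bit encoding; the real library may order bases differently.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = {bits: base for base, bits in ENCODE.items()}


def kmer_bytes(k):
    """Bytes needed for one k-mer: 2*k bits rounded up to a byte boundary."""
    return (2 * k + 7) // 8


def pack_kmer(kmer):
    """Pack an ACGT-only k-mer into a byte-aligned value."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value.to_bytes(kmer_bytes(len(kmer)), "little")


def unpack_kmer(data, k):
    """Recover the k-mer string from its packed bytes."""
    value = int.from_bytes(data, "little")
    bases = []
    for _ in range(k):
        bases.append(DECODE[value & 3])
        value >>= 2
    return "".join(reversed(bases))


if __name__ == "__main__":
    print(kmer_bytes(31))            # 8 (62 bits fit in 8 bytes)
    print(kmer_bytes(15))            # 4 (30 bits fit in 4 bytes)
    print(pack_kmer("ACGT").hex())   # 1b
```

Because the k-mer itself is stored rather than a hash of it, two distinct k-mers can never share an entry, which is why hash collisions stop being a source of false positives.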

Removed

Removed --capacity argument, which was easily misused for little performance benefit.

Contributors

funnell, pmenzel, and 3 other contributors

01 Sep 15:25

bede

0.10.0

6467204

0.10.0

Support for k-mer length up to 57 (previously 32) (@RagnarGrootKoerkamp)

Contributors

RagnarGrootKoerkamp

15 Aug 18:19

bede

0.9.0

64386eb

0.9.0

Performance optimisations (#27) deliver up to 80% faster filtering with unchanged accuracy (@RagnarGrootKoerkamp).

>2Gbp/s with uncompressed long read input.
>500Mbp/s with gzip-compressed long read input.

Contributors

RagnarGrootKoerkamp

14 Aug 15:22

bede

0.8.1

25a822b

0.8.1

Fixes bug handling paired reads introduced in 0.8.0 which could lead to mispaired read output (@KaibondChau)
Fixes bug handling multiline FASTA input introduced in 0.8.0 (@RagnarGrootKoerkamp)

Contributors

RagnarGrootKoerkamp and KaibondChau

11 Aug 15:11

bede

0.8.0

d854f02

0.8.0

Faster filtering on multicore systems through improved work allocation using the Paraseq library (@noamteyssier). Filtering at >1Gbp/s is possible with uncompressed long sequences, and >500Mbp/s is achievable on many systems with Gzip-compressed long reads. Filtering Illumina reads is roughly twice as fast as before in my testing at ~200Mbp/s.
Added independent absolute (-a) and relative (-r) match thresholds with respective default values of 2 and 0.01 (1%). The new default relative threshold improves search specificity for long sequences over the previous absolute-only default threshold, without affecting short read accuracy. These replace the previous dual purpose -m parameter which could accept either an absolute (integer) threshold or a relative (float) threshold.
Minimizers containing ambiguous nucleotides are now ignored.
deacon index now offers the ability to discard minimizers with information content below a specified scaled Shannon --entropy (-e) threshold. This is disabled by default.
deacon filter now has a --debug mode which prints all records with minimizer matches to stderr including the matched minimizer sequence(s).
The default worst-case hash table capacity preallocation used in deacon index union operations can now be overridden with the new --capacity (-c) argument, in similar fashion to deacon index build.
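
How the independent absolute and relative thresholds might combine is sketched below. It assumes that a record must satisfy both thresholds to count as a match, and that the relative threshold is measured against the record's total number of minimizers. Both points are assumptions for illustration and should be checked against the Deacon documentation.

```python
def is_match(hits, total_minimizers, abs_threshold=2, rel_threshold=0.01):
    """Decide whether a record matches the index.

    hits: number of the record's minimizers found in the index.
    total_minimizers: number of minimizers in the record.
    Defaults mirror the stated -a 2 and -r 0.01 values.
    """
    if total_minimizers == 0:
        return False
    meets_absolute = hits >= abs_threshold
    meets_relative = hits / total_minimizers >= rel_threshold
    return meets_absolute and meets_relative


if __name__ == "__main__":
    print(is_match(2, 50))     # True: short read, 2 hits is 4% of minimizers
    print(is_match(2, 1000))   # False: long read, 2 hits is only 0.2%
    print(is_match(20, 1000))  # True: 2% of minimizers hit
```

Under these assumptions, the absolute threshold dominates for short reads, while the relative threshold stops a handful of chance hits from flagging a long sequence.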

Contributors

noamteyssier

08 Jul 13:32

bede

0.7.0

2a973f0

0.7.0

Deacon now uses the recently added simd-minimizers::iter_canonical_minimizer_values(), increasing filtering speed by up to 50% on Linux/x86_64 systems. Speeds of 1Gbp/s have been observed with uncompressed FASTA input. Thanks @RagnarGrootKoerkamp for a PR and improvements to simd-minimizers.
- Index format is now version 2. Existing indexes must be rebuilt for use with this version. A new version of the panhuman-1 index is available from Zenodo and object storage. Attempting to load an incompatible index throws an error.
deacon index diff can now accept a fastx file or stream in place of a second index. This enables index masking using massive sequence collections without the need to first index them.
Position-dependent IUPAC ambiguous base canonicalisation was replaced with a simpler and faster fixed mapping, meaning that records containing ambiguous IUPAC bases may be classified differently to before.
deacon index union now automatically preallocates the required hash table capacity, eliminating slowdowns when combining indexes.
Compatible minimizer k and w is now validated (k+w-1 must be odd) prior to indexing.
Default index capacity is now 400M (was 500M).
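
The k and w compatibility rule stated above (k+w-1 must be odd) amounts to a one-line parity check. The function name and error messages below are hypothetical.

```python
def validate_k_w(k, w):
    """Raise ValueError unless k and w satisfy the stated rule: k + w - 1 is odd."""
    if k < 1 or w < 1:
        raise ValueError("k and w must be positive integers")
    if (k + w - 1) % 2 == 0:
        raise ValueError(f"k + w - 1 must be odd, got {k + w - 1} (k={k}, w={w})")


if __name__ == "__main__":
    validate_k_w(31, 15)  # accepted: 31 + 15 - 1 = 45 is odd
    try:
        validate_k_w(31, 16)  # rejected: 31 + 16 - 1 = 46 is even
    except ValueError as error:
        print(error)
```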

Contributors

RagnarGrootKoerkamp
