Releases: bede/deacon
0.13.2
Added
- deacon filter accepts --rename-random for anonymising read names using both incrementing and random (64 bit) integers to practically assure uniqueness. This mitigates a reported issue relating to identical read names across separate files causing problems during upload to the European Nucleotide Archive (#67, @nick297).
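To illustrate why pairing an incrementing counter with a random 64-bit integer makes name clashes across separate files improbable, here is a minimal Python sketch. The function anonymised_names and the counter_hex name format are hypothetical and are not taken from Deacon's implementation.

```python
import secrets
from itertools import count


def anonymised_names(n_reads):
    """Yield n_reads replacement read names in a hypothetical format.

    Each name joins an incrementing counter (unique within one run) with a
    random 64-bit integer (makes clashes between separate runs improbable).
    """
    counter = count(1)
    for _ in range(n_reads):
        # 16 hex digits encode the 64 random bits.
        yield f"{next(counter)}_{secrets.randbits(64):016x}"


if __name__ == "__main__":
    for name in anonymised_names(3):
        print(name)
```

The counter alone would repeat between files, and the random part alone could in principle repeat within a file; combining them covers both cases.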
Changed
- Groundwork for ensuring x86 Bioconda binaries are always built for the most portable x86-64-v3 (AVX2) target supporting all AMD and Intel CPUs released in the last decade (#71).
- Use ensure_simd for compilation and runtime checks for SIMD (@RagnarGrootKoerkamp).
- The crate minreq now uses the rustls backend enabling native compilation on a wider range of Linux systems. Implements a simple workaround for the previously described rustls & ring issue on MacOS ARM runners.
0.13.1
0.13.0
Added
- Command deacon index fetch for downloading prebuilt indexes by name. If no index name is specified, the panhuman-1 index is downloaded by default.
- Parallel gzip compression of output files with automatic ~1:1 thread allocation between filtering and compression tasks if .gz extensions for --output (-o) or --output2 (-O) arguments are detected.
  - Automatic thread allocation can be overridden using the new --compression-threads argument.
  - ~3x faster filtering when reading and writing gzip-compressed Illumina FASTQs.
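The thread allocation described above could look roughly like the following Python sketch. The function split_threads, its rounding behaviour and the minimum of one thread per task are assumptions made for illustration; only the ~1:1 split, the .gz detection and the existence of an override come from the notes above.

```python
def split_threads(total_threads, outputs, compression_threads=None):
    """Return (filter_threads, compression_threads) under an assumed policy."""
    # Compression threads are only needed when an output path ends in .gz.
    gz_outputs = any(str(path).endswith(".gz") for path in outputs)
    if not gz_outputs:
        return total_threads, 0
    # No explicit override: split roughly 1:1 between the two tasks.
    if compression_threads is None:
        compression_threads = max(1, total_threads // 2)
    filter_threads = max(1, total_threads - compression_threads)
    return filter_threads, compression_threads
```

With eight threads and gzipped output this gives four threads to each task; passing compression_threads=2 leaves six for filtering.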
0.12.0
Added
- Command deacon index intersect for finding the intersection of two or more minimizer indexes.
- Command deacon index dump for extracting minimizers from an index as plain text FASTA.
- Command deacon cite showing citation info.
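Conceptually, intersecting indexes is a set intersection over their minimizers, and dumping writes each minimizer as a FASTA record. The Python sketch below models an index as a plain set of strings; the function names and the numeric FASTA headers are hypothetical and do not describe Deacon's on-disk format or its actual output.

```python
from functools import reduce


def intersect_indexes(*indexes):
    """Return the minimizers present in every one of two or more indexes."""
    if len(indexes) < 2:
        raise ValueError("need at least two indexes")
    return reduce(set.intersection, (set(ix) for ix in indexes))


def dump_fasta(minimizers):
    """Render minimizers as FASTA text with hypothetical numeric headers."""
    return "".join(
        f">{i}\n{kmer}\n" for i, kmer in enumerate(sorted(minimizers), start=1)
    )
```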
Changed
- Graceful handling of empty compressed input files (@pmenzel).
- Fixes bug where --debug mode incorrectly showed the complement of the hitting minimizer.
- Uses paraseq 0.4.3, addressing a bug identified in paraseq 0.4.2 causing FASTQ records without a trailing newline byte to be ignored.
- 2x increase in filtering throughput on arm64 / MacOS systems enabled by a series of optimisations in latest versions of libraries packed-seq, seq-hash and simd-minimizers (@RagnarGrootKoerkamp).
0.11.0
Incorporates new features, fixes and performance optimisations. Includes many PRs from @RagnarGrootKoerkamp, taking advantage of new features in simd-minimizers, packed-seq and paraseq (@imartayan, @noamteyssier) for improved correctness when handling sequences containing non-ACGT nucleotides. Filtering results are otherwise identical to 0.10.0. This release introduces a new index format that is smaller on disk, faster to load, and stores k-mers directly rather than k-mer hashes, providing flexibility for future performance optimisation without breaking index compatibility.
Added
- Local (socket) server mode, enabling successive filter commands to be handled by a persistent server process for low latency filtering (@RagnarGrootKoerkamp).
- Support for longer k-mers of up to length 61, where k+w ≤ 96 (packed-seq 4.1.1).
Changed
- Faster filtering (x86_64)
- Faster paired read filtering, particularly from separate input files which are now decompressed in parallel.
- Faster indexing and index loading
- While minimizers containing non-ACGT nucleotides were already discarded, minimizer selection could still be influenced by non-ACGT nucleotides present in the window, occasionally impacting results. Enabled by changes in simd-minimizers ≥ 2.0, entire windows containing non-ACGT nucleotides are discarded. Records containing non-ACGT nucleotides may therefore be classified differently to prior Deacon versions.
- Redesigned index format (v3).
  - Index now stores 'concrete' k-mers using 2*k bits rounded up to the nearest byte boundary, rather than always using 64-bit xxh3 k-mer hashes.
  - Eliminates [tiny] risk of false positive matches caused by xxh3 collisions.
  - Serialised k-mers are byte-aligned, balancing efficient storage and deserialisation speed.
  - Index disk footprint reduced by 10%.
  - Paves way for painless future adoption of faster HashSet implementations.
  - Paves way for future index introspection functionality.
- RapidHashSet (rapidhash::fast) replaces combined use of xxHash (xxh3) and FxHashSet.
- Fails gracefully given empty input files (#46) @pmenzel.
- Bugfix for paired read I/O (#60) @funnell.
- Feature gating for reduced compile times @RagnarGrootKoerkamp.
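The byte-aligned k-mer storage introduced with index format v3 can be sketched in Python as follows. The A=0, C=1, G=2, T=3 encoding, the little-endian byte order and the function names are assumptions for illustration and may differ from Deacon's actual serialisation.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding
BASES = "ACGT"


def bytes_per_kmer(k):
    """Bytes needed to hold 2*k bits, rounded up to a whole byte."""
    return (2 * k + 7) // 8


def pack_kmer(kmer):
    """Pack an ACGT-only k-mer into the smallest whole number of bytes."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODES[base]
    return value.to_bytes(bytes_per_kmer(len(kmer)), "little")


def unpack_kmer(data, k):
    """Recover the k-mer string from its packed bytes."""
    value = int.from_bytes(data, "little")
    bases = []
    for _ in range(k):
        bases.append(BASES[value & 3])  # lowest two bits hold the last base
        value >>= 2
    return "".join(reversed(bases))
```

Because the k-mer itself is stored rather than a hash of it, two different k-mers can never map to the same stored value, which is why hash collisions stop being a concern.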
Removed
- Removed --capacity argument, which was easily misused for little performance benefit.
0.10.0
- Support for k-mer length up to 57 (previously 32) (@RagnarGrootKoerkamp)
0.9.0
0.8.1
- Fixes bug handling paired reads introduced in 0.8.0 which could lead to mispaired read output (@KaibondChau)
- Fixes bug handling multiline FASTA input introduced in 0.8.0 (@RagnarGrootKoerkamp)
0.8.0
- Faster filtering on multicore systems through improved work allocation using the Paraseq library (@noamteyssier). Filtering at >1Gbp/s is possible with uncompressed long sequences, and >500Mbp/s is achievable on many systems with Gzip-compressed long reads. Filtering Illumina reads is roughly twice as fast as before in my testing at ~200Mbp/s.
- Added independent absolute (-a) and relative (-r) match thresholds with respective default values of 2 and 0.01 (1%). The new default relative threshold improves search specificity for long sequences over the previous absolute-only default threshold, without affecting short read accuracy. These replace the previous dual purpose -m parameter which could accept either an absolute (integer) threshold or a relative (float) threshold.
- Minimizers containing ambiguous nucleotides are now ignored.
- deacon index now offers the ability to discard minimizers with information content below a specified scaled Shannon --entropy (-e) threshold. This is disabled by default.
- deacon filter now has a --debug mode which prints all records with minimizer matches to stderr including the matched minimizer sequence(s).
- The default worst-case hash table capacity preallocation used in deacon index union operations can now be overridden with the new --capacity (-c) argument, in similar fashion to deacon index build.
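The interplay of the absolute and relative thresholds can be illustrated with a short Python sketch. It assumes that a record counts as a match only when both thresholds are met, which these notes do not state explicitly; the function is_match and its parameter names are hypothetical.

```python
def is_match(hits, total_minimizers, abs_threshold=2, rel_threshold=0.01):
    """Assumed rule: a record matches only if both thresholds are met."""
    if total_minimizers == 0:
        return False
    enough_hits = hits >= abs_threshold
    enough_fraction = hits / total_minimizers >= rel_threshold
    return enough_hits and enough_fraction
```

Under this assumption a short read with 20 minimizers still matches on two hits, while a long read with 1000 minimizers needs at least ten, which is how a relative threshold can raise specificity for long sequences without changing short read behaviour.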
0.7.0
- Deacon now uses the recently added simd-minimizers::iter_canonical_minimizer_values(), increasing filtering speed by up to 50% on Linux/x86_64 systems. Speeds of 1Gbp/s have been observed with uncompressed FASTA input. Thanks @RagnarGrootKoerkamp for a PR and improvements to simd-minimizers.
- Index format is now version 2. Existing indexes must be rebuilt for use with this version. A new version of the panhuman-1 index is available from Zenodo and object storage. Attempting to load an incompatible index throws an error.
- deacon index diff can now accept a fastx file or stream in place of a second index. This enables index masking using massive sequence collections without the need to first index them.
- Position-dependent IUPAC ambiguous base canonicalisation was replaced with a simpler and faster fixed mapping, meaning that records containing ambiguous IUPAC bases may be classified differently to before.
- deacon index union now automatically preallocates the required hash table capacity, eliminating slowdowns when combining indexes.
- Compatible minimizer k and w is now validated (k+w-1 must be odd) prior to indexing.
- Default index capacity is now 400M (was 500M).
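The k and w compatibility rule introduced in this release (k+w-1 must be odd) amounts to a one-line check. The Python function below is a hypothetical sketch of that rule, not Deacon's validation code.

```python
def compatible_k_w(k, w):
    """True when k and w are positive and k + w - 1 is odd."""
    return k >= 1 and w >= 1 and (k + w - 1) % 2 == 1
```

For example, k=31 with w=15 gives 45 and passes, whereas k=31 with w=16 gives 46 and would be rejected.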