Make bcf_readrec able to update phasing, and faster updatephasing() function #2
Merged
vasudeva8 merged 5 commits into vasudeva8:phase44update1 from Sep 4, 2025
The motivation for this is to enable passing a pointer to a bcf_hdr_t structure to bcf_readrec(), which currently does not receive one. It does always get a pointer to the BGZF handle, so a header struct could be passed in via that handle if it can be stored somewhere. To enable this without changing the bgzf API or ABI, extra fields are added to the opaque bgzf_cache_t field.
The BGZF_CACHE macro, which could be used to disable the cache feature, is removed as it was always turned on anyway. The cache struct now has to be created for files opened for writing, although the cache part is not used. The hash type used by the cache is renamed from "cache" to "bgzf_cache" to improve its name-spacing.
The interfaces to add, get, and remove private data are put in a new bgzf_internal.h header. The bgzf_cache_t struct definition is also moved there so that the get function can be inlined for faster access to the private data field. The bgzf_cache_t definition is rewritten slightly so that it is not necessary to invoke KHASH_MAP_INIT_INT64() before it in the header file, as doing that would require struct cache_t to be moved from bgzf.c to the new header as well. Instead, the typedef kh_bgzf_cache_t is used in place of khash(bgzf_cache), and unsigned int is used instead of khint_t.
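The private data mechanism described above might look roughly like the following. This is a minimal sketch only: every `demo_` name is hypothetical and stands in for the real BGZF and bgzf_cache_t types, whose actual field names and function signatures in bgzf_internal.h are not shown here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not HTSlib's real definitions. */
typedef struct demo_cache {
    void *private_data;                 /* caller-owned pointer, e.g. a header */
    void (*private_free)(void *data);   /* optional cleanup hook, may be NULL */
    /* ... the block-cache fields themselves would follow here ... */
} demo_cache_t;

typedef struct demo_bgzf {
    int is_write;
    demo_cache_t *cache;                /* opaque to users of the public API */
} demo_bgzf_t;

/* The cache struct is allocated for both read and write handles, so the
 * private data slot always exists even when no blocks are cached. */
demo_bgzf_t *demo_open(int is_write)
{
    demo_bgzf_t *fp = calloc(1, sizeof(*fp));
    if (!fp) return NULL;
    fp->is_write = is_write;
    fp->cache = calloc(1, sizeof(*fp->cache));
    if (!fp->cache) { free(fp); return NULL; }
    return fp;
}

/* Defined inline so hot paths can fetch the pointer cheaply. */
static inline void *demo_get_private(const demo_bgzf_t *fp)
{
    assert(fp != NULL && fp->cache != NULL);
    return fp->cache->private_data;
}

/* Replace any existing private data, running its cleanup hook first. */
void demo_set_private(demo_bgzf_t *fp, void *data, void (*cleanup)(void *))
{
    assert(fp != NULL && fp->cache != NULL);
    if (fp->cache->private_free && fp->cache->private_data)
        fp->cache->private_free(fp->cache->private_data);
    fp->cache->private_data = data;
    fp->cache->private_free = cleanup;
}

void demo_close(demo_bgzf_t *fp)
{
    if (!fp) return;
    demo_set_private(fp, NULL, NULL);   /* releases attached data, if any */
    free(fp->cache);
    free(fp);
}
```

Because the new fields sit behind a pointer whose layout is not part of the public header, the size and layout of the public handle struct stay the same in this sketch, which is the property the PR relies on to avoid an ABI change.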
For bcf files, the header pointer hasn't always been passed into bcf_read(), especially when using iterators. As having it available would be useful for VCF 4.4+ support, this works around its absence by attaching a pointer to the header in BGZF private data, which was previously unused for vcf/bcf. It also adds reference counting to the header so that it can be cleaned up safely irrespective of whether hts_close() or bcf_hdr_destroy() was called first. To avoid ABI breakage, the reference count is stored in the bcf_hdr_aux_t struct.
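As an illustration of the reference counting idea, here is a minimal sketch. The `demo_` types and functions are hypothetical, the count is a plain `int` (so this version is not thread-safe), and the real bcf_hdr_aux_t layout is not reproduced.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the count lives in an auxiliary struct so the
 * public header struct keeps its size and layout. */
typedef struct demo_hdr_aux {
    int ref_count;
} demo_hdr_aux_t;

typedef struct demo_hdr {
    demo_hdr_aux_t *aux;
} demo_hdr_t;

/* A new header starts with one reference, owned by its creator. */
demo_hdr_t *demo_hdr_init(void)
{
    demo_hdr_t *h = calloc(1, sizeof(*h));
    if (!h) return NULL;
    h->aux = calloc(1, sizeof(*h->aux));
    if (!h->aux) { free(h); return NULL; }
    h->aux->ref_count = 1;
    return h;
}

/* Taken when the header is attached to a file handle as private data. */
void demo_hdr_ref(demo_hdr_t *h)
{
    assert(h != NULL && h->aux != NULL);
    h->aux->ref_count++;
}

/* Drop one reference. Returns 1 if the header was actually freed,
 * 0 if another owner still holds it. */
int demo_hdr_unref(demo_hdr_t *h)
{
    if (!h) return 0;
    assert(h->aux != NULL && h->aux->ref_count > 0);
    if (--h->aux->ref_count > 0) return 0;
    free(h->aux);
    free(h);
    return 1;
}
```

In this sketch both the "close file" path and the "destroy header" path would call `demo_hdr_unref()`, so whichever runs last performs the free and the order of the two calls does not matter.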
BCF files saved by versions of HTSlib before 1.22 always store the first phasing bit as 0. For consistency with the VCF reader, update this bit when reading BCF so that it is set if all other phasing bits are also set.
Phasing should now be fixed up in bcf_read()/vcf_read(), so there's no need to try again in bcf_get_format_values().
By noting that we're only interested in the least-significant bit of each GT value, it's possible to reduce the number of branches in this function by doing bit manipulations on the first byte of each stored value. The common haploid and diploid cases are also specialised so the inner loop on ploidy can be avoided for those cases.
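A rough sketch of this approach is shown below. It is not the PR's implementation: `demo_update_phasing()` and its parameters are hypothetical, the values are assumed to be stored little-endian (so the least-significant bit of each value is in its first byte), and sentinel values for missing or short vectors are ignored to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* data:      packed GT values for all samples
 * n_samples: number of samples
 * ploidy:    values per sample
 * width:     bytes per value (1, 2 or 4), stored little-endian */
void demo_update_phasing(uint8_t *data, size_t n_samples, int ploidy,
                         size_t width)
{
    assert(data != NULL && ploidy >= 1 && width >= 1);
    size_t stride = (size_t)ploidy * width;

    if (ploidy == 1) {
        /* Haploid: no other phase bits exist, so always set the bit. */
        for (size_t s = 0; s < n_samples; s++)
            data[s * stride] |= 1;
    } else if (ploidy == 2) {
        /* Diploid: copy the second allele's phase bit into the first. */
        for (size_t s = 0; s < n_samples; s++) {
            uint8_t *p = data + s * stride;
            p[0] |= p[width] & 1;
        }
    } else {
        /* General case: AND the low bits of alleles 2..ploidy together. */
        for (size_t s = 0; s < n_samples; s++) {
            uint8_t *p = data + s * stride;
            uint8_t bit = 1;
            for (int j = 1; j < ploidy; j++)
                bit &= p[(size_t)j * width];
            p[0] |= bit;
        }
    }
}
```

Working on the first byte means the same loops serve 8-, 16- and 32-bit storage, and the OR/AND operations replace the per-value conditionals inside each loop.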
BGZF struct without changing the API or ABI
bcf_hdr_t struct and add it as BGZF private data