@Ofekirsh
Summary

This PR introduces several substantial improvements to the matching pipeline, focusing on correctness, robustness, and extensibility, while keeping performance and memory usage under control.

Key contributions:

  • Correct computation of GVH / HVG mismatches
  • Fix for a missing-donor enumeration bug
  • Generalization of matching logic to support arbitrary k loci
  • Support for additional allele formats with compact storage

Detailed Changes

1. Correct GVH/HVG and Mismatch Computation

Overview:
Alleles at each locus are now treated as sets (duplicates removed) for both patient and donor. This ensures accurate mismatch calculations.

Per-Locus Metrics:
For each locus ℓ, the following metrics are computed:

  • GVH = |Patient \ Donor| (alleles present in patient but not in donor)
  • HVG = |Donor \ Patient| (alleles present in donor but not in patient)
  • Mismatches = max(GVH, HVG)

Patient–Donor Total Scores:
Aggregate scores are computed by summing across all loci:

  • Total GVH = ∑ GVH
  • Total HVG = ∑ HVG
  • Total Mismatches = ∑ max(GVH, HVG)

Modified File:
grma/match/donors_matching.py
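
As an illustration of the per-locus and total scores defined above, here is a minimal sketch. It is not the code in grma/match/donors_matching.py: the names `LocusScore`, `score_locus`, and `score_pair` are hypothetical, and genotypes are assumed to be plain dicts mapping a locus name to its allele strings.

```python
from typing import Dict, Iterable, NamedTuple


class LocusScore(NamedTuple):
    gvh: int         # alleles in patient but not in donor
    hvg: int         # alleles in donor but not in patient
    mismatches: int  # max(gvh, hvg)


def score_locus(patient_alleles: Iterable[str], donor_alleles: Iterable[str]) -> LocusScore:
    """Score one locus, treating each side as a set (duplicates removed)."""
    patient = set(patient_alleles)
    donor = set(donor_alleles)
    gvh = len(patient - donor)
    hvg = len(donor - patient)
    return LocusScore(gvh, hvg, max(gvh, hvg))


def score_pair(patient: Dict[str, Iterable[str]], donor: Dict[str, Iterable[str]]) -> LocusScore:
    """Sum the per-locus scores over all loci (assumes both sides type the same loci)."""
    if patient.keys() != donor.keys():
        raise ValueError("patient and donor must be typed at the same loci")
    scores = [score_locus(patient[locus], donor[locus]) for locus in patient]
    return LocusScore(
        sum(s.gvh for s in scores),
        sum(s.hvg for s in scores),
        sum(s.mismatches for s in scores),  # sum of per-locus maxima
    )
```

Note that Total Mismatches is the sum of the per-locus maxima, which can exceed max(Total GVH, Total HVG) when the direction of the mismatch differs between loci.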


2. Fix: missing donor in candidate list

In cpdef tuple neighbors_2nd(self, UINT node), a duplicated -1 placeholder caused one valid donor to be silently dropped from the candidate list.

Removing the extra placeholder restores complete and correct candidate enumeration.

File:
grma/match/lol_graph.pyx


3. Generalization to k loci

All hard-coded assumptions about the number of loci (e.g. the magic constant 10) were removed and replaced with configurable, data-driven logic.

  • Verified for 4-, 5-, 6-, and 9-locus matching
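
One way to picture the data-driven approach is to derive the locus count from the genotype itself rather than from a constant. The sketch below is an assumption for illustration only: it supposes a flat allele sequence with two consecutive entries per locus, and `split_by_locus` is a hypothetical helper, not a function from this PR.

```python
from typing import List, Sequence, Tuple


def split_by_locus(alleles: Sequence[int], ploidy: int = 2) -> List[Tuple[int, ...]]:
    """Group a flat allele sequence into per-locus tuples.

    The number of loci k is inferred from the input length instead of
    being hard-coded (for example as 10 alleles for 5 loci).
    """
    if ploidy <= 0:
        raise ValueError("ploidy must be positive")
    if len(alleles) % ploidy != 0:
        raise ValueError("allele count is not a multiple of the ploidy")
    k = len(alleles) // ploidy  # number of loci, derived from the data
    return [tuple(alleles[i * ploidy:(i + 1) * ploidy]) for i in range(k)]
```

With this shape, the same call handles a 4-locus genotype (8 alleles) and a 9-locus genotype (18 alleles) without any change to the caller.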

4. Support additional allele formats with compact storage

Allele parsing and handling were extended beyond the strict xy:wz format to allow:

  • Varying digit widths
  • Additional fields and representations

To prevent donor-tree explosion, alleles are stored using compact integer UIDs via a bidirectional mapping (bidict):

  • Core classes operate on hashed indices for speed and memory efficiency
  • Hash collisions are harmless: exact mismatch checks are performed on the alleles themselves, and each allele is assigned a unique ID stored in the bidict.

This preserves correctness while significantly reducing memory pressure.
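
The idea of the allele-to-UID mapping can be sketched with the standard library alone. The PR uses the bidict package; the two-container stand-in below is only meant to show the concept, and the `AlleleRegistry` class and its method names are hypothetical.

```python
from typing import Dict, List


class AlleleRegistry:
    """Assign each distinct allele string a small, stable integer UID.

    Stand-in for a bidirectional mapping: a dict for allele -> UID and
    a list for UID -> allele.
    """

    def __init__(self) -> None:
        self._id_by_allele: Dict[str, int] = {}
        self._allele_by_id: List[str] = []

    def get_id(self, allele: str) -> int:
        """Return the UID for an allele, registering it on first use."""
        uid = self._id_by_allele.get(allele)
        if uid is None:
            uid = len(self._allele_by_id)  # next unused integer
            self._id_by_allele[allele] = uid
            self._allele_by_id.append(allele)
        return uid

    def get_allele(self, uid: int) -> str:
        """Return the original allele string for a UID."""
        return self._allele_by_id[uid]

    def __len__(self) -> int:
        return len(self._allele_by_id)
```

Because UIDs are assigned sequentially, two different allele strings never share an ID regardless of their format or length, and the original string can always be recovered for an exact comparison.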

Affected components:

  • Tree construction and maintenance
  • Matching logic
  • LOL graph building
  • Utility functions

New dependency: bidict


@mmaiers-nmdp
Contributor

mmaiers-nmdp commented Dec 19, 2025

@Ofekirsh can you fix the errors in the test and formatting?
It doesn't look like the pre-commit was run.
If you do a make install then this will get checked at commit time.

Also, I don't see how it refers to the MLO branch of py-graph-imputation which I thought was required.
To merge the PR the repo needs to be in a state where a clean build (still) works.
