generalized match algorithm for k loci #8

Ofekirsh · 2025-12-18T09:42:48Z

Summary

This PR introduces several substantial improvements to the matching pipeline, focusing on correctness, robustness, and extensibility, while keeping performance and memory usage under control.

Key contributions:

Correct computation of GVH / HVG mismatches
Fix for a missing-donor enumeration bug
Generalization of matching logic to support arbitrary k loci
Support for additional allele formats with compact storage

Detailed Changes

1. Correct GVH/HVG and Mismatch Computation

Overview:
Alleles at each locus are now treated as sets (duplicates removed) for both patient and donor, so a homozygous locus contributes a single allele to the set differences below. This ensures accurate mismatch calculations.

Per-Locus Metrics:
For each locus ℓ, the following metrics are computed:

GVH_ℓ = |Patient \ Donor| (alleles present in patient but not in donor)
HVG_ℓ = |Donor \ Patient| (alleles present in donor but not in patient)
Mismatches_ℓ = max(GVH_ℓ, HVG_ℓ)

Patient–Donor Total Scores:
Aggregate scores are computed by summing across all loci:

Total GVH = ∑_ℓ GVH_ℓ
Total HVG = ∑_ℓ HVG_ℓ
Total Mismatches = ∑_ℓ max(GVH_ℓ, HVG_ℓ)
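The per-locus and total formulas above can be sketched in a few lines of Python. This is only an illustration of the set arithmetic, not the code in grma/match/donors_matching.py: the function names, the list-of-pairs genotype layout, and the allele strings are all assumptions made for the example.

```python
def locus_mismatches(patient_alleles, donor_alleles):
    """Return (GVH, HVG, mismatches) for a single locus."""
    # Treat alleles as sets so a homozygous locus counts one allele.
    patient = set(patient_alleles)
    donor = set(donor_alleles)
    gvh = len(patient - donor)  # in patient, not in donor
    hvg = len(donor - patient)  # in donor, not in patient
    return gvh, hvg, max(gvh, hvg)


def total_mismatches(patient_genotype, donor_genotype):
    """Sum the per-locus metrics over all loci (any number of loci)."""
    if len(patient_genotype) != len(donor_genotype):
        raise ValueError("patient and donor must have the same number of loci")
    total_gvh = total_hvg = total_mm = 0
    for patient_locus, donor_locus in zip(patient_genotype, donor_genotype):
        gvh, hvg, mm = locus_mismatches(patient_locus, donor_locus)
        total_gvh += gvh
        total_hvg += hvg
        total_mm += mm
    return total_gvh, total_hvg, total_mm


# Hypothetical two-locus example (allele strings are illustrative only).
patient = [("01:01", "02:01"), ("07:02", "07:02")]
donor = [("01:01", "01:01"), ("07:02", "08:01")]
totals = total_mismatches(patient, donor)
```

In this example the first locus has one GVH mismatch and the second has one HVG mismatch, so the total mismatch count is 2 even though Total GVH and Total HVG are each 1. That is why the total is summed per locus rather than taken as max(Total GVH, Total HVG).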

Modified File:
grma/match/donors_matching.py

2. Fix: missing donor in candidate list

In cpdef tuple neighbors_2nd(self, UINT node), a duplicated -1 placeholder caused one valid donor to be silently dropped from the candidate list.

Removing the extra placeholder restores complete and correct candidate enumeration.

Modified File:
grma/match/lol_graph.pyx

3. Generalization to k loci

All hard-coded assumptions about the number of loci (e.g. the magic constant 10) were removed and replaced with configurable, data-driven logic.

Verified for 4-, 5-, 6-, and 9-locus matching.
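As a rough sketch of what "data-driven" can mean here, the number of loci can be derived from the length of the genotype instead of being fixed in the code. The helper below is hypothetical and assumes a flat sequence with two alleles per locus; the PR's actual data layout may differ.

```python
def split_into_loci(alleles, alleles_per_locus=2):
    """Group a flat allele sequence into per-locus tuples.

    The locus count k is inferred from the input length rather than
    hard-coded (for example as 10 entries for 5 loci).
    """
    if alleles_per_locus <= 0:
        raise ValueError("alleles_per_locus must be positive")
    if len(alleles) % alleles_per_locus != 0:
        raise ValueError("allele count is not a multiple of alleles_per_locus")
    return [
        tuple(alleles[i:i + alleles_per_locus])
        for i in range(0, len(alleles), alleles_per_locus)
    ]


# The same helper handles any k: here 4 loci and 9 loci.
four_loci = split_into_loci(list(range(8)))
nine_loci = split_into_loci(list(range(18)))
```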

4. Support additional allele formats with compact storage

Allele parsing and handling were extended beyond the strict xy:wz format to allow:

Varying digit widths
Additional fields and representations

To prevent donor-tree explosion, alleles are stored using compact integer UIDs via a bidirectional mapping (bidict):

Core classes operate on hashed indices for speed and memory efficiency
Hash collisions are harmless: exact mismatch checks are performed on the alleles themselves, and each allele is assigned a unique ID stored in the bidict.

This preserves correctness while significantly reducing memory pressure.
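The allele-to-UID idea can be illustrated with a minimal sketch. To stay dependency-free it uses a dict and a list as a stand-in for the bidict the PR depends on; the class name and methods are hypothetical and not taken from the PR.

```python
class AlleleRegistry:
    """Assign each distinct allele string a compact integer UID.

    Stand-in for a bidirectional mapping: a dict gives allele -> UID,
    and a list gives UID -> allele.
    """

    def __init__(self):
        self._uid_by_allele = {}
        self._allele_by_uid = []

    def uid(self, allele):
        """Return the UID for an allele, registering it on first use."""
        existing = self._uid_by_allele.get(allele)
        if existing is not None:
            return existing
        new_uid = len(self._allele_by_uid)
        self._uid_by_allele[allele] = new_uid
        self._allele_by_uid.append(allele)
        return new_uid

    def allele(self, uid):
        """Return the original allele string for a UID."""
        return self._allele_by_uid[uid]


# Allele strings of different widths and field counts get consecutive UIDs.
registry = AlleleRegistry()
encoded = [registry.uid(a) for a in ("01:01", "02:01:01", "101:01", "01:01")]
```

Because UIDs are assigned one per distinct string, two different alleles never share an ID, and comparisons on the integers are exact regardless of the allele's textual format.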

Affected components:

Tree construction and maintenance
Matching logic
LOL graph building
Utility functions

New dependency: bidict

mmaiers-nmdp · 2025-12-19T20:20:12Z (edited by pbashyal-nmdp)

@Ofekirsh can you fix the test errors and the formatting?
It doesn't look like pre-commit was run.
If you run make install, this will be checked at commit time.

Also, I don't see how it refers to the MLO branch of py-graph-imputation, which I thought was required.
To merge the PR, the repo needs to be in a state where a clean build (still) works.

generalized match algorithm for k loci

42a2a1b
