Heuristic update #361

ppinchuk · 2025-12-09T01:32:31Z

Update the heuristic to be more robust on larger documents. In particular, phrase ordering now matters, so the heuristic can be applied more reliably to an entire document instead of just a chunk.

This involved moving the ngrams module, which now has tests added as well.
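
To illustrate why ordering matters on long documents, here is a minimal sketch (not the code in this PR). The function names, the tokenizer regex, and the sample text are all assumptions made for illustration. It compares an unordered word-presence check with an ordered n-gram check.

```python
import re


def to_ngrams(text, n):
    """Return the list of consecutive n-word tuples in ``text``."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def unordered_match(phrase, text):
    """Unordered check: every phrase word appears somewhere in the text."""
    text_words = set(re.findall(r"[a-z0-9]+", text.lower()))
    phrase_words = re.findall(r"[a-z0-9]+", phrase.lower())
    return all(word in text_words for word in phrase_words)


def ordered_match(phrase, text):
    """Ordered check: the phrase words appear consecutively and in order."""
    phrase_words = tuple(re.findall(r"[a-z0-9]+", phrase.lower()))
    if not phrase_words:
        return False
    return phrase_words in set(to_ngrams(text, len(phrase_words)))


# Hypothetical document where the phrase words are scattered across sentences.
doc = (
    "The wind was strong. Later sections cover the energy code "
    "and a drainage system."
)
print(unordered_match("wind energy system", doc))  # True (false positive)
print(ordered_match("wind energy system", doc))    # False
print(ordered_match("wind energy system", "A wind energy\nsystem is permitted."))  # True
```

The longer the document, the more likely it is that scattered words satisfy the unordered check by coincidence, which is the failure mode an ordered match avoids.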

codecov-commenter · 2025-12-09T01:34:04Z

Codecov Report

❌ Patch coverage is 78.94737% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.26%. Comparing base (42b57b7) to head (83b1c7b).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| compass/validation/content.py | 76.47% | 3 Missing and 1 partial ⚠️ |

❌ Your patch status has failed because the patch coverage (78.94%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #361      +/-   ##
==========================================
+ Coverage   55.49%   56.26%   +0.76%     
==========================================
  Files          45       45              
  Lines        4303     4319      +16     
  Branches      391      395       +4     
==========================================
+ Hits         2388     2430      +42     
+ Misses       1888     1861      -27     
- Partials       27       28       +1

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `56.26% <78.94%> (+0.76%)` | ⬆️ |

Copilot

Pull request overview

This PR enhances the heuristic used for detecting technology mentions in ordinance documents by making phrase ordering significant and moving the ngrams module to a more appropriate location with added test coverage.

Key changes:

- Phrase matching now uses ordered ngrams instead of unordered word presence, making the heuristic more robust on larger documents.
- The ngrams module was relocated from compass/extraction/ngrams.py to compass/utilities/ngrams.py, with comprehensive unit tests added.
- Fixed a documentation bug in `sentence_ngram_containment` so that it correctly states the function returns 0.0 (not True) when the test text has no ngrams.
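
The empty-input behavior in the last item can be sketched as follows. This is an illustrative stand-in, not the actual `sentence_ngram_containment` implementation: the names `word_ngrams` and `containment`, the tokenizer, and the assumed semantics (fraction of the test text's n-grams that also appear in the original text) are assumptions.

```python
import re


def word_ngrams(text, n):
    """Return the set of consecutive n-word tuples in ``text``."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(original, test, n=2):
    """Fraction of the test text's n-grams that also occur in ``original``.

    Always returns a float; returns 0.0 when the test text yields no n-grams.
    """
    test_ngrams = word_ngrams(test, n)
    if not test_ngrams:
        # Nothing to compare, so report no containment instead of a boolean.
        return 0.0
    shared = test_ngrams & word_ngrams(original, n)
    return len(shared) / len(test_ngrams)


print(containment("the quick brown fox", "quick brown fox"))  # 1.0
print(containment("the quick brown fox", "quick brown dog"))  # 0.5
print(containment("the quick brown fox", "fox"))              # 0.0
```

Returning a float in every branch keeps callers from having to special-case a boolean result when comparing against a threshold.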

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| tests/python/unit/utilities/test_utilities_ngrams.py | New comprehensive test file for ngrams utilities covering word filtering, ngram conversion, and containment logic |
| tests/python/unit/extraction/test_extraction_validation.py | Added test cases for heuristics with newline characters in phrases to validate robustness |
| compass/validation/content.py | Refactored `_count_phrase_matches` to use ordered ngrams with caching and pluralization support, improving phrase detection accuracy |
| compass/utilities/ngrams.py | Fixed return value documentation for empty test text case (changed from "Always returns True" to "Returns 0") |
| compass/extraction/apply.py | Updated import statement to reference ngrams from new utilities location |
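
As a rough sketch of how ordered ngram matching might be combined with caching and pluralization (this is not the actual `_count_phrase_matches` code), consider the example below. The helper names, the `lru_cache` size, and the naive trailing-"s" plural rule are all assumptions made for illustration.

```python
import re
from functools import lru_cache


def _words(text):
    """Lowercase and split on non-alphanumerics, so newlines act as spaces."""
    return tuple(re.findall(r"[a-z0-9]+", text.lower()))


@lru_cache(maxsize=32)
def _text_ngrams(text, n):
    """Cache the n-gram set per (text, n) so repeated lookups reuse it."""
    words = _words(text)
    return frozenset(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def _variants(phrase):
    """Yield the phrase words and a naive plural of the last word."""
    words = _words(phrase)
    if not words:
        return
    yield words
    if not words[-1].endswith("s"):
        yield words[:-1] + (words[-1] + "s",)


def count_phrase_matches(phrases, text):
    """Count phrases that occur in order in ``text`` (singular or plural)."""
    return sum(
        any(variant in _text_ngrams(text, len(variant)) for variant in _variants(phrase))
        for phrase in phrases
    )


# Hypothetical text with a newline inside one of the phrases.
text = "Wind turbines and solar\nenergy systems are regulated here."
phrases = ["wind turbine", "solar energy system", "turbine wind"]
print(count_phrase_matches(phrases, text))  # 2
```

Caching per `(text, n)` means each ngram length is computed once per document, however many phrases of that length are checked against it.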

ppinchuk added 5 commits December 8, 2025 16:41

Move file

50ef2fa

Always return float

dc4e6df

Use ngrams to check if phrase is in text

af8852c

Add extra test cases

520e46c

Add tests for ngrams

83b1c7b

ppinchuk added this to the Infrastructure and accuracy improvements milestone Dec 9, 2025

ppinchuk self-assigned this Dec 9, 2025

Copilot AI review requested due to automatic review settings December 9, 2025 01:32

ppinchuk added the `enhancement` label (update to logic or general code improvements) Dec 9, 2025

ppinchuk requested a review from castelao as a code owner December 9, 2025 01:32

ppinchuk added the `refactor` (code improvements that do not change functionality), `topic-python-general` (issues/pull requests related to python), and `p-high` (priority: high) labels Dec 9, 2025

Copilot started reviewing on behalf of ppinchuk December 9, 2025 01:33

Copilot AI reviewed Dec 9, 2025

castelao approved these changes Dec 11, 2025

ppinchuk merged commit 55c68f8 into main Dec 11, 2025
24 checks passed

ppinchuk deleted the pp/heuristic_update branch December 11, 2025 18:49
