Collaborator

@ppinchuk ppinchuk commented Dec 9, 2025

Update the heuristic to be more robust on larger documents. In particular, phrase ordering now matters, so the heuristic can be applied more reliably to an entire document instead of just a chunk.

This involved moving the ngrams module, which now has tests added as well.

@ppinchuk ppinchuk self-assigned this Dec 9, 2025
Copilot AI review requested due to automatic review settings December 9, 2025 01:32
@ppinchuk ppinchuk added the enhancement label Dec 9, 2025
@ppinchuk ppinchuk requested a review from castelao as a code owner December 9, 2025 01:32
@ppinchuk ppinchuk added refactor Code improvements that do not change functionality topic-python-general Issues/pull requests related to python p-high Priority: high labels Dec 9, 2025
@codecov-commenter

Codecov Report

❌ Patch coverage is 78.94737% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.26%. Comparing base (42b57b7) to head (83b1c7b).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| compass/validation/content.py | 76.47% | 3 Missing and 1 partial ⚠️ |

❌ Your patch status has failed because the patch coverage (78.94%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #361      +/-   ##
==========================================
+ Coverage   55.49%   56.26%   +0.76%     
==========================================
  Files          45       45              
  Lines        4303     4319      +16     
  Branches      391      395       +4     
==========================================
+ Hits         2388     2430      +42     
+ Misses       1888     1861      -27     
- Partials       27       28       +1     

| Flag | Coverage Δ | |
|---|---|---|
| unittests | 56.26% <78.94%> (+0.76%) | ⬆️ |


Contributor

Copilot AI left a comment


Pull request overview

This PR enhances the heuristic used for detecting technology mentions in ordinance documents by making phrase ordering significant and moving the ngrams module to a more appropriate location with added test coverage.

Key changes:

  • Phrase matching now uses ordered ngrams instead of unordered word presence, making the heuristic more robust on larger documents
  • The ngrams module was relocated from compass/extraction/ngrams.py to compass/utilities/ngrams.py and now has comprehensive unit tests
  • Fixed a documentation bug in sentence_ngram_containment: the docstring now correctly states that it returns 0.0 (not True) when the test text has no ngrams
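
To illustrate the first key change, here is a minimal sketch of why ordered ngram matching is more reliable than unordered word presence on a whole document. It is not the code from compass/validation/content.py: the helper names (`words`, `ngrams`, `unordered_match`, `ordered_match`), the tokenization, and the sample texts are all assumptions made for this example.

```python
import re


def words(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return the set of all contiguous n-token tuples, preserving order."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def unordered_match(phrase, text):
    """Word-presence check: every phrase word appears somewhere in the text."""
    return set(words(phrase)) <= set(words(text))


def ordered_match(phrase, text):
    """Order-aware check: the phrase appears as one contiguous ngram."""
    phrase_tokens = tuple(words(phrase))
    if not phrase_tokens:
        return False
    return phrase_tokens in ngrams(words(text), len(phrase_tokens))


phrase = "wind energy conversion systems"

# Hypothetical document where the phrase words are scattered across sentences.
scattered = (
    "Wind damage to fences. Solar energy conversion is addressed later. "
    "Drainage systems apply."
)

# Hypothetical document where the phrase is present but split by a newline.
wrapped = "A separate section regulates wind\nenergy conversion systems."

print(unordered_match(phrase, scattered))  # word presence alone matches
print(ordered_match(phrase, scattered))    # ordered ngrams do not
print(ordered_match(phrase, wrapped))      # newline does not break the match
```

On a single chunk, scattered words rarely add up to a false match, but across an entire document almost any common word shows up somewhere, which is why ordering matters more as the input grows.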

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| tests/python/unit/utilities/test_utilities_ngrams.py | New comprehensive test file for ngrams utilities covering word filtering, ngram conversion, and containment logic |
| tests/python/unit/extraction/test_extraction_validation.py | Added test cases for heuristics with newline characters in phrases to validate robustness |
| compass/validation/content.py | Refactored `_count_phrase_matches` to use ordered ngrams with caching and pluralization support, improving phrase detection accuracy |
| compass/utilities/ngrams.py | Fixed return value documentation for empty test text case (changed from "Always returns True" to "Returns 0") |
| compass/extraction/apply.py | Updated import statement to reference ngrams from new utilities location |
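
The documentation fix for the empty test text case can be pictured with a small stand-in for a containment measure. This is a sketch under assumptions, not the implementation in compass/utilities/ngrams.py: `word_ngrams` and `ngram_containment` are hypothetical names, and the real signature of `sentence_ngram_containment` is not shown in this conversation.

```python
import re


def word_ngrams(text, n):
    """Return the set of contiguous n-word tuples in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(original, test, n):
    """Fraction of the test text's ngrams that also occur in the original."""
    test_ngrams = word_ngrams(test, n)
    if not test_ngrams:
        # Nothing to compare, so report 0.0 rather than a truthy value.
        return 0.0
    original_ngrams = word_ngrams(original, n)
    return len(test_ngrams & original_ngrams) / len(test_ngrams)


original = "wind energy conversion systems"
print(ngram_containment(original, "wind energy conversion", 2))
print(ngram_containment(original, "", 2))
print(ngram_containment(original, "wind", 2))
```

Returning a float in every branch keeps the result comparable against a numeric threshold, including when the test text is empty or shorter than `n` words.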

@ppinchuk ppinchuk merged commit 55c68f8 into main Dec 11, 2025
24 checks passed
@ppinchuk ppinchuk deleted the pp/heuristic_update branch December 11, 2025 18:49

Labels

  • enhancement: Update to logic or general code improvements
  • p-high: Priority: high
  • refactor: Code improvements that do not change functionality
  • topic-python-general: Issues/pull requests related to python


4 participants