Heuristic update #361
Codecov Report
❌ Your patch status has failed because the patch coverage (78.94%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
@@ Coverage Diff @@
## main #361 +/- ##
==========================================
+ Coverage 55.49% 56.26% +0.76%
==========================================
Files 45 45
Lines 4303 4319 +16
Branches 391 395 +4
==========================================
+ Hits 2388 2430 +42
+ Misses 1888 1861 -27
- Partials 27 28 +1
Pull request overview
This PR enhances the heuristic used for detecting technology mentions in ordinance documents by making phrase ordering significant and moving the ngrams module to a more appropriate location with added test coverage.
Key changes:
- Phrase matching now uses ordered ngrams instead of unordered word presence, making the heuristic more robust on larger documents
- The ngrams module relocated from `compass/extraction/ngrams.py` to `compass/utilities/ngrams.py` with comprehensive unit tests
- Fixed documentation bug in `sentence_ngram_containment` to correctly indicate it returns `0.0` (not `True`) when test text has no ngrams
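The documentation fix in the last bullet concerns the empty case. A minimal sketch of that behavior follows, assuming containment means the fraction of the test text's ngrams that also occur in the original text. The names `word_ngrams` and `ngram_containment` are hypothetical and are not the functions in `compass/utilities/ngrams.py`, whose signature and tokenization may differ.

```python
import re


def word_ngrams(text, n):
    """Return the ordered word n-grams of ``text`` as a list of tuples."""
    # Hypothetical tokenizer: lowercase alphanumeric runs only.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_containment(original, test, n):
    """Fraction of the n-grams in ``test`` that also occur in ``original``."""
    test_ngrams = word_ngrams(test, n)
    if not test_ngrams:
        # Nothing to look for: report 0.0 rather than a truthy value.
        return 0.0
    original_ngrams = set(word_ngrams(original, n))
    found = sum(1 for gram in test_ngrams if gram in original_ngrams)
    return found / len(test_ngrams)
```

Returning `0.0` keeps the result a plain score in every case, so a caller comparing it against a threshold never treats an empty test text as a match.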
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/python/unit/utilities/test_utilities_ngrams.py | New comprehensive test file for ngrams utilities covering word filtering, ngram conversion, and containment logic |
| tests/python/unit/extraction/test_extraction_validation.py | Added test cases for heuristics with newline characters in phrases to validate robustness |
| compass/validation/content.py | Refactored _count_phrase_matches to use ordered ngrams with caching and pluralization support, improving phrase detection accuracy |
| compass/utilities/ngrams.py | Fixed return value documentation for empty test text case (changed from "Always returns True" to "Returns 0") |
| compass/extraction/apply.py | Updated import statement to reference ngrams from new utilities location |
Update the heuristic to be more robust on larger documents. In particular, phrase ordering now matters, so the heuristic can be applied more reliably to an entire document instead of just a chunk.
This involved moving the ngrams module, which now has tests added as well.
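To illustrate why ordering matters on a whole document, here is a rough sketch comparing an unordered word-presence check with an ordered ngram check. Both functions and the sample text are hypothetical and only approximate the idea; they are not the `_count_phrase_matches` implementation, and they omit its caching and pluralization support.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unordered_match(text, phrase):
    """Assumed older behavior: every phrase word appears somewhere."""
    words = set(tokenize(text))
    return all(word in words for word in tokenize(phrase))


def ordered_match(text, phrase):
    """Ordered check: the phrase words appear consecutively, in order."""
    phrase_words = tuple(tokenize(phrase))
    n = len(phrase_words)
    if n == 0:
        return False
    words = tokenize(text)
    ngrams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return phrase_words in ngrams


# Illustrative text: the phrase words are scattered across unrelated sentences.
document = (
    "The energy code applies to all buildings. "
    "Wind loads are covered in chapter 4. "
    "Drainage systems must be inspected yearly."
)

print(unordered_match(document, "wind energy systems"))
print(ordered_match(document, "wind energy systems"))
```

On a long document, scattered words like these become increasingly likely, so the unordered check reports a match where the ordered one does not. Because the sketch tokenizes on word characters, a phrase split by a newline (as in `"Wind energy\nsystems"`) still matches in order.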