feat: Add evaluation #247

rmitsch · 2025-12-26T11:54:08Z

Description

Adds evaluation capabilities to predictive tasks.

Related Issues

-

Changes Made

Adds evaluation capabilities (precise or with LLM judges) to all predictive tasks.
Standardizes on "score" instead of "confidence" in terminology and prompts.
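
For readers unfamiliar with the two evaluation modes named above, here is a minimal sketch of how exact ("precise") scoring and judge-based scoring can differ. It is an illustration only: `exact_match_score`, `judged_score`, `toy_judge`, and the `Judge` alias are hypothetical names, not the API this PR adds, and the token-overlap judge is merely a stand-in for a real LLM judge.

```python
from collections.abc import Callable, Sequence

# All names below are hypothetical and are not taken from the sieves codebase.
# A judge maps (prediction, reference) to a score, ideally in [0, 1].
Judge = Callable[[str, str], float]


def exact_match_score(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Return the fraction of predictions that equal their gold label exactly."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    matches = sum(pred == ref for pred, ref in zip(predictions, gold))
    return matches / len(gold)


def judged_score(predictions: Sequence[str], gold: Sequence[str], judge: Judge) -> float:
    """Return the mean per-example judge score, clamped to [0, 1]."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    scores = [min(1.0, max(0.0, judge(pred, ref))) for pred, ref in zip(predictions, gold)]
    return sum(scores) / len(scores)


def toy_judge(prediction: str, reference: str) -> float:
    """Stand-in for an LLM judge: Jaccard overlap of lowercased tokens."""
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    if not pred_tokens and not ref_tokens:
        return 1.0
    return len(pred_tokens & ref_tokens) / len(pred_tokens | ref_tokens)
```

Exact matching suits tasks with closed label sets such as classification, while a judge is more forgiving for free-text outputs such as summaries or translations, where several phrasings can be equally correct.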

Checklist

Tests have been extended to cover changes in functionality
Existing and new tests succeed
Documentation updated (if applicable)
Related issues linked

Screenshots/Examples (if applicable)

codecov · 2025-12-26T12:26:06Z

Codecov Report

❌ Patch coverage is 87.79221% with 47 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
sieves/tasks/predictive/classification/core.py	83.95%	13 Missing ⚠️
sieves/tasks/predictive/core.py	84.00%	8 Missing ⚠️
sieves/tasks/predictive/ner/core.py	83.33%	5 Missing ⚠️
sieves/tasks/predictive/evaluation.py	80.00%	4 Missing ⚠️
...es/tasks/predictive/information_extraction/core.py	91.83%	4 Missing ⚠️
sieves/tasks/predictive/schemas/classification.py	70.00%	3 Missing ⚠️
sieves/tasks/predictive/sentiment_analysis/core.py	90.00%	3 Missing ⚠️
sieves/pipeline/core.py	83.33%	2 Missing ⚠️
sieves/tasks/predictive/pii_masking/core.py	92.59%	2 Missing ⚠️
...ieves/tasks/predictive/relation_extraction/core.py	92.59%	2 Missing ⚠️
... and 1 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #247      +/-   ##
==========================================
- Coverage   92.80%   92.70%   -0.11%     
==========================================
  Files          78       79       +1     
  Lines        4087     4413     +326     
==========================================
+ Hits         3793     4091     +298     
- Misses        294      322      +28
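
The totals in the diff above can be cross-checked with a few lines of arithmetic. The sketch below assumes the percentages are rounded down to two decimals, which is consistent with the reported figures but is an assumption rather than something the report states; `floor_to` and `coverage` are hypothetical helper names.

```python
import math


def floor_to(value: float, decimals: int = 2) -> float:
    """Round a value down (toward negative infinity) to the given number of decimals."""
    factor = 10 ** decimals
    return math.floor(value * factor) / factor


def coverage(hits: int, lines: int) -> float:
    """Return coverage as a percentage of covered lines."""
    return 100.0 * hits / lines


# Figures copied from the coverage diff above.
base_lines, base_hits = 4087, 3793
head_lines, head_hits = 4413, 4091

# Misses are simply the lines that were not hit.
base_misses = base_lines - base_hits
head_misses = head_lines - head_hits

base_cov = coverage(base_hits, base_lines)
head_cov = coverage(head_hits, head_lines)

# The delta is taken on the unrounded percentages and then rounded down.
delta = floor_to(head_cov - base_cov)

print(base_misses, head_misses)
print(floor_to(base_cov), floor_to(head_cov), delta)
```

Under that rounding assumption the script reproduces the 294 and 322 misses and the -0.11 change in project coverage.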

Files with missing lines	Coverage Δ
sieves/data/doc.py	`96.22% <100.00%> (+0.07%)`	⬆️
sieves/tasks/predictive/classification/bridges.py	`94.40% <100.00%> (ø)`
sieves/tasks/predictive/question_answering/core.py	`94.36% <100.00%> (+0.92%)`	⬆️
sieves/tasks/predictive/schemas/summarization.py	`100.00% <100.00%> (ø)`
...ves/tasks/predictive/sentiment_analysis/bridges.py	`98.37% <100.00%> (ø)`
sieves/tasks/predictive/summarization/core.py	`96.61% <100.00%> (+0.69%)`	⬆️
sieves/tasks/predictive/translation/core.py	`96.61% <100.00%> (+0.69%)`	⬆️
sieves/tasks/core.py	`96.00% <75.00%> (-1.83%)`	⬇️
sieves/pipeline/core.py	`92.64% <83.33%> (-0.91%)`	⬇️
sieves/tasks/predictive/pii_masking/core.py	`95.69% <92.59%> (-1.36%)`	⬇️
... and 8 more

Raphael Mitsch added 6 commits December 26, 2025 00:19

feat: Add evaluation support for all predictive tasks.

b16fd91

docs: Update docs w.r.t. evaluation feature.

04d661a

fix: Fix evaluation. chore: Update case study to use inbuilt evaluation.

a640137

chore: Update crisis tweets demo.

0f06274

docs: Update docs w.r.t. eval data format.

086dc5a

fix: Fix tests.

cac0f0e

rmitsch self-assigned this Dec 26, 2025

Raphael Mitsch added 2 commits December 26, 2025 13:18

refactor: Assert task results in _compute_metrics().

ae64250

chore: Clean up.

92552a5

Raphael Mitsch added 2 commits December 26, 2025 13:30

chore: Cleanup.

31ae361

chore: Comment cleanup. test: Allow more test flakiness for classification task.

836c8c0

rmitsch marked this pull request as ready for review December 26, 2025 12:55

rmitsch merged commit b17c503 into main Dec 26, 2025
3 checks passed

rmitsch deleted the feature/evaluate branch December 26, 2025 12:55
