
Create pattern to support multiple output formats#10

Merged
diehlbw merged 12 commits into epic-open-source:main from
diehlbw:bdiehl/output_options
Jul 25, 2025

Conversation

@diehlbw
Contributor

@diehlbw diehlbw commented Jul 15, 2025

Overview

For integration and reuse by external tools, such as medHELM, it is useful to have an accessible resolve_prompt that has instruction sets for both "score-only" (like PDSQI-9) and "score + explanation" (like Summary of Care).
Further, returning these as nested dictionaries {"grade": int, "explanation": str} is more aligned with those tools' existing expectations.

Description of changes

  • Create a method in PDSQI-9 to resolve the instructions, where 2 are replaced with instructions to return {explanation: str, grade: int}. Default behavior remains the published / vetted score-only instruction.
  • Update the two Epic prompts to use this dictionary response instead of a list, with a toggle between the two modes; these default to showing the explanation.
  • Update the eval scoring method to handle either a dictionary response (instead of a list) or a direct value. This removes the argument that specified names for the subcolumns.

Author Checklist

  • Linting passes; run early with pre-commit hook.
  • Tests added for new code and issue being fixed.
  • Added type annotations and full numpy-style docstrings for new methods.
  • Draft your news fragment in new changelog/ISSUE.TYPE.rst files; see changelog/README.md.

DETAIL_INSTRUCTIONS = {
    1: "- Your output must be JSON-formatted, where each key is one of your RUBRIC_SET items (e.g., \"Citation\") and each corresponding value is another dictionary of two key-value pairs: \"explanation\" is a free-text explanation of why your chosen GRADE is the correct grade, and \"grade\" is a single integer representing your respective GRADE that best matches the CLINICAL_SUMMARY for the key's metric.",
    3: "",
    6: '- Your output must be a VALID JSON-formatted string as follows:\n"{"citation": {"explanation": "Your explanation here", "grade": 1}, "accurate": {"explanation": "Your explanation here", "grade": 1}, ...}"',
}


Can we use score instead of grade?

Contributor Author


Updated.
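With the grade-to-score rename applied, a response following the instruction-6 format can be parsed with plain `json.loads`. The literal below is an illustrative response, not output from the instrument:

```python
import json

# Illustrative model response in the instructed format (after the
# "grade" -> "score" rename); metric names are examples only.
raw = (
    '{"citation": {"explanation": "Summary cites the source note", "score": 1},'
    ' "accurate": {"explanation": "No factual errors found", "score": 1}}'
)

parsed = json.loads(raw)
# Collapse the nested dicts to a flat metric -> score mapping.
scores = {metric: entry["score"] for metric, entry in parsed.items()}
```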

@MiguelAFH

Also, just a general question. Is the python 3.10 requirement something that could be discussed to lower to 3.9? HELM uses python 3.9 so it conflicts with evaluation-instruments. This is also in my meeting agenda with the HELM team for tomorrow.

@diehlbw
Copy link
Contributor Author

diehlbw commented Jul 17, 2025

Also, just a general question. Is the python 3.10 requirement something that could be discussed to lower to 3.9? HELM uses python 3.9 so it conflicts with evaluation-instruments. This is also in my meeting agenda with the HELM team for tomorrow.

Can you write this up as an issue? It would not be done in this PR, and I'd like the reasoning to be discoverable there.
My immediate thoughts are an unhelpful "maybe?" I can't think of anything offhand that uses 3.10 features, but two things for us:

  • seismometer consistency: while the integration is currently lacking, the vision is to have them working together; seismometer is also 3.10 and much tougher to argue changing.
  • python EOL: this will be most of the discussion, balancing the benefits of bumping this down against quickly having to come back and move it forward.

@diehlbw diehlbw force-pushed the bdiehl/output_options branch from be80924 to 1d19e6d Compare July 21, 2025 11:11
@diehlbw diehlbw marked this pull request as ready for review July 21, 2025 11:14
@diehlbw diehlbw requested a review from MahmoodEtedadi July 21, 2025 11:14
@MiguelAFH

After talking to the HELM team, this won't be necessary. HELM supports python 3.10 and 3.11.

from typing import Any
import evaluation_instruments.prep as prep

OUTPUT_MODE = "score_only" # Default output mode


I think it makes sense to be consistent with how the enum is used and explained to the user. Here we set it to the string value, but then the resolve_prompt function asks for one of the enum names, not the string values.

Contributor Author


Changed everything to Enum
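A minimal sketch of what the Enum-based toggle might look like; the name `OutputMode`, its members, and the instruction strings are assumptions for illustration, and the real resolve_prompt assembles a full prompt rather than a one-liner:

```python
from enum import Enum


class OutputMode(Enum):
    # Hypothetical enum replacing the bare string constant above.
    SCORE_ONLY = "score_only"
    WITH_EXPLANATION = "with_explanation"


def resolve_prompt(mode: OutputMode = OutputMode.SCORE_ONLY) -> str:
    """Return the instruction text for the requested output mode."""
    if mode is OutputMode.SCORE_ONLY:
        return "Return a single integer score per metric."
    return 'Return {"explanation": str, "score": int} per metric.'
```

Taking the enum member itself (rather than its string value) keeps the API and the documentation describing the same thing, which was the reviewer's point.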

@diehlbw diehlbw merged commit 1c4637e into epic-open-source:main Jul 25, 2025
3 checks passed