
Create pattern to support multiple output formats#10

Merged
diehlbw merged 12 commits into epic-open-source:main from
diehlbw:bdiehl/output_options
Jul 25, 2025

Conversation

@diehlbw
Contributor

@diehlbw diehlbw commented Jul 15, 2025

Overview

For integration and reuse by external tools, such as medHELM, it is useful to have an accessible resolve_prompt that has instruction sets for both "score-only" (like PDSQI-9) and "score + explanation" (like Summary of Care).
Further, returning these as nested dictionaries {"grade": int, "explanation": str} is more aligned with those tools' existing expectations.

Description of changes

  • Create a method in PDSQI-9 to resolve the instructions, where 2 are replaced with instructions to return {explanation: str, grade: int}. Default behavior remains the published / vetted score-only instruction.
  • Update the two Epic prompts to use this dictionary response instead of a list, with a toggle between the two modes; these default to showing the explanation.
  • Update the eval scoring method to handle either a dictionary response (instead of a list) or a direct value. This removes the argument that specified names for the subcolumns.

Author Checklist

  • Linting passes; run early with pre-commit hook.
  • Tests added for new code and issue being fixed.
  • Added type annotations and full numpy-style docstrings for new methods.
  • Draft your news fragment in new changelog/ISSUE.TYPE.rst files; see changelog/README.md.

DETAIL_INSTRUCTIONS = {
    1: "- Your output must be JSON-formatted, where each key is one of your RUBRIC_SET items (e.g., \"Citation\") and each corresponding value is another dictionary of two key-value pairs: \"explanation\" is a free-text explanation of why your chosen GRADE is the correct grade, and \"grade\" is a single integer representing your respective GRADE that best matches the CLINICAL_SUMMARY for the key's metric.",
    3: "",
    6: '- Your output must be a VALID JSON-formatted string as follows:\n"{"citation": {"explanation": "Your explanation here", "grade": 1}, "accurate": {"explanation": "Your explanation here", "grade": 1}, ...}"',
}


Can we use score instead of grade?

Contributor Author


Updated.
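With the grade-to-score rename applied, a response following the instruction-6 format can be parsed with plain `json.loads`. The literal below is an illustrative response, not output from the instrument:

```python
import json

# Illustrative model response in the instructed format (after the
# "grade" -> "score" rename); metric names are examples only.
raw = (
    '{"citation": {"explanation": "Summary cites the source note", "score": 1},'
    ' "accurate": {"explanation": "No factual errors found", "score": 1}}'
)

parsed = json.loads(raw)
# Collapse the nested dicts to a flat metric -> score mapping.
scores = {metric: entry["score"] for metric, entry in parsed.items()}
```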

@MiguelAFH

Also, just a general question. Is the python 3.10 requirement something that could be discussed to lower to 3.9? HELM uses python 3.9 so it conflicts with evaluation-instruments. This is also in my meeting agenda with the HELM team for tomorrow.

@diehlbw
Copy link
Contributor Author

diehlbw commented Jul 17, 2025

Also, just a general question. Is the python 3.10 requirement something that could be discussed to lower to 3.9? HELM uses python 3.9 so it conflicts with evaluation-instruments. This is also in my meeting agenda with the HELM team for tomorrow.

Can you write this up as an issue? It would not be done in this PR, and I'd like the reasoning to be discoverable there.
My immediate thoughts are an unhelpful "maybe?" I can't think of anything offhand that uses 3.10 features, but two things for us:

  • seismometer consistency: while the integration is currently lacking, the vision is to have them working together; seismometer is also 3.10 and much tougher to argue changing.
  • python EOL: this will be most of the discussion, balancing the benefits of bumping this down against quickly having to come back and move it forward.

@diehlbw diehlbw force-pushed the bdiehl/output_options branch from be80924 to 1d19e6d Compare July 21, 2025 11:11
@diehlbw diehlbw marked this pull request as ready for review July 21, 2025 11:14
@diehlbw diehlbw requested a review from MahmoodEtedadi July 21, 2025 11:14
@MiguelAFH

After talking to the HELM team, this won't be necessary. HELM supports python 3.10 and 3.11.

from typing import Any
import evaluation_instruments.prep as prep

OUTPUT_MODE = "score_only" # Default output mode


I think it makes sense to be consistent with how the enum is used and explained to the user. Here we set it to the string value, but then the resolve_prompt function asks for one of the enum names, not the string values.

Contributor Author


Changed everything to Enum
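A minimal sketch of what the Enum-based toggle might look like; the name `OutputMode`, its members, and the instruction strings are assumptions for illustration, and the real resolve_prompt assembles a full prompt rather than a one-liner:

```python
from enum import Enum


class OutputMode(Enum):
    # Hypothetical enum replacing the bare string constant above.
    SCORE_ONLY = "score_only"
    WITH_EXPLANATION = "with_explanation"


def resolve_prompt(mode: OutputMode = OutputMode.SCORE_ONLY) -> str:
    """Return the instruction text for the requested output mode."""
    if mode is OutputMode.SCORE_ONLY:
        return "Return a single integer score per metric."
    return 'Return {"explanation": str, "score": int} per metric.'
```

Taking the enum member itself (rather than its string value) keeps the API and the documentation describing the same thing, which was the reviewer's point.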

@diehlbw diehlbw merged commit 1c4637e into epic-open-source:main Jul 25, 2025
3 checks passed