
LLM Evals — Metrics-Only Starter

A minimal, metrics-only toolkit to teach and demo the differences between:

  • ROUGE-1/2/L (n-gram overlap)
  • BERTScore (semantic similarity)
  • F1 for structured extraction (precision/recall/F1 on sets)
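
As a quick intuition for why these can disagree, ROUGE-1 is essentially clipped unigram overlap. The toy function below (illustrative only, not the toolkit's implementation) shows why a negation flip can still score around 0.5:

# Toy ROUGE-1 as clipped unigram-overlap F1 -- illustrative only, not the toolkit's code
def rouge1_f(reference: str, candidate: str) -> float:
    ref, cand = reference.lower().split(), candidate.lower().split()
    overlap = sum(min(ref.count(tok), cand.count(tok)) for tok in set(cand))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge1_f("Denies fever", "Has fever"))  # 0.5 -- the meaning flipped, but half the unigrams still match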

Quickstart

Requirements

Python 3.11 on your PATH (see "Auto-use Python 3.11" below if you need to install it).

make setup          # 3.11 venv + rouge only
source .venv/bin/activate
make eval-text      # ROUGE + SynonymSim (no Torch)

Run text metrics (ROUGE & BERTScore)

make eval-text
# See outputs/metric_playground_report.json

Run F1 on structured extraction

make eval-fields
# See outputs/field_f1_report.json

How to demo metric trade-offs

Open data/eval/metric_playground.jsonl (each line has a reference and candidate).
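
A line looks roughly like this (the reference/candidate keys follow the description above; the wording is just an example):

{"reference": "Denies fever or chills. Advised rest and fluids.", "candidate": "Patient reports no fever or chills; recommended rest and hydration."}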

  • ROUGE ↑, BERTScore ↓: add unrelated fluff to the candidate
    e.g., append: "Also start ibuprofen 600 mg every 8 hours."
    Re-run make eval-text → ROUGE often stays high; BERTScore drops. (A scripted version of this edit is sketched after this list.)

  • ROUGE ↓, BERTScore ↑: paraphrase with synonyms
    e.g., "shortness of breath" → "dyspnea", "chest pain" → "angina".
    Re-run → ROUGE falls (fewer exact n-grams); BERTScore rises (semantic match).

  • Negation flip (semantic mismatch): "Denies fever" → "Has fever".
    Re-run → ROUGE may stay moderate; BERTScore falls.
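
To script the fluff-injection demo instead of editing the file by hand, a throwaway snippet along these lines works (it assumes the JSONL field is named candidate, as in the sample line above, and writes a perturbed copy for you to evaluate):

# Append unrelated fluff to every candidate so ROUGE stays high while the semantics drift.
# Assumes a "candidate" field; writes a perturbed copy rather than overwriting the original.
import json
from pathlib import Path

src = Path("data/eval/metric_playground.jsonl")
dst = Path("data/eval/metric_playground_fluff.jsonl")

with src.open() as fin, dst.open("w") as fout:
    for line in fin:
        row = json.loads(line)
        row["candidate"] = row["candidate"].rstrip() + " Also start ibuprofen 600 mg every 8 hours."
        fout.write(json.dumps(row) + "\n")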

For F1, edit the predicted sets in data/eval/field_extraction_pred.jsonl (add/remove a med or change a dose), then run make eval-fields. You’ll see precision/recall/F1 respond directly to set-level correctness.
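
Conceptually the scoring is plain set arithmetic. Here is a minimal sketch of per-field precision/recall/F1 with a macro average (illustrative only, not the code in src/evals/f1_eval.py; the field names and values are made up):

# Minimal sketch of per-field set precision/recall/F1 plus a macro average -- illustrative only
def prf(gold: set, pred: set):
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {"medications": {"lisinopril 10 mg", "metformin 500 mg"}, "allergies": {"penicillin"}}
pred = {"medications": {"lisinopril 10 mg", "metformin 1000 mg"}, "allergies": {"penicillin"}}

per_field = {field: prf(gold[field], pred[field]) for field in gold}
macro_f1 = sum(f1 for _, _, f1 in per_field.values()) / len(per_field)
print(per_field)  # the wrong metformin dose pulls medications F1 down to 0.5
print(macro_f1)   # 0.75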


Files

  • data/eval/metric_playground.jsonl — text example pairs for ROUGE/BERTScore
  • data/eval/field_extraction_gold.jsonl — gold labels for structured fields
  • data/eval/field_extraction_pred.jsonl — edit-me predictions file
  • src/evals/text_metrics.py — computes ROUGE-1/2/L + BERTScore
  • src/evals/f1_eval.py — computes per-field and macro F1
  • outputs/*.json — metric reports

Optional: Enable true BERTScore (requires PyTorch)

By default, the playground uses a lightweight semantic fallback (SynonymSim) so you don’t need PyTorch.
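
The actual fallback lives in src/evals/text_metrics.py; the idea is roughly synonym normalization followed by token overlap, as in this sketch (the synonym table and the Jaccard scoring here are illustrative, not the toolkit's exact logic):

# Illustrative torch-free "semantic" similarity: map synonyms onto a shared form, then token overlap
SYNONYMS = {"dyspnea": "shortness of breath", "angina": "chest pain"}  # example entries only

def normalize(text: str) -> set:
    text = text.lower()
    for term, canonical in SYNONYMS.items():
        text = text.replace(term, canonical)
    return set(text.split())

def synonym_sim(reference: str, candidate: str) -> float:
    ref, cand = normalize(reference), normalize(candidate)
    return len(ref & cand) / len(ref | cand) if ref | cand else 0.0  # Jaccard overlap

print(synonym_sim("Reports shortness of breath", "Reports dyspnea"))  # 1.0 after normalization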

If you want true BERTScore:

pip install -r requirements-bert.txt
# then:
make eval-text SEMANTIC=bert
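
For reference, the underlying bert-score package can also be called directly from Python. A minimal example (the model and options the toolkit actually passes may differ):

# Direct use of the bert-score package (needs torch); downloads a pretrained model on first run
from bert_score import score

cands = ["Patient reports dyspnea and denies chest pain."]
refs = ["Patient reports shortness of breath and denies chest pain."]
P, R, F1 = score(cands, refs, lang="en")
print(F1.mean().item())  # high for close paraphrases, lower for negation flips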

If you want to stick to the no-download path, run:

make eval-text           # uses SynonymSim (no torch)

Auto-use Python 3.11 (avoids NumPy/Torch version issues)

macOS/Linux

# creates a venv with python3.11 and installs deps
make setup          # ROUGE + SynonymSim (no torch)
make setup-bert     # also installs torch + transformers + bert-score
source .venv/bin/activate
make eval-text              # default: SEMANTIC=synonym
make eval-text SEMANTIC=bert  # requires setup-bert

If you don't have Python 3.11:

  • macOS: brew install python@3.11
  • Ubuntu: sudo apt-get install -y python3.11 python3.11-venv

Windows (PowerShell)

# uses 'py -3.11' to create the venv and install deps
powershell -ExecutionPolicy Bypass -File scripts\setup.ps1          # lite (no torch)
powershell -ExecutionPolicy Bypass -File scripts\setup.ps1 -Mode bert  # with torch
.\.venv\Scripts\Activate.ps1
make eval-text
# or: make eval-text SEMANTIC=bert
