
LLM Evals — Metrics-Only Starter

A minimal, metrics-only toolkit to teach and demo the differences between:

  • ROUGE-1/2/L (n-gram overlap)
  • BERTScore (semantic similarity)
  • F1 for structured extraction (precision/recall/F1 on sets)
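
As a quick intuition for why these can disagree, ROUGE-1 is essentially clipped unigram overlap. The toy function below (illustrative only, not the toolkit's implementation) shows why a negation flip can still score around 0.5:

# Toy ROUGE-1 as clipped unigram-overlap F1 -- illustrative only, not the toolkit's code
def rouge1_f(reference: str, candidate: str) -> float:
    ref, cand = reference.lower().split(), candidate.lower().split()
    overlap = sum(min(ref.count(tok), cand.count(tok)) for tok in set(cand))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge1_f("Denies fever", "Has fever"))  # 0.5 -- the meaning flipped, but half the unigrams still match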

Quickstart

Requirements

Python 3.11 on your PATH (see "Auto-use Python 3.11" below if you need to install it).

make setup          # 3.11 venv + rouge only
source .venv/bin/activate
make eval-text      # ROUGE + SynonymSim (no Torch)

Run text metrics (ROUGE & BERTScore)

make eval-text
# See outputs/metric_playground_report.json

Run F1 on structured extraction

make eval-fields
# See outputs/field_f1_report.json

How to demo metric trade-offs

Open data/eval/metric_playground.jsonl (each line has a reference and candidate).
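
A line looks roughly like this (the reference/candidate keys follow the description above; the wording is just an example):

{"reference": "Denies fever or chills. Advised rest and fluids.", "candidate": "Patient reports no fever or chills; recommended rest and hydration."}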

  • ROUGE ↑, BERTScore ↓: add unrelated fluff to the candidate
    e.g., append: "Also start ibuprofen 600 mg every 8 hours."
    Re-run make eval-text → ROUGE often stays high; BERTScore drops. (A scripted version of this edit is sketched after this list.)

  • ROUGE ↓, BERTScore ↑: paraphrase with synonyms
    e.g., "shortness of breath" → "dyspnea", "chest pain" → "angina".
    Re-run → ROUGE falls (fewer exact n-grams); BERTScore rises (semantic match).

  • Negation flip (semantic mismatch): "Denies fever" → "Has fever".
    Re-run → ROUGE may stay moderate; BERTScore falls.
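
To script the fluff-injection demo instead of editing the file by hand, a throwaway snippet along these lines works (it assumes the JSONL field is named candidate, as in the sample line above, and writes a perturbed copy for you to evaluate):

# Append unrelated fluff to every candidate so ROUGE stays high while the semantics drift.
# Assumes a "candidate" field; writes a perturbed copy rather than overwriting the original.
import json
from pathlib import Path

src = Path("data/eval/metric_playground.jsonl")
dst = Path("data/eval/metric_playground_fluff.jsonl")

with src.open() as fin, dst.open("w") as fout:
    for line in fin:
        row = json.loads(line)
        row["candidate"] = row["candidate"].rstrip() + " Also start ibuprofen 600 mg every 8 hours."
        fout.write(json.dumps(row) + "\n")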

For F1, edit the predicted sets in data/eval/field_extraction_pred.jsonl (add/remove a med or change a dose), then run make eval-fields. You’ll see precision/recall/F1 respond directly to set-level correctness.
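
Conceptually the scoring is plain set arithmetic. Here is a minimal sketch of per-field precision/recall/F1 with a macro average (illustrative only, not the code in src/evals/f1_eval.py; the field names and values are made up):

# Minimal sketch of per-field set precision/recall/F1 plus a macro average -- illustrative only
def prf(gold: set, pred: set):
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {"medications": {"lisinopril 10 mg", "metformin 500 mg"}, "allergies": {"penicillin"}}
pred = {"medications": {"lisinopril 10 mg", "metformin 1000 mg"}, "allergies": {"penicillin"}}

per_field = {field: prf(gold[field], pred[field]) for field in gold}
macro_f1 = sum(f1 for _, _, f1 in per_field.values()) / len(per_field)
print(per_field)  # the wrong metformin dose pulls medications F1 down to 0.5
print(macro_f1)   # 0.75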


Files

  • data/eval/metric_playground.jsonl — text example pairs for ROUGE/BERTScore
  • data/eval/field_extraction_gold.jsonl — gold labels for structured fields
  • data/eval/field_extraction_pred.jsonl — edit-me predictions file
  • src/evals/text_metrics.py — computes ROUGE-1/2/L + BERTScore
  • src/evals/f1_eval.py — computes per-field and macro F1
  • outputs/*.json — metric reports

Optional: Enable true BERTScore (requires PyTorch)

By default, the playground uses a lightweight semantic fallback (SynonymSim) so you don’t need PyTorch.
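
The actual fallback lives in src/evals/text_metrics.py; the idea is roughly synonym normalization followed by token overlap, as in this sketch (the synonym table and the Jaccard scoring here are illustrative, not the toolkit's exact logic):

# Illustrative torch-free "semantic" similarity: map synonyms onto a shared form, then token overlap
SYNONYMS = {"dyspnea": "shortness of breath", "angina": "chest pain"}  # example entries only

def normalize(text: str) -> set:
    text = text.lower()
    for term, canonical in SYNONYMS.items():
        text = text.replace(term, canonical)
    return set(text.split())

def synonym_sim(reference: str, candidate: str) -> float:
    ref, cand = normalize(reference), normalize(candidate)
    return len(ref & cand) / len(ref | cand) if ref | cand else 0.0  # Jaccard overlap

print(synonym_sim("Reports shortness of breath", "Reports dyspnea"))  # 1.0 after normalization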

If you want true BERTScore:

pip install -r requirements-bert.txt
# then:
make eval-text SEMANTIC=bert
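
For reference, the underlying bert-score package can also be called directly from Python. A minimal example (the model and options the toolkit actually passes may differ):

# Direct use of the bert-score package (needs torch); downloads a pretrained model on first run
from bert_score import score

cands = ["Patient reports dyspnea and denies chest pain."]
refs = ["Patient reports shortness of breath and denies chest pain."]
P, R, F1 = score(cands, refs, lang="en")
print(F1.mean().item())  # high for close paraphrases, lower for negation flips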

If you want to stick to the no-download path, run:

make eval-text           # uses SynonymSim (no torch)

Auto-use Python 3.11 (avoids NumPy/Torch version issues)

macOS/Linux

# creates a venv with python3.11 and installs deps
make setup          # ROUGE + SynonymSim (no torch)
make setup-bert     # also installs torch + transformers + bert-score
source .venv/bin/activate
make eval-text              # default: SEMANTIC=synonym
make eval-text SEMANTIC=bert  # requires setup-bert

If you don't have Python 3.11:

  • macOS: brew install python@3.11
  • Ubuntu: sudo apt-get install -y python3.11 python3.11-venv

Windows (PowerShell)

# uses 'py -3.11' to create the venv and install deps
powershell -ExecutionPolicy Bypass -File scripts\setup.ps1          # lite (no torch)
powershell -ExecutionPolicy Bypass -File scripts\setup.ps1 -Mode bert  # with torch
.\.venv\Scripts\Activate.ps1
make eval-text
# or: make eval-text SEMANTIC=bert
