A minimal, eval-only toolkit to teach and demo the differences between:
- ROUGE-1/2/L (n-gram overlap)
- BERTScore (semantic similarity)
- F1 for structured extraction (precision/recall/F1 on sets)
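The set-level precision/recall/F1 used for structured extraction can be sketched as follows (a minimal illustration; the example item strings and the returned dict shape are assumptions, not the repo's exact schema):

```python
def set_f1(gold: set, pred: set) -> dict:
    """Precision/recall/F1 over two sets of extracted items."""
    tp = len(gold & pred)  # items that are both predicted and gold
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

gold = {"metformin 500 mg", "lisinopril 10 mg"}
pred = {"metformin 500 mg", "ibuprofen 600 mg"}
print(set_f1(gold, pred))  # one hit, one miss, one spurious -> all 0.5
```

Adding a spurious medication hurts precision; dropping a gold one hurts recall, which is exactly the behavior the `make eval-fields` experiments below demonstrate.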
Requirements
make setup                    # Python 3.11 venv + ROUGE only
source .venv/bin/activate
make eval-text                # ROUGE + SynonymSim (no Torch)

Run text metrics (ROUGE & BERTScore)
make eval-text
# See outputs/metric_playground_report.json

Run F1 on structured extraction
make eval-fields
# See outputs/field_f1_report.json

Open data/eval/metric_playground.jsonl (each line has a reference and candidate).
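Each line might look like this (only `reference` and `candidate` are stated above; the `id` key and the example text are illustrative assumptions):

```json
{"id": "ex-001", "reference": "Denies fever. Reports shortness of breath.", "candidate": "No fever reported; patient has dyspnea."}
```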
- ROUGE ↑, BERTScore ↓: add unrelated fluff to the candidate.
  e.g., append: "Also start ibuprofen 600 mg every 8 hours."
  Re-run make eval-text → ROUGE often stays high; BERTScore drops.
- ROUGE ↓, BERTScore ↑: paraphrase with synonyms.
  e.g., "shortness of breath" → "dyspnea", "chest pain" → "angina".
  Re-run → ROUGE falls (fewer exact n-grams); BERTScore rises (semantic match).
- Negation flip (semantic mismatch): "Denies fever" → "Has fever".
  Re-run → ROUGE may stay moderate; BERTScore falls.
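To see why the paraphrase experiment moves ROUGE, here is a simplified unigram-overlap ROUGE-1 F-measure (a sketch only; the repo computes ROUGE with a real library, which also handles stemming and ROUGE-2/L):

```python
from collections import Counter

def rouge1_f(reference: str, candidate: str) -> float:
    """ROUGE-1 F-measure: F1 over overlapping unigram counts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if not overlap:
        return 0.0
    p = overlap / sum(cand.values())
    r = overlap / sum(ref.values())
    return 2 * p * r / (p + r)

ref = "patient denies fever and reports shortness of breath"
exact = "patient denies fever and reports shortness of breath"
paraphrase = "patient denies pyrexia and reports dyspnea"
print(rouge1_f(ref, exact))       # 1.0 -- identical surface form
print(rouge1_f(ref, paraphrase))  # ~0.57 -- synonyms break unigram overlap
```

The paraphrase preserves meaning but loses exact tokens, so ROUGE drops while a semantic metric like BERTScore stays high.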
For F1, edit the predicted sets in data/eval/field_extraction_pred.jsonl (add/remove a med or change a dose), then run make eval-fields. You’ll see precision/recall/F1 respond directly to set-level correctness.
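The per-field and macro F1 described above can be sketched like this (the field names and the simple unweighted macro average are illustrative assumptions, not necessarily f1_eval.py's exact scheme):

```python
def field_f1(gold: set, pred: set) -> float:
    """F1 for one field's set of values."""
    if not gold and not pred:
        return 1.0  # both empty: treat as a perfect match
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = {"medications": {"metformin 500 mg"}, "symptoms": {"fever", "cough"}}
pred = {"medications": {"metformin 500 mg"}, "symptoms": {"fever"}}

per_field = {f: field_f1(gold[f], pred[f]) for f in gold}
macro = sum(per_field.values()) / len(per_field)  # unweighted mean over fields
print(per_field, macro)
```

Dropping "cough" from the predictions lowers recall on `symptoms` alone, and the macro score moves by exactly that field's share.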
- data/eval/metric_playground.jsonl — text example pairs for ROUGE/BERTScore
- data/eval/field_extraction_gold.jsonl — gold labels for structured fields
- data/eval/field_extraction_pred.jsonl — predictions file (edit me)
- src/evals/text_metrics.py — computes ROUGE-1/2/L + BERTScore
- src/evals/f1_eval.py — computes per-field and macro F1
- outputs/*.json — metric reports
By default, the playground uses a lightweight semantic fallback (SynonymSim) so you don’t need PyTorch.
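In the same spirit as that fallback, a toy synonym-aware similarity might canonicalize known synonyms before comparing token sets (the synonym table and the Jaccard scoring below are illustrative assumptions, not SynonymSim's actual implementation):

```python
# Hypothetical synonym table: map variant tokens to a canonical form.
SYNONYMS = {"pyrexia": "fever"}

def normalize(text: str) -> set:
    """Lowercase, tokenize, and fold synonyms to canonical tokens."""
    return {SYNONYMS.get(t, t) for t in text.lower().split()}

def synonym_sim(reference: str, candidate: str) -> float:
    """Jaccard similarity over synonym-normalized token sets."""
    ref, cand = normalize(reference), normalize(candidate)
    return len(ref & cand) / len(ref | cand) if ref | cand else 1.0

print(synonym_sim("denies fever", "denies pyrexia"))  # 1.0 after folding
print(synonym_sim("denies fever", "has fever"))       # negation flip scores low
```

A lookup table like this catches whitelisted synonyms without any model download, but unlike BERTScore it cannot generalize beyond the pairs it knows about.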
If you want true BERTScore:
pip install -r requirements-bert.txt
# then:
make eval-text SEMANTIC=bert

If you want to stick to the no-download path, run:
make eval-text                # uses SynonymSim (no torch)

macOS/Linux
# creates a venv with python3.11 and installs deps
make setup # ROUGE + SynonymSim (no torch)
make setup-bert # also installs torch + transformers + bert-score
source .venv/bin/activate
make eval-text # default: SEMANTIC=synonym
make eval-text SEMANTIC=bert  # requires setup-bert

If you don't have Python 3.11:
- macOS:
  brew install python@3.11
- Ubuntu:
  sudo apt-get install -y python3.11 python3.11-venv
Windows (PowerShell)
# uses 'py -3.11' to create the venv and install deps
powershell -ExecutionPolicy Bypass -File scripts\setup.ps1 # lite (no torch)
powershell -ExecutionPolicy Bypass -File scripts\setup.ps1 -Mode bert # with torch
.\.venv\Scripts\Activate.ps1
make eval-text
# or: make eval-text SEMANTIC=bert