Benchmarking large language models for short answer grading in a fine-grained, multi-subject, and human-aligned setting.
automatic-grading education-nlp short-answer-scoring benchmark-evaluation-llms human-aligned-assessment
Updated May 15, 2025 - Python