Evaluation and Benchmarks: Judging Model Quality
Intuition: no single score captures a model
Evaluating an LLM is like evaluating a person: you would not judge someone solely by their math score. A model may score high on knowledge QA yet perform poorly on code generation, logical reasoning, or multi-turn conversation. Good evaluation requires comprehensive testing across dimensions, scenarios, and difficulty levels.
Engineering view: choose evaluation matched to your task
In practice, evaluation operates at several levels:
- Automatic metrics: Perplexity measures how well a model predicts held-out text; BLEU and ROUGE score outputs against reference answers; code generation uses pass@k; reasoning tasks use accuracy. But automatic metrics often diverge from real user experience.
- LLM-as-a-judge: Use a strong model to grade the outputs of other models. Benchmarks like MT-Bench adopt this approach. It is cheap compared with human evaluation but suffers from position bias (the order in which answers are shown affects the verdict), length bias (longer answers tend to be favored), and self-preference (a judge tends to favor its own outputs).
- Human evaluation: The gold standard, but expensive and slow. Commonly used for alignment assessment, creative writing, and open-ended dialogue.
- Red teaming: Proactively seek failure modes, including jailbreaks, prompt injection, bias, and dangerous content generation.
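As an illustration of the pass@k metric mentioned in the list above, here is a minimal sketch of the unbiased estimator described in the HumanEval paper (chen2021-humaneval): for each problem, draw n samples, count the c that pass the unit tests, and compute 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `mean_pass_at_k` are chosen here for illustration, and the sketch assumes the per-problem pass counts have already been obtained by running the tests.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed the tests
    k: budget of attempts being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset of the n samples
    # contains no passing sample is C(n - c, k) / C(n, k).
    # math.comb returns 0 when k > n - c, which yields 1.0 here.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With 10 samples of which 3 pass, `pass_at_k(10, 3, 1)` is about 0.3, while `pass_at_k(10, 3, 5)` is about 0.917: the same model looks much stronger when it is allowed more attempts, which is why k should always be reported alongside the score.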
Common pitfalls include: training data contamination (test sets appearing in pretraining data), prompt sensitivity (scores changing dramatically with rephrasing), and over-optimizing for a single benchmark, which inflates that score while distorting the model's broader capabilities.
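One rough way to screen for the contamination pitfall is to flag test items that share a long word n-gram with the training corpus. The sketch below is only an illustration under loud assumptions: whitespace tokenization, a corpus small enough to hold in memory, and an arbitrary window of n = 8 words. The names `ngrams` and `contamination_rate` are hypothetical, and exact-match overlap will miss paraphrased or translated leakage.

```python
from __future__ import annotations


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(
    test_items: list[str],
    corpus_docs: list[str],
    n: int = 8,  # assumed window size; tune for your data
) -> tuple[float, list[str]]:
    """Flag test items sharing at least one n-gram with the corpus.

    Returns (fraction of flagged items, the flagged items).
    """
    if not test_items:
        return 0.0, []
    # Index every n-gram that occurs anywhere in the corpus.
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    # An item is flagged if any of its n-grams appears in the index.
    flagged = [item for item in test_items if ngrams(item, n) & corpus_grams]
    return len(flagged) / len(test_items), flagged
```

Items shorter than n words produce no n-grams and are never flagged, so a smaller window may be needed for short-answer benchmarks, at the cost of more false positives.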
Research view: evaluation is science
At the research level, evaluation itself is a scientific question. HELM proposed a holistic evaluation framework emphasizing comprehensive coverage of scenarios, metrics, and goals. Yet a gap remains between benchmarks and real-world deployment: leading on benchmarks does not guarantee usefulness in products.
Open questions include: How should emergent abilities be evaluated? How can the impact of decoding strategies on evaluation results be quantified? How can benchmarks be designed to resist “gaming”? And when models approach or exceed human capabilities, who serves as the judge?
References
- hendrycks2020-mmlu
Covers 57 subjects with about 14K exam questions. Since its release, reporting MMLU scores has become the de facto standard for measuring general LLM capability. It was still a first-line metric in model cards as of 2025; see also the later MMLU-Pro.
- liang2022-helm
Stanford CRFM systematically evaluates 30+ LLMs across multiple metrics (accuracy, robustness, fairness, efficiency, and others), helping to establish evaluation as a science. A representative work arguing against looking only at average scores.
- chen2021-humaneval
Introduces the Codex model and the HumanEval benchmark (164 programming problems). HumanEval remains a routine "vital signs" check for coding models today; the Codex work in this paper also underlies GitHub Copilot.
- zheng2023-mtbench
Proposes GPT-4 as a judge (MT-Bench) alongside crowdsourced human preferences (Chatbot Arena) for evaluating dialogue capability. MT-Bench and Arena Elo remain the community's two de facto standards for comparing models' dialogue capability.