Evaluation and Benchmarks: Judging Model Quality

Intuition: no single score captures a model

Evaluating an LLM is like evaluating a person: you would not judge someone solely by their math score. A model may score high on knowledge QA yet perform poorly on code generation, logical reasoning, or multi-turn conversation. Good evaluation requires comprehensive testing across dimensions, scenarios, and difficulty levels.

Engineering view: choose evaluation matched to your task

In practice, evaluation operates at several levels:

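  • Automatic metrics: Perplexity measures how well a model predicts held-out text; BLEU and ROUGE score outputs against reference answers; code generation uses pass@k; reasoning tasks use accuracy. Automatic metrics are cheap and reproducible, but they often diverge from real user experience.
  • LLM-as-a-judge: Use a stronger model to grade the outputs of other models. Benchmarks like MT-Bench adopt this approach. It is cheaper and faster than human evaluation, but suffers from position bias (favoring an answer because of where it appears in the prompt), length bias (favoring longer answers), and self-preference (favoring the judge's own outputs).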
  • Human evaluation: The gold standard, but expensive and slow. Commonly used for alignment assessment, creative writing, and open-ended dialogue.
  • Red teaming: Proactively seek failure modes, including jailbreaks, prompt injection, bias, and dangerous content generation.
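
As one concrete instance of the automatic metrics above, pass@k can be estimated without bias from n sampled solutions per problem, of which c pass the unit tests, as 1 - C(n-c, k) / C(n, k). This is the estimator described in the HumanEval paper (chen2021-humaneval). The sketch below is a minimal illustration, not a full evaluation harness: the function names `pass_at_k` and `mean_pass_at_k` are chosen here for illustration, and running the generated code against tests is assumed to have happened already.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed all tests
    k: budget of attempts being scored
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that k samples drawn without replacement are all failures
    # is C(n-c, k) / C(n, k); pass@k is the complement.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must contain at least one problem")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical counts: three problems, 10 samples each.
    per_problem = [(10, 3), (10, 0), (10, 10)]
    print(f"pass@1 = {mean_pass_at_k(per_problem, 1):.3f}")
    print(f"pass@5 = {mean_pass_at_k(per_problem, 5):.3f}")
```

Sampling n larger than k and using the combinatorial estimate gives a lower-variance score than drawing exactly k samples and checking whether any of them passes.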

Common pitfalls include training data contamination (test sets appearing in pretraining data), prompt sensitivity (scores changing dramatically when the same question is rephrased), and over-optimizing for a single benchmark, which raises that score while distorting or degrading other capabilities.
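
A rough way to screen for training data contamination is to measure how many of a test item's word n-grams also appear in a sample of the training corpus. The sketch below is an assumption-laden illustration rather than an established protocol: whitespace tokenization, lowercasing, the default n-gram length of 8, and the helper names `ngrams` and `contamination_ratio` are all choices made here for the example. Real pipelines typically normalize text more carefully and index far larger corpora.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_ratio(test_item: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus.

    Returns 0.0 when the item is shorter than n tokens, because it has
    no n-grams to compare.
    """
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


if __name__ == "__main__":
    # Hypothetical data: one benchmark question and two corpus documents.
    question = "the quick brown fox jumps over the lazy dog"
    corpus = [
        "a story: The quick brown fox jumps over the lazy dog today",
        "an unrelated document about evaluation metrics",
    ]
    print(contamination_ratio(question, corpus, n=3))
```

A high ratio flags an item for manual review; a low ratio does not prove the item is clean, since paraphrased or translated copies would not share exact n-grams.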

At the research level, evaluation itself is a scientific question. HELM proposed a holistic evaluation framework emphasizing comprehensive coverage of scenarios, metrics, and goals. Yet a gap remains between benchmarks and real-world deployment: leading on benchmarks does not guarantee usefulness in products.

Open questions include how to evaluate emergent abilities, how to quantify the impact of decoding strategies on evaluation results, and how to design benchmarks that resist “gaming”. And when models approach or exceed human capabilities, who serves as the judge?

References

  • hendrycks2020-mmlu

    Covers 57 subjects with roughly 14K exam-style questions. Since its release, reporting MMLU scores has become the de facto standard for measuring general LLM capability. It was still a headline metric in model cards as of 2025; see also the later MMLU-Pro.

  • liang2022-helm

    Stanford CRFM systematically evaluates 30+ LLMs across multiple metric dimensions (accuracy, robustness, fairness, efficiency, and others), helping establish evaluation as a science in its own right. A representative work arguing against looking only at average scores.

  • chen2021-humaneval

    Introduces the Codex model and the HumanEval benchmark (164 programming problems). HumanEval remains a basic vital-sign check for coding models today; the Codex work described in this paper is also the origin of GitHub Copilot.

  • zheng2023-mtbench

    Proposes GPT-4 as a judge alongside crowdsourced human preferences (Chatbot Arena) for evaluating dialogue capability. MT-Bench and Arena Elo ratings remain the community's two de facto standards for comparing models' dialogue capability.